pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Fix for #165 - Ultrasound Pixel Deidentification #166

Closed wetzelj closed 3 years ago

wetzelj commented 3 years ago

Description

Related issues: #165

Modified pixel cleaning to accommodate various dimensions of stored pixels encountered when cleaning ultrasounds. Depending on the study, the pixel data array may be:

This commit includes a fix as well as tests for each of the above conditions.

Checklist

Open questions

The pixel data in the test images that I've uploaded with this PR have not been modified from the way they were sent from the modality. It was a struggle to find a RGB cine clip that was under the 100Mb file size limitation imposed by github. While the sample I provided was 99.6Mb (as reported by Win10), github detected this as >100Mb. To circumvent this, I added code to utils.py - get_files to detect and decompress files. It's not great, however, I'd like to open an issue to tackle test image storage as a future effort. It might be beneficial to compress everything for upload and decompress the images prior to use in the tests. With this proposed new issue, I believe that I have an opportunity to build out our test image data set with a larger variety of PHI-safe test images from a variety of makes/models of scanners. Our department has an annual employee art show in which we can image (XR, CT, US) objects to serve as the basis for works of art. Given that these images may not contain any body part and are tracked only to give back to the employee for use in creating their art work, they are not in any way tied to any patient or demographics. If you think this data set would be beneficial, I can start to work on the logistics and approvals for sharing this data. Currently the majority of the images are XR.

vsoch commented 3 years ago

With this proposed new issue, I believe that I have an opportunity to build out our test image data set with a larger variety of PHI-safe test images from a variety of makes/models of scanners. Our department has an annual employee art show in which we can image (XR, CT, US) objects to serve as the basis for works of art.

I really like this idea. And actually, we can fairly easy separate the small example data (e.g., dicom cookies) from those that are just used for testing, which arguably shouldn't even be stored in the repository because they are so large. I think what would make sense to do is follow the convention of pydicom-data, and even first ask if there is room to accept files for deid here to that library (so pydicom can have a single, shared data library). @darcymason would this be something we could consider? If we want to keep the libraries separate, and you'd want to make a new dataset not just for deid but for other dicom enthusiasts then we could also have it branded differently and then used by the library here. Either way, we could definitely have a function to decompress after install.

wetzelj commented 3 years ago

Hey @vsoch, would you object to merging this PR to fix the ultrasound issue in advance of the potential changes for the test images?

vsoch commented 3 years ago

Is there a rush? And just to clarify - the cine clips added are all under 10MB?

I think my preference would be to figure out how to provide the images externally (and retrieve on test) - and we could even do that simply with putting them in a separate repository, and then downloading. My worry is that I would merge, and then you wouldn't follow up on doing that.

wetzelj commented 3 years ago

The rush is just that we're trying to deidentify a specific ultrasound which caused us to find this error. Obviously I can deploy from my fork to get around this, but the bug in question is really an extension of PR #134 and I'd like to fix this for other users as well.

The files that I've added are all under 10Mb - the two cine clips are compressed, uploaded as .zip files (the uncompressed dcm files are not <10Mb). I introduced a change into utils.py to allow for test files to be uploaded as .zip and decompressed into the temp directory for testing. This was done simply to enable the bug fix to be implemented with tests. If you would prefer, I can simply remove the cine clip tests from this commit.

I'm more than happy to work on the changes for the test files, but before I go down that path I want to ensure that I completely understand how you would want this to be implemented. Once the test file repository is established I would be happy to modify the tests to utilize it, but ultimately don't feel that a bugfix should be held up by a refactoring of the unit tests.

Please understand, with almost every change I've requested and or corrected, I've tried to implement corresponding tests to improve the project as a whole - but at the same time, my first priority has to be to my end user requests and satisfying their requirements and requests.

vsoch commented 3 years ago

Please understand, with almost every change I've requested and or corrected, I've tried to implement corresponding tests to improve the project as a whole - but at the same time, my first priority has to be to my end user requests and satisfying their requirements and requests.

I totally understand, you're completely fantastic @wetzelj ! If the images were still 100MB I would have larger issue, but since they are closer to 10MB I'd be happy to merge with the current fixes, because it puts the library in a better place. Thanks for clarifying, stay tuned!

vsoch commented 3 years ago

https://pypi.org/project/deid/0.2.24/

And for the scenario I have in mind, I am thinking something akin to what pydicom does. It looks like they have a data manager class and then the exact urls.