pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License

Editing Dicom Preamble #72

Closed: Biggz1313 closed this 5 years ago

Biggz1313 commented 5 years ago

Hello, I'll preface this by saying that I've only been learning Python since March but have come a long way. I am using Pydicom and Deid to do some mass deidentification of DICOM files. When I check the files, all the fields I wanted changed in the header are changing, but the Media Storage SOP Class UID and Media Storage SOP Instance UID in the preamble are not. The SOP Class UID isn't that big of a deal because it's just an image type identifier, but more often than not the Media Storage SOP Instance UID is just a copy of the actual SOP Instance UID, which is PHI that needs to be removed. Is there a way to alter some code to get the Deid process to change fields in the preamble as well? Thank you in advance for any help or guidance you can provide.

vsoch commented 5 years ago

Hey @Biggz1313 I can definitely look into this! I'll see if the files that I have handy are enough to test, and ping you if I need some help finding more. I remember the preamble was a bit tricky to deal with, so likely I missed seeing an example with PHI in the header. Stay tuned!

Biggz1313 commented 5 years ago

Awesome! Thank you so much for the quick response! I see in the header.py file that there is some code about keeping preamble the same, but I get lost trying to trace back what all it's doing (it's around line 315). Ideally I'd like to replace the MediaStorageSOPInstanceUID with the same var that I'm replacing the SOPInstanceUID with.

vsoch commented 5 years ago

Are you able to show me a few examples of the preambles that you see (without the PHI, of course)? I want to get a sense for what winds up in there. It seems to be the case that the field is unstructured and not required for machines to read, so possibly we would do better to just set it to the equivalent of "empty", something like the bytes 00H?
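
For reference, here is a minimal sketch of what "setting the preamble to empty" could look like on the raw bytes of a file. It assumes the standard DICOM Part 10 layout (a 128-byte preamble followed by the 4-byte prefix `DICM`). The function name `blank_preamble` and the sample bytes are made up for illustration, and this is not what deid itself does.

```python
PREAMBLE_LEN = 128   # fixed size of the DICOM preamble
PREFIX = b"DICM"     # 4-byte marker that follows the preamble


def blank_preamble(data: bytes) -> bytes:
    """Return a copy of a DICOM Part 10 byte stream with a zeroed preamble."""
    if data[PREAMBLE_LEN:PREAMBLE_LEN + 4] != PREFIX:
        raise ValueError("no DICM prefix at offset 128; not a Part 10 file")
    # Everything from the prefix onward is left untouched.
    return b"\x00" * PREAMBLE_LEN + data[PREAMBLE_LEN:]


# Hypothetical input: a preamble holding arbitrary bytes, then the prefix.
sample = b"secret".ljust(PREAMBLE_LEN, b" ") + PREFIX + b"rest-of-file"
cleaned = blank_preamble(sample)
```

Only the first 128 bytes change, so the file length and everything after the prefix stay the same.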

vsoch commented 5 years ago

Lord, I didn't even write this that long ago and the code looks so messy! This always happens... haha.

Biggz1313 commented 5 years ago

Attached is a screenshot of a file that I ran through my script. I'm using Rubo Parser to parse the Dicom. I can see the (0002, xxxx) groups in this parser as well as if I use Siemens Syngo fastView to open the images, then view the dicom header.

And I know all about messy code, haha. Like I said, I'm a beginner, and my scripts work, but boy are they ugly haha.

[Screenshot: deid_example]

Biggz1313 commented 5 years ago

Just reread one of your replies, and yes, I would be happy if the field was just blanked. I was just going to replace it with the SOPInstanceUID because I think most VNAs out there will use that for storage purposes, so I figured if I could replace it with the value I was already creating that would be ideal. But in all honesty, if it's just blanked, I'll be happy.

vsoch commented 5 years ago

Okay, so if we look here at the creation of an "empty" preamble it's just setting it to an empty bytes string. But I want to clarify something - the preamble is that bytes string, but what you seem to be talking about is the file meta data (with tag starting with 0002) which is different from the preamble. Is this correct? For example, here is metadata:

(0002, 0000) File Meta Information Group Length  UL: 198
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: Secondary Capture Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: xxxxxxxxxxxxxxxxxxxxxxxx
(0002, 0010) Transfer Syntax UID                 UI: JPEG Baseline (Process 1)
(0002, 0012) Implementation Class UID            UI: 1.2.276.0.7230010.3.0.3.6.1
(0002, 0013) Implementation Version Name         SH: 'OFFIS_DCMTK_361'

and the preamble is literally just a bytes string:

 dcm.preamble
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

So you mean to say that the meta data needs some additional tweaking (and the preamble too?) Where in the above do you see the preamble?
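
To make the distinction concrete, here is a rough sketch that splits a byte stream into the preamble and the group 0002 file meta elements using only the standard library. It assumes the usual Part 10 layout (128-byte preamble, `DICM` prefix, then file meta encoded as Explicit VR Little Endian). The helper names `encode` and `split_file`, the made-up UID, and the simplified sample (which leaves out the group length element) are all just for illustration.

```python
import struct

# VRs that use 2 reserved bytes plus a 4-byte length field
LONG_VRS = {b"OB", b"OW", b"OF", b"SQ", b"UT", b"UN"}


def encode(group, elem, vr, value):
    """Encode one Explicit VR Little Endian element (value must be even-length)."""
    head = struct.pack("<HH", group, elem) + vr
    if vr in LONG_VRS:
        return head + b"\x00\x00" + struct.pack("<I", len(value)) + value
    return head + struct.pack("<H", len(value)) + value


def split_file(data):
    """Return (preamble, {tag: raw value}) for the group 0002 elements."""
    preamble, prefix = data[:128], data[128:132]
    if prefix != b"DICM":
        raise ValueError("missing DICM prefix")
    meta, pos = {}, 132
    while pos + 8 <= len(data):
        group, elem = struct.unpack_from("<HH", data, pos)
        if group != 0x0002:
            break  # file meta is over; the main dataset starts here
        vr = data[pos + 4:pos + 6]
        if vr in LONG_VRS:
            (length,) = struct.unpack_from("<I", data, pos + 8)
            start = pos + 12
        else:
            (length,) = struct.unpack_from("<H", data, pos + 6)
            start = pos + 8
        meta[(group, elem)] = data[start:start + length]
        pos = start + length
    return preamble, meta


uid = b"1.2.3.4\x00"  # made-up UID, null-padded to an even length
sample = (b"\x00" * 128 + b"DICM"
          + encode(0x0002, 0x0001, b"OB", b"\x00\x01")
          + encode(0x0002, 0x0003, b"UI", uid))
preamble, meta = split_file(sample)
```

Here `preamble` is the opaque 128-byte block, while `meta[(0x0002, 0x0003)]` holds the Media Storage SOP Instance UID, which is the value that can carry PHI.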

Biggz1313 commented 5 years ago

Sorry, yes, the meta data is what needs tweaking. I made an incorrect assumption that the preamble contained the meta data, but I just read something that cleared that up for me after you pointed it out. So really it's just the meta data that needs to be edited (specifically field (0002, 0003)) as far as I can tell now. Does that make it easier?
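
As a sketch of the "replace both with the same value" idea: pydicom exposes the group 0002 elements on `ds.file_meta`, so `MediaStorageSOPInstanceUID` can be set next to `SOPInstanceUID`. The function name `replace_sop_instance_uid` and the UIDs below are hypothetical, and a plain stand-in object is used so the snippet runs without pydicom installed. This is not necessarily how deid handles it.

```python
from types import SimpleNamespace


def replace_sop_instance_uid(ds, new_uid):
    """Write new_uid to both the dataset and its file meta group."""
    ds.SOPInstanceUID = new_uid
    # (0002,0003) lives in the file meta, not in the main dataset.
    if getattr(ds, "file_meta", None) is not None:
        ds.file_meta.MediaStorageSOPInstanceUID = new_uid
    return ds


# Stand-in object; with pydicom this would come from pydicom.dcmread(path).
ds = SimpleNamespace(
    SOPInstanceUID="1.2.3.4",
    file_meta=SimpleNamespace(MediaStorageSOPInstanceUID="1.2.3.4"),
)
replace_sop_instance_uid(ds, "2.25.1234567890")
```

Keeping the two values identical is the safer choice if a downstream archive compares them.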

vsoch commented 5 years ago

Yes definitely clear! It looks like the 0002 header fields are used for easier reading, see https://stackoverflow.com/questions/32689446/is-it-true-that-dicom-media-storage-sop-instance-uid-sop-instance-uid-why. What I can do is modify the code to blank it, and then give you a branch to test. We would want to see if any of your viewers have issues with opening the file, etc. Give me a few minutes and I'll put together something to test!

Biggz1313 commented 5 years ago

that would be fantastic, thank you!

vsoch commented 5 years ago

okay I'm done, but I'm taking a few more minutes to clean up the docstrings because it will drive me nuts if I don't :P

vsoch commented 5 years ago

@Biggz1313 see https://github.com/pydicom/deid/pull/73 I'll be afk for a bit, but back after you've had a chance to test. Notes for the PR are in the description. Hopefully it's a step in the right direction!

vsoch commented 5 years ago

Changes released with https://pypi.org/project/deid/0.1.19/