pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
142 stars 44 forks source link

DEID dropping UIDs #259

Open peter-kuzmak opened 9 months ago

peter-kuzmak commented 9 months ago

Hi Everyone,

I am brand new to DEID and very inexperienced with Python. I do have three decades of experience with DICOM though and have used commercial products to do deidentification in the past.

We would like to use DEID in the future.

My first program is from the Getting Started web tutorial. We I try it, the parser deletes the SOP Instance UID (0008,00118), Study Instance UID (0020,000D), Series Instance UID (0020,000E), Instance Number (0020,0013), Media Storage SOP Instance UID (0002,0003) and Reference SOP Instance UID (0008,1155). It did add Patient Identity Removed (0012,0062). I am using a different DICOM object than the ones provided in deid_data.

I have a minimal recipe:

FORMAT dicom

%header

KEEP SOPInstanceUID ADD PatientIdentityRemoved YES

The web example retains the same UIDs.

Any idea why the UIDs are been dropped?

Thanks!

Peter Kuzmak, Department of Veterans Affairs

vsoch commented 9 months ago

hey @peter-kuzmak ! I have limited bandwidth to help on this, but if you can share a dummy dicom file and the dicom.deid and the exact way to reproduce your use case, I can likely make some time over a weekend. Thanks!

wetzelj commented 9 months ago

Hi @peter-kuzmak,

While I have a good amount of experience with this library, I've never tried to follow the getting started guide. Based on the results your seeing, I suspect that your recipe file is getting applied in addition to the "stock" deid.dicom (https://github.com/pydicom/deid/blob/master/deid/data/deid.dicom). In the stock version, you'll see two lines that would be responsible for removing any UIDs from the header (REMOVE endswith:ID and REMOVE contains:UID).

In my application I utilize the replace_identifiers function for header manipulation - and specify the recipes in the parameters to this function.

Like @vsoch, I don't have a great deal of bandwidth but am happy to chime in with input. It's always great to see others using the library!

peter-kuzmak commented 9 months ago

Hi Venessa and James,

I am still at a loss. I have changed /deid/data/deid.dicom to comment out the REMOVE's. I still get the UIDs removed.

I am using replace_identifiers for header manipulation. I have attached the files.

Thanks for your help!

Peter kuzmak.zip

wetzelj commented 9 months ago

Hi @peter-kuzmak,

Looking at deid.dicom-kuzmak;

FORMAT dicom

%values cookie_names
SPLIT PatientID by="^";minlength=4

%values operator_names
FIELD startswith:Operator

%values patient_name
FIELD PatientName

%values patient_id
FIELD PatientID

%fields instance_fields
FIELD contains:Instance

%header

ADD PatientIdentityRemoved YES
REPLACE values:cookie_names var:id
REPLACE values:operator_names var:source_id
REMOVE fields:instance_fields
ADD SOPInstanceUID 1.2.3.4.5.6.7.8.9.10

Your issue seems to be the "instance_fields" fields list. By creating a fields list in this way you're creating a group of all UID fields (Every DICOM Field where the field name contains "Instance"). By using "contains:Instance" you're capturing StudyInstanceUID, SeriesInstanceUID, InstanceNumber, etc. Further down in your recipe, you have REMOVE fields:instance_fields which is going to then direct deid to remove the fields identified in the fields list.

If you remove the the following lines from your recipe, the UIDs should be retained:

%fields instance_fields
FIELD contains:Instance

REMOVE fields:instance_fields

Hopefully this helps. If not, we'll go from there! :)

wetzelj commented 9 months ago

You also may be overusing %values lists... depending on what you're trying to do.

%values operator_names
FIELD startswith:Operator

With this command, the operator_names values list will contain every tag that has a VALUE that is any field that starts with Operator. For example, let's say tag (0008,1070) Operator's Name contains a value of "SIMPSON". With a values list, all DICOM header tags will be scanned for a value of "SIMPSON". If this is contained within the tag's value, it's added to the values list. In other words, if the (0010,0010) Patient Name tag contains "SIMPSON^LISA", it would be identified by the values list.

With this example, your command REPLACE values:operator_names var:source_id would end up replacing both (0008,1070) and (0010,0010) with the value in source_id.

If you're really just trying to replace fields starting with Operator with the value in source_id, you could eliminate the values list all together and just go with single directive:

REPLACE startswith:Operator var:source_id

In my experience values lists are mostly beneficial for removals. My primary use of values lists centers around removing private tags that contain PHI. The way I use a values list is as follows:

%values known_phi
FIELD AccessionNumber
FIELD PatientID
FIELD PatientBirthDate
SPLIT PatientName by="^";minlength=4

The known_phi values list gets us a list of every tag that contains a value that is one of the identifiers in the list for the particular study.

This is followed with:

REMOVE  values:known_phi
peter-kuzmak commented 9 months ago

Hi James,

I found my problem.

I had this in the recipe:

REMOVE fields:instance_fields

I commented it out and now it works fine.

Thanks!

Peter

Peter M. Kuzmak IT Specialist, Software Engineering Department of Veterans Affairs Home Office (100% Telecommute) OI&T DevSecOps, SPM Health, Clinical Services, and Diagnostics VistA Imaging Project (443) 937-2275 (work) 1485 Keysville Road South (410) 925-9395 (cell) Keymar, MD 21757-8777 @. (VA email) http://www.va.gov/imaging/ @. (home email) Mon-Thu 7:00 am - 4:30pm (ET) Fri 7:00 am - 3:30 pm, off second Friday of pay period

In times like these, remember our country's motto: "In God We Trust". In Him is our comfort and strength

For what will it profit a man if he gains the whole world, and loses his own soul?

From: James Wetzel @.> Sent: Thursday, December 14, 2023 2:46 PM To: pydicom/deid @.> Cc: Kuzmak, Peter M @.>; Mention @.> Subject: [EXTERNAL] Re: [pydicom/deid] DEID dropping UIDs (Issue #259)

You also may be overusing %values lists... depending on what you're trying to do.

%values operator_names

FIELD startswith:Operator

With this command, the operator_names values list will contain every tag that has a VALUE that is any field that starts with Operator. For example, let's say tag (0008,1070) Operator's Name contains a value of "SIMPSON". With a values list, all DICOM header tags will be scanned for a value of "SIMPSON". If this is contained within the tag's value, it's added to the values list. In other words, if the (0010,0010) Patient Name tag contains "SIMPSON^LISA", it would be identified by the values list.

With this example, your command REPLACE values:operator_names var:source_id would end up replacing both (0008,1070) and (0010,0010) with the value in source_id.

If you're really just trying to replace fields starting with Operator with the value in source_id, you could eliminate the values list all together and just go with single directive:

REPLACE startswith:Operator var:source_id

In my experience values lists are mostly beneficial for removals. My primary use of values lists centers around removing private tags that contain PHI. The way I use a values list is as follows:

%values known_phi

FIELD AccessionNumber

FIELD PatientID

FIELD PatientBirthDate

SPLIT PatientName by="^";minlength=4

The known_phi values list gets us a list of every tag that contains a value that is one of the identifiers in the list for the particular study.

This is followed with:

REMOVE values:known_phi

- Reply to this email directly, view it on GitHubhttps://github.com/pydicom/deid/issues/259#issuecomment-1856478568, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AP5DU7CP4HFE7XEUZL3BF23YJNJPRAVCNFSM6AAAAABAR7RPGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJWGQ3TQNJWHA. You are receiving this because you were mentioned.Message ID: @.***>

peter-kuzmak commented 9 months ago

Hi James,

Thanks for your help! I have been using DEID for only two days!

I have been responsible for the DICOM development in the VA for three decades. We have ~20 billion DICOM objects.

I wrote a DICOM SDK in MUMPS in 1995/6. It has stood the test of time.

I attached two papers on de-identification that I wrote a couple of years ago. I used Laurel Bridge COMPAS for that.

I the future I hope to use DEID.

Thanks!

Peter

Peter M. Kuzmak IT Specialist, Software Engineering Department of Veterans Affairs Home Office (100% Telecommute) OI&T DevSecOps, SPM Health, Clinical Services, and Diagnostics VistA Imaging Project (443) 937-2275 (work) 1485 Keysville Road South (410) 925-9395 (cell) Keymar, MD 21757-8777 @. (VA email) http://www.va.gov/imaging/ @. (home email) Mon-Thu 7:00 am - 4:30pm (ET) Fri 7:00 am - 3:30 pm, off second Friday of pay period

In times like these, remember our country's motto: "In God We Trust". In Him is our comfort and strength

For what will it profit a man if he gains the whole world, and loses his own soul?

From: James Wetzel @.> Sent: Thursday, December 14, 2023 2:46 PM To: pydicom/deid @.> Cc: Kuzmak, Peter M @.>; Mention @.> Subject: [EXTERNAL] Re: [pydicom/deid] DEID dropping UIDs (Issue #259)

You also may be overusing %values lists... depending on what you're trying to do.

%values operator_names

FIELD startswith:Operator

With this command, the operator_names values list will contain every tag that has a VALUE that is any field that starts with Operator. For example, let's say tag (0008,1070) Operator's Name contains a value of "SIMPSON". With a values list, all DICOM header tags will be scanned for a value of "SIMPSON". If this is contained within the tag's value, it's added to the values list. In other words, if the (0010,0010) Patient Name tag contains "SIMPSON^LISA", it would be identified by the values list.

With this example, your command REPLACE values:operator_names var:source_id would end up replacing both (0008,1070) and (0010,0010) with the value in source_id.

If you're really just trying to replace fields starting with Operator with the value in source_id, you could eliminate the values list all together and just go with single directive:

REPLACE startswith:Operator var:source_id

In my experience values lists are mostly beneficial for removals. My primary use of values lists centers around removing private tags that contain PHI. The way I use a values list is as follows:

%values known_phi

FIELD AccessionNumber

FIELD PatientID

FIELD PatientBirthDate

SPLIT PatientName by="^";minlength=4

The known_phi values list gets us a list of every tag that contains a value that is one of the identifiers in the list for the particular study.

This is followed with:

REMOVE values:known_phi

- Reply to this email directly, view it on GitHubhttps://github.com/pydicom/deid/issues/259#issuecomment-1856478568, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AP5DU7CP4HFE7XEUZL3BF23YJNJPRAVCNFSM6AAAAABAR7RPGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJWGQ3TQNJWHA. You are receiving this because you were mentioned.Message ID: @.**@.>>

wetzelj commented 9 months ago

Glad to help! There's a lot of functionality in the library and takes some time to get used to it all.

peter-kuzmak commented 8 months ago

Hi James,

I resume my work on DEID this week after Christmas vacation and other commitments.

We have around 60k retinal studies with ~500k images that need to be de-identified. Some of the images have burned in PHI (study date) that needs to be removed.

I have been using parser.parse(), rather than replace_identifiers() because I need to get the Accession Number (0008,0050) for each image prior to de-identification.

It takes about a second to de-identify a DICOM retinal image.

I have three questions

  1. I have a rule that encrypts UIDs. It works on Study Instance UID (0020,000D). It does not work on Media Storage SOP Instance UID(0002,0003), although it is in the recipe.
  2. I can't figure out how to de-identify sequences. They seem to be passed untouched.
  3. How do you incorporate the pixel cleaning? I thought it would be done automatically as part of parser.parse().

Any answers to these questions and additional guidance would be very much appreciated.

Thanks!

Peter

Peter M. Kuzmak IT Specialist, Software Engineering Department of Veterans Affairs Home Office (100% Telecommute) OI&T DevSecOps, SPM Health, Clinical Services, and Diagnostics VistA Imaging Project (443) 937-2275 (work) 1485 Keysville Road South (410) 925-9395 (cell) Keymar, MD 21757-8777 @. (VA email) http://www.va.gov/imaging/ @. (home email) Mon-Thu 7:00 am - 4:30pm (ET) Fri 7:00 am - 3:30 pm, off second Friday of pay period

In times like these, remember our country's motto: "In God We Trust". In Him is our comfort and strength

For what will it profit a man if he gains the whole world, and loses his own soul?

From: James Wetzel @.> Sent: Thursday, December 14, 2023 3:25 PM To: pydicom/deid @.> Cc: Kuzmak, Peter M @.>; Mention @.> Subject: [EXTERNAL] Re: [pydicom/deid] DEID dropping UIDs (Issue #259)

Glad to help! There's a lot of functionality in the library and takes some time to get used to it all.

- Reply to this email directly, view it on GitHubhttps://github.com/pydicom/deid/issues/259#issuecomment-1856531726, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AP5DU7EFAXRHPRP73GVISETYJNODTAVCNFSM6AAAAABAR7RPGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJWGUZTCNZSGY. You are receiving this because you were mentioned.Message ID: @.**@.>>

wetzelj commented 8 months ago

Hi Peter,

Happy New Year! Regarding your questions:

  1. I have a rule that encrypts UIDs. It works on Study Instance UID (0020,000D). It does not work on Media Storage SOP Instance UID(0002,0003), although it is in the recipe.

Can you double check your recipe on this? While I don't have personal experience targeting this field, it was a later addition to the library to enable targeting of meta fields and to that end, there is a specific unit test which covers replacing MediaStorageSOPInstanceUID. Do you have a specific recipe rule that targets MediaStorageSOPInstanceUID by name or are you trying to rely on an expander? Your issue makes me wonder if perhaps the expanders aren't including the meta fields.

  1. I can't figure out how to de-identify sequences. They seem to be passed untouched.

Again, this isn't an area that I've explored in great detail. Typically in my projects we either remove sequences entirely or find that using a values list and then performing a REMOVE based on the values list clears the desired information from the sequence. Can you provide samples of your recipe and the headers?

  1. How do you incorporate the pixel cleaning? I thought it would be done automatically as part of parser.parse().

The pattern that we follow is to first create a DicomCleaner object and then run the detect method on each file being deidentified. The DicomCleaner.detect() will provide an output result which determines which filters apply to the file (blacklist or graylist). If a blacklist filter was detected, you can simply throw away the file. If a greylist filter was detected, then you need to call DicomCleaner.clean() to actually perform the pixel manipulation followed by DicomCleaner.save_dicom() to save the result file.

Only after the above is completed do we call replace_identifiers(). This enables us to ensure that we're only performing replace_identifiers if file actually needs to be included in the result set. We're not processing the identifier replacement on blacklisted files.

There might be better ways to do this now with Parser.parse. The process that I've outlined above is some of our earliest code using deid... honestly, I think it was written before parse was added to the library.

@vsoch - Do you have an thoughts on @peter-kuzmak's questions?

vsoch commented 8 months ago

We have around 60k retinal studies

I would want to get the machine information and double check we have it included.

@wetzelj is spot on for his answers. My suggestion would be to get us a dummy image that reproduces your issue so we can test it and either help to demonstrate a solution or put in a fix.