pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Add support to read and manipulate file meta #184

Closed vsoch closed 2 years ago

vsoch commented 3 years ago

This is the start of work to read/parse file_meta to address #183. I made some time this evening and I think the implementation was easier than I anticipated, but I have a few questions / concerns to address!

Remaining questions:

Do file meta fields overlap with fields in the dicom?

The simplest implementation I could do is to look for another section in the recipe, filemeta, that gets parsed just like the header section. I simply create DicomField with is_filemeta = True. However, I realized that if the identifiers you find in file_meta are unique to that (and vice versa with the dicom) then we don't need to check this value - we can simply replace/add/ etc based on the unique identifier just as we would. So this is how I implemented it to start. However, if it's the case that the same Field can appear in file meta OR the regular dicom, then we'll need to add flag actions that are just for filemeta, and add checks all over the place that essentially say:

# Process if it's an action for file meta, and this is a file meta field
if is_filemeta_action and field.is_filemeta:
   # process

# Process if it's an action not for filemeta, and this is a regular header field
elif not is_filemeta and not field.is_filemeta:
   # process

# otherwise not

TLDR: If fields are not unique to the sections we need to be checking each field if it is a filemeta or not before any action is taken. As implemented now, the fields are parsing over equivalently and it is assumed that file meta fields do not appear in the dicom and vice versa.

Do we need to make any special changes to the file meta, e.g., perhaps the size?

I have never manipulated it before, and I am hoping that pydicom save handles these changes, but if not we should do it manually. We can check with some of the other maintainers of the pydicom org that know pydicom well if we aren't sure here, and after we've addressed the point above.

cc @wetzelj and @adildahlan for your feedback!

Signed-off-by: vsoch vsoch@users.noreply.github.com

Description

Related issues: # (issue)

Please include a summary of the change(s) and if relevant, any related issues above.

Checklist

Open questions

See above!

adildahlan commented 3 years ago

Hi @vsoch,

Many thanks for working on this very quickly!

I like the idea of having a section within the deid.dicom recipe file. Just like we have %Header section (https://pydicom.github.io/deid/examples/recipe/), maybe we could have something like %File_meta section for the file meta. My understanding is that the attributes (data elements) within the dicom file meta are different to those of the dicom file. Here is a link to the official page of DICOM files where all the attributes within the file meta are listed (of course not all dicom files meta have all these attributes, but it could be). NB: All the dicom file meta have a tag starting with 0002.

My intuition would be that we wouldn't require any changes to be done to the file meta unless the user has that as part of the deid.dicom recipe. I am not sure TBH about that...

Many thanks again!

Regards

Adil

vsoch commented 3 years ago

%File_meta section for the file meta.

Yes that's correct! The PR here adds %filemeta (take a look at the example https://github.com/pydicom/deid/blob/39d8e019c1d3c85047a2c5a20b514d141d6e8535/examples/dicom/header-manipulation/file-meta/deid.dicom)

My understanding is that the attributes (data elements) within the dicom file meta are different to those of the dicom file. Here is a link to the official page of DICOM files where all the attributes within the file meta are listed (of course not all dicom files meta have all these attributes, but it could be). NB: All the dicom file meta have a tag starting with 0002.

If this is the case, then my strategy was right on! Do you want to test the branch here?

The last remaining bit is if we need to do something special for save. Let's wait for @wetzelj feedback then ping some of the other maintainers to ask.

wetzelj commented 3 years ago

I also like the idea of having a separate %file_meta section. We would, however, probably want to ensure that standard %header actions don't overlap and act against filemeta attributes (thinking of expanders and %values based rules).

I don't think we would want a recipe set up as below to remove the MediaStorageSOPInstanceUID that it just replaced:

%filemeta
REPLACE MediaStorageSOPInstanceUID new-id

%header
REMOVE endswith:UID 

Instead, if we did want to remove fields ending with UID from both sections of the file, we'd want to build a recipe like:

%filemeta
REMOVE endswith:UID

%header
REMOVE endswith:UID 
adildahlan commented 3 years ago

Yes that's correct! The PR here adds %filemeta (take a look at the example https://github.com/pydicom/deid/blob/39d8e019c1d3c85047a2c5a20b514d141d6e8535/examples/dicom/header-manipulation/file-meta/deid.dicom)

Oh yes that's perfect!

adildahlan commented 3 years ago

%filemeta REPLACE MediaStorageSOPInstanceUID new-id

%header REMOVE endswith:UID


Instead, if we did want to remove fields ending with UID from both sections of the file, we'd want to build a recipe like:

%filemeta REMOVE endswith:UID

%header REMOVE endswith:UID

I am wondering why do we need to have REMOVE endswith:UID as part of the recipe?

wetzelj commented 3 years ago

That was just a sample to indicate the way we would/would not want the actions to be applied - that wasn't anything we'd really want to do. :)

vsoch commented 3 years ago

That’s a good point - but @wetzelj to go in the other direction if fields are unique why even have the second section? A rule to replace all ending with UID should hit those in file meta, why should they be special? But I do think we perhaps want to protect some of the fields like the size and transfer syntax from change?

adildahlan commented 3 years ago

That was just a sample to indicate the way we would/would not want the actions to be applied - that wasn't anything we'd really want to do. :)

Oh I see, that makes sense

adildahlan commented 3 years ago

That’s a good point - but @wetzelj to go in the other direction if fields are unique why even have the second section? A rule to replace all ending with UID should hit those in file meta, why should they be special? But I do think we perhaps want to protect some of the fields like the size and transfer syntax from change?

My initial intuitive understanding was that because the data elements part of the file_meta are inherently separate for the dicom file data elements, it would probably make sense to have a seperate section for them within the recipe (to access the dicom data elements it is enough to do ds.DATAELEMENT whereas to access file_meta data elements, we would need to do ds.file_meta.DATAELEMENT) I don't have a strong opinion on this to be honest.

wetzelj commented 3 years ago

I guess you're right. It probably doesn't make much sense to split it out. My statement was coming from my line of thinking that filemeta fields were special, and should be somewhat protected from "bad"/overly wide recipes... but really, that's not the case.

Honestly, I'm with @adildahlan and don't have a strong opinion on this either.

vsoch commented 3 years ago

(to access the dicom data elements it is enough to do ds.DATAELEMENT whereas to access file_meta data elements, we would need to do ds.file_meta.DATAELEMENT) I don't have a strong opinion on this to be honest.

We can actually handle this easily because the dicom.file_meta is just another kind of dataset, so we add it as a dataset to parse (and save fields for). Since the fields we add to our lookup data structure are the same ones that point back into the dicom and they are accessed easily, we can actually just parse them akin to any other header.

Okay so here is my thoughts for the recipe:

  1. I think file_meta fields should just work, and not have their own section. It makes sense only up until we consider statements like REPLACE endswith:uid and in this case I think we either have to only parse one of the header or file_meta (and add a lot of custom logic that is going to slow things down) OR allow the parsing to be seamless, and to treat the file meta fields just like the header fields. I actually think it's more risky that people miss file meta fields with unique ids that maybe should be replaced because they have to specify it twice. So I want to remove the filemeta section.
  2. Given the point above and looking here we need to protect some fields in the file_meta. I think we'd want to protect:
    FileMetaInformationGroupLength
    FileMetaInformationVersion
    TransferSyntaxUID
    ImplementationClassUID

So those are my thoughts for moving forward - remove the special section and treat the fields like any other, and then protect ones that shouldn't be changed. Let me know what you think!

adildahlan commented 3 years ago

So those are my thoughts for moving forward - remove the special section and treat the fields like any other, and then protect ones that shouldn't be changed. Let me know what you think!

Sounds perfect!

wetzelj commented 3 years ago

Agreed. This solution will be great!

vsoch commented 3 years ago

okay fantastic! I'll get in a PR after the end of the work day. Thanks for the quick, fun discussion y'all!

vsoch commented 3 years ago

Okay @wetzelj @adildahlan this should be ready for review! Let me know if there are tests you think we should add, or if your use case is not addressed.

adildahlan commented 2 years ago

Hi @vsoch ! Many thanks for working on this. So I created a new conda environment and installed the current branch of the died library using the following code: pip install git+https://github.com/pydicom/deid.git@add/support-file-meta

And I have my recipe as follows:


%header

REPLACE PatientName var:new_PatientName
REPLACE PatientID var:new_PatientID
ADD SOPInstanceUID var:new_SOPInstanceUID
ADD MediaStorageSOPInstanceUID var:new_MediaStorageSOPInstanceUID

And in my code I define each of these new variables as done in. this example code https://github.com/pydicom/deid/blob/master/examples/dicom/recipe/deid-dicom-example.py

However, I am getting this error image

I hope you could help me figure out what's wrong

Many thanks!

Adil

vsoch commented 2 years ago

Oh interesting, looks like an issue with save. Can you share with me a dummy example to reproduce the issue? Are you just using the provided cat.dcm but updated the deid recipe?

adildahlan commented 2 years ago

Oh interesting, looks like an issue with save. Can you share with me a dummy example to reproduce the issue? Are you just using the provided cat.dcm but updated the deid recipe?

Hi @vsoch !

Oh I see. I am using the data I downloaded from the following link.

I would need to check if I am allowed to share the code as it is a private project I am collaborating on. Can you please try running your script with the dataset in the link above? If you don't get the same error maybe then I can check with my supervisor if I can share bits of the code with you. Apologies about that...

Many thanks!

vsoch commented 2 years ago

it's okay I was able to reproduce on save:

In [3]: dicom.save_as('test.dcm')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Desktop/Code/deid/examples/dicom/header-manipulation/filemeta/example.py in <module>
----> 1 dicom.save_as('test.dcm')

~/Desktop/Code/deid/env/lib/python3.8/site-packages/pydicom/dataset.py in save_as(self, filename, write_like_original)
   1885             Write a DICOM file from a :class:`FileDataset` instance.
   1886         """
-> 1887         pydicom.dcmwrite(filename, self, write_like_original)
   1888 
   1889     def ensure_file_meta(self) -> None:

~/Desktop/Code/deid/env/lib/python3.8/site-packages/pydicom/filewriter.py in dcmwrite(filename, dataset, write_like_original)
    969     #   place
    970     if dataset.group_dataset(0x0002) != Dataset():
--> 971         raise ValueError(
    972             f"File Meta Information Group Elements (0002,eeee) should be in "
    973             f"their own Dataset object in the "

ValueError: File Meta Information Group Elements (0002,eeee) should be in their own Dataset object in the 'FileDataset.file_meta' attribute.

Let me take a look, I might need to ask for help from the other maintainers of pydicom on this one!

adildahlan commented 2 years ago

it's okay I was able to reproduce on save:

In [3]: dicom.save_as('test.dcm')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Desktop/Code/deid/examples/dicom/header-manipulation/filemeta/example.py in <module>
----> 1 dicom.save_as('test.dcm')

~/Desktop/Code/deid/env/lib/python3.8/site-packages/pydicom/dataset.py in save_as(self, filename, write_like_original)
   1885             Write a DICOM file from a :class:`FileDataset` instance.
   1886         """
-> 1887         pydicom.dcmwrite(filename, self, write_like_original)
   1888 
   1889     def ensure_file_meta(self) -> None:

~/Desktop/Code/deid/env/lib/python3.8/site-packages/pydicom/filewriter.py in dcmwrite(filename, dataset, write_like_original)
    969     #   place
    970     if dataset.group_dataset(0x0002) != Dataset():
--> 971         raise ValueError(
    972             f"File Meta Information Group Elements (0002,eeee) should be in "
    973             f"their own Dataset object in the "

ValueError: File Meta Information Group Elements (0002,eeee) should be in their own Dataset object in the 'FileDataset.file_meta' attribute.

Let me take a look, I might need to ask for help from the other maintainers of pydicom on this one!

Oh perfect! Many thanks!

Temperarly, I found a way to go around the issue with anonymising the MediaStorageSOPInstanceUID. I did so by running the anonymisation script without having in the recipe the MediaStorageSOPInstanceUID in it. Then afterwards I loop through all the anonymised dicom files and if the MediaStorageSOPInstanceUID dataelement doesn't match the SOPInstanceUID then I change the former to the later and save the dicom file as shown below.

if 'var:new_SOPInstanceUID' in new_variables:
        # Read in the new anonymised dicom files
        dicom_files_path = list(get_files(output_directory, pattern='*.dcm'))
        # Dictionary of dicom directories and header dictionary
        dicoms_dict = get_identifiers(dicom_files_path)
        # Loop through the new anonymised dicom files
        for i, (dicom_path, dicom_header) in enumerate(dicoms_dict.items()):
            # Read the anonymised dicom file
            ds = pydicom.dcmread(dicom_path)
            # Check if the MediaStorageSOPInstanceUID wasn't anonymised
            if ds.file_meta.MediaStorageSOPInstanceUID != ds.SOPInstanceUID:
                # Anonymise the MediaStorageSOPInstanceUID
                ds.file_meta.MediaStorageSOPInstanceUID = ds.SOPInstanceUID
                # Add the new MediaStorageSOPInstanceUID to the MediaStorageSOPInstanceUID_old_new_pair
                MediaStorageSOPInstanceUID_old_new_pair[ds.file_meta.MediaStorageSOPInstanceUID] = ds.SOPInstanceUID
                # Save the anonymised dicom file
                ds.save_as(dicom_path)

This worked so I guess saving the dicoms file after changing the MediaStorageSOPInstanceUID works but probably not with having MediaStorageSOPInstanceUID in the recipe.

Many thanks for all your help!

vsoch commented 2 years ago

hey @adildahlan so I figured it out! What was happening is that for any action that uses the add functionality, the assumption was adding to the dicom directly as a field. Of course with file meta this is problematic because we need to save to the filemeta instead. So the fix was to restore preserving the boolean if something was file meta, and then taking a different action in the parser depending on if it was file meta or not. Take a look at the changes here: https://github.com/pydicom/deid/pull/184/commits/a7f3f2012a16bf193e85a0e8207fe63f5ce03fc0 and when tests pass give the updated branch a try.

adildahlan commented 2 years ago

hey @adildahlan so I figured it out! What was happening is that for any action that uses the add functionality, the assumption was adding to the dicom directly as a field. Of course with file meta this is problematic because we need to save to the filemeta instead. So the fix was to restore preserving the boolean if something was file meta, and then taking a different action in the parser depending on if it was file meta or not. Take a look at the changes here: a7f3f20 and when tests pass give the updated branch a try.

Hi @vsoch, it is working!!! Many thanks for having worked on this issue very quickly. You have done a great job at building this library and maintaining it 👏🏼! Wish you all the best! Warm regards Adil

vsoch commented 2 years ago

Woot! Let's merge and release then.

vsoch commented 2 years ago

https://pypi.org/project/deid/0.2.26/