pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Refactor of deid to add parser #127

Closed vsoch closed 4 years ago

vsoch commented 4 years ago

Description

This is a large refactor of deid to address several issues:

Related issues:

Open questions

I'd like (when tests pass) for @wetzelj to be able to take a look and see if this can be a good fit for his use case. Specifically, it should be much easier to load a recipe, define variables, functions, and custom lists for it, and then run over a dicom file.

vsoch commented 4 years ago

@howardpchen you might be interested in this work as well - I'd like to address the issues that you were having earlier, if you have time to comment / take a look!

wetzelj commented 4 years ago

In trying to respond to your comment on #125, I went down a bit of a rabbit hole over the past week. My goal was to reply to you on Friday, but that didn't happen and as a result, you beat me to the punch! :)

I was trying to perform testing on #125, but was hung up on which of my use-cases should/shouldn't function. I ended up reverting back to version 0.1.41 and did some playing/prototyping in a branch in my fork. This was mostly for me to collect thoughts and obtain a better understanding of the code - but ultimately I think that some of the tests may be able to be ported over to the main project.

Looking through your comments: Yes! In my prototype, I resorted to using iterall() as well - and came across the point that not all tags have keywords. In these cases, I chose to use the tag number (converted to a string) as the identifier for the tag. By converting the tag number string "(0008, 0010)" to simply the numbers "00080010", it could be used when creating a Tag object.

I've started applying my tests into this pull request - and so far it's not going all that smoothly. Would it be helpul for me to commit the tests - in whatever form they're in to the refactor/iterall branch? All of the tests are functional against my prototype branch. Against the refactor/iterall branch, there seems to be a mix of issues with the tests and issues with the changes (although that statement is made with only a cursory review - I have not dug deeply into the errors/failures). Some of the errors seem to be surrounding defining actions against private tags (ADD 11112221 StaticValue, REPLACE 00190010 123456789) and others are with how get_identifiers now returns the keys in the dictionary (in the test methods, get_file() returns a path as 'c:\gitrepositories\deid\deid/data/humans\ctbrain1.dcm'. With the change to get_identifiers, the key of the resulting dictionary has the path as 'c:\gitrepositories\deid\deid\data\humans\ctbrain1.dcm'). I was able to get around this path issue by modifying get_file to use os.path.normpath.

All of the tests are included in the test_replace_identifiers.py file.

Unfortunately, over the next couple weeks my responses will be delayed. I am being pulled into some COVID-19 projects which are going to force this off to the side for a while.

vsoch commented 4 years ago

I'm just glad you are ok! I was a bit worried to be honest, I think what would make sense is for me to try and reproduce your tests here with the new identifiers setup. We wound up doing very similar implementation, but I honor the parens and spaces for the tags index. Would that be ok with you?

vsoch commented 4 years ago

@wetzelj for this test here you are choosing an arbitrary value: https://github.com/wetzelj/deid/blob/prototype/refactor/deid/tests/test_replace_identifiers.py#L94 And either this needs to be something defined in the DicomDictionary, or if it's a private tag, we would need a type. Could you explain this particular case?

wetzelj commented 4 years ago

Yeah, this is a worrying time. Fortunately, I'm okay and have just been able to hide away in my house. Hopefully the same goes for you!

I don't see any issues keeping the parens and spaces.

Regarding the test with the arbitrary tag - yes, this is arbitrary - it's a private tag (indicated by it being a negative tag group). In my prototype, for the time being, I was thinking that absent changing the recipe definition to include a type, I would default to LO (long string) tags.py. This issue really expands itself to public tags too. What should be the behavior if a user creates a REPLACE rule for a tag that's defined as a DS (Decimal String) but tried to pass in a value of "TEXTVAL"?

vsoch commented 4 years ago

Regarding the test with the arbitrary tag - yes, this is arbitrary - it's a private tag (indicated by it being a negative tag group). In my prototype, for the time being, I was thinking that absent changing the recipe definition to include a type, I would default to LO (long string) tags.py.

Yes! We could do this - right now I default to undefined "UN" but you think LO would be better? I can update this.

This issue really expands itself to public tags too. What should be the behavior if a user creates a REPLACE rule for a tag that's defined as a DS (Decimal String) but tried to pass in a value of "TEXTVAL"?

We would need to create a test case, but generally if we find the tag in the DicomDictionary and it's of type DS and then we try to update the value to be a type not supported, it's going to throw an error. Wouldn't we want that error to be thrown, because it's not a valid replace? In the case of numbers that are read as strings, we would need to convert to int() or float(). What do you think? If you like, let me finish with your current tests (been working on since my last post) and then we can add more tests from there.

vsoch commented 4 years ago

@wetzelj what is the correct format string for 20230102011721.621000? Based on the data type the library is using '%Y%m%d%H%M%S.%f%z' but this doesn't seem to work, e.g.,

from datetime import datetime

format = '%Y%m%d%H%M%S.%f%z'
$ datetime.strptime(item_date, format)                                                                               
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Desktop/Code/deid/deid/dicom/actions.py in <module>
----> 1 datetime.strptime(item_date, format)

~/anaconda3/lib/python3.7/_strptime.py in _strptime_datetime(cls, data_string, format)
    575     """Return a class cls instance based on the input string and the
    576     format string."""
--> 577     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    578     tzname, gmtoff = tt[-2:]
    579     args = tt[:6] + (fraction,)

~/anaconda3/lib/python3.7/_strptime.py in _strptime(data_string, format)
    357     if not found:
    358         raise ValueError("time data %r does not match format %r" %
--> 359                          (data_string, format))
    360     if len(data_string) != found.end():
    361         raise ValueError("unconverted data remains: %s" %

ValueError: time data '20230101011721.621000' does not match format '%Y%m%d%H%M%S.%f%z'
vsoch commented 4 years ago

If I get rid of the %z it seems to work however.

vsoch commented 4 years ago

okay I made good progress today - still about 3 tests to debug but I should get through them soon. Thanks for your help today!

howardpchen commented 4 years ago

Thank you for keeping me updated, @vanessasaurus I'm familiar with work from @wetzelj think it'll be quite helpful.

Life is moving full steam into COVID-19 mode - stay safe and wash hands!

On Mon, Mar 30, 2020 at 12:24 PM Vanessasaurus notifications@github.com wrote:

@howardpchen https://github.com/howardpchen you might be interested in this work as well - I'd like to address the issues that you were having earlier, if you have time to comment / take a look!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydicom/deid/pull/127#issuecomment-606101650, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB2KRVQG4ECX3OBNWDIDSDDRKDBUPANCNFSM4LWFETOA .

vsoch commented 4 years ago

hey @wetzelj I refactored the tests (now added here) to include (almost all) of the ones that you were setting up - some are refactored to use the DicomParser directly (instead of it's wrapper, which is now replace_identifiers) so that we can explicitly check lists of things generated under the lookup, etc. and the few that I couldn't add, I didn't understand what you were trying to test. Do you want to take a look?

After seeing that you can index a tag by it's [group][element] without any spaces, parens, or the comma, I'm thinking of tweaking the uids that I generate to follow that convention as well. But I won't touch to make any further changes until you've had a chance to take a look.

vsoch commented 4 years ago

hey @wetzelj I know you must be swamped - just want to keep this on your radar! It's really a good change I think. Hope you are doing ok :)

wetzelj commented 4 years ago

Yep, busy but healthy! :) I don't think I could ask for a better situation. I hope you're doing well too. This is definitely still on my to-do list and will be one of the first items I circle back to once things settle down. Have a great weekend!

wetzelj commented 4 years ago

Hi @vsoch! I finally spent some time looking at this this morning. While I can't say that things are back to normal - I was at least able to review this a bit (this project is still on the shelf for me for a few more weeks). Hopefully the past couple of weeks have been good for you and you're still healthy!

In general, the newly-added tests look good, there is one that failed for me when I pulled the latest code: test_add_tag_variable

right now I default to undefined "UN" but you think LO would be better? I can update this.

Defaulting to UN is perfectly fine with me. Honestly, I'm by no means an expert in the DICOM spec - and had overlooked UN... Knowing about it now, I agree that it's a better default than LO.

what is the correct format string for 20230102011721.621000? Based on the data type the library is using '%Y%m%d%H%M%S.%f%z' but this doesn't seem to work, e.g.,

The DICOM spec allows for the DT datatype to optionally specify the %z portion of the timestamp. As a result, we really should handle both. If the length of the DT value is 26 then use format string '%Y%m%d%H%M%S.%f%z'. If the length of the DT value is 21, use format string '%Y%m%d%H%M%S.%f'.

I refactored the tests (now added here) to include (almost all) of the ones that you were setting up... and the few that I couldn't add, I didn't understand what you were trying to test.

The only one of potential consequence was "test_tag_expanders_midtag". My thought on this test was that when setting up rules using the tag/group numbers and expanders, we would want to ensure that we didn't allow for tag/group numbers to be identified "midtag". The more I think about this, it doesn't matter to me. However this is handled (or not) will be fine. Following is what I was trying to cover with this test. For example:
REMOVE contains:8001

This rule should remove tags: (8001, 0001) (8001, 0002) (2666, 8001) (2551, 8001) This rule should not remove: (0180, 0100)

I think this is going to put us WELL on our way to covering my use case. I suspect that there is a bit more work in the area of sequences (but that's outside the scope of this pull request!).

Thank you again for all of your work on this - and bearing with me while I'm pulled away from this project.

vsoch commented 4 years ago

Hi @vsoch! I finally spent some time looking at this this morning. While I can't say that things are back to normal - I was at least able to review this a bit (this project is still on the shelf for me for a few more weeks). Hopefully the past couple of weeks have been good for you and you're still healthy!

Yes, still relatively okay over here! :)

In general, the newly-added tests look good, there is one that failed for me when I pulled the latest code: test_add_tag_variable

Hmm, could you give me a way to reproduce? I just ran the test locally and all passed,

python -m unittest discover -s deid/tests/ -p '[t|T]est*.py'

And they seem to pass in the CI here. I can help to debug this, but only if I'm able to reproduce the error.

I debugged it to see where the error was occurring - but haven't looked into a fix. The "ids" passed into replace_identifiers aren't being added to parser.lookup. In my environment, on header.py - line 143, parser.dicom = 'c:\gitrepositories\deid\deid\data\humans\ctbrain1.dcm' but the key in ids is 'c:\gitrepositories\deid\deid/data/humans\ctbrain1.dcm'.

Ahh this looks like a path error! Either it's somewhere in the deid code, or possibly saved in the dicom metadata? Let's try this:

from deid.dicom.parser import DicomParser
parser = DicomParser(dicom) # Can be filename OR the loaded dicom, maybe we should try both

I suspect this is a Windows issue, so it will be about getting down to when the erroneous path is being generated. That would explain why it works on Linux and not Windows... maybe we can blame windows for it? :)

Defaulting to UN is perfectly fine with me. Honestly, I'm by no means an expert in the DICOM spec - and had overlooked UN... Knowing about it now, I agree that it's a better default than LO.

okay sounds good.

what is the correct format string for 20230102011721.621000? Based on the data type the library is using '%Y%m%d%H%M%S.%f%z' but this doesn't seem to work, e.g.,

Okay, I think we are correctly trying both strings for the timestamp format - instead of checking the length as you suggest, I just use regular expressions. I think this should work too.

        elif dcmvr == "DT":
            # NEMA-compliant format for DICOM timestamp is
            # YYYYMMDDHHMMSS.FFFFFF&ZZXX
            try:
                new_value = get_timestamp(
                    original, jitter_days=value, format="%Y%m%d%H%M%S.%f%z"
                )
            except:
                new_value = get_timestamp(
                    original, jitter_days=value, format="%Y%m%d%H%M%S.%f"
                )

The only one of potential consequence was "test_tag_expanders_midtag". My thought on this test was that when setting up rules using the tag/group numbers and expanders, we would want to ensure that we didn't allow for tag/group numbers to be identified "midtag". The more I think about this, it doesn't matter to me. However this is handled (or not) will be fine. Following is what I was trying to cover with this test.

I understand this perfectly now, and actually I think the opposite - if someone wants to specify a pattern that matches part of a group and the element, why would that be bad? The way they are represented in the dicom is without a comma/space in between. I say that we should allow this for now (and I'll add a test) and address in the future if someone has a compelling use case for why it's a terrible idea.

I think this is going to put us WELL on our way to covering my use case. I suspect that there is a bit more work in the area of sequences (but that's outside the scope of this pull request!). Thank you again for all of your work on this - and bearing with me while I'm pulled away from this project.

Definitely! I'll push changes soon.

vsoch commented 4 years ago

Ah! I actually just found the path bug - it's in the get_dataset function. Let me fix that up.

vsoch commented 4 years ago

@wetzelj see https://github.com/pydicom/deid/pull/127/commits/3c1e4a3e8fb3aef0ff6a1092b8d7d4a73042df49 - I think that should fix the issue with running the test on Windows! Sorry for the oversight, I never use / think about it these days (and I really should). I'm surprised we hadn't seen it yet, actually.

wetzelj commented 4 years ago

That took care of the path issue and all of the tests now pass for me.

vsoch commented 4 years ago

Woot! So... are we ready for merge?

wetzelj commented 4 years ago

While I didn't have the opportunity to review the changes to the same level that I had the past merges, with the addition of the passing test cases, I'd say let's go for it.

vsoch commented 4 years ago

Awesome!