Discussion for tag list groups

vsoch commented 4 years ago

This is discussion continued from https://github.com/pydicom/deid/pull/112#issuecomment-595326430.

I think the way you've described this new section is confusing

I agree... when I came back into the office this morning and re-read, I didn't like it either. Really, I think I merged two purposes (scanning the tag values and identifying the targeted tag identifiers). For the sake of clarity, I'm going to try again with an example. Which I hope will also illustrate where the additional section could be beneficial. At the moment, I'm also specifically ignoring a potential wholesale refactor of the replacement. I need to do more studying of the current process to make informed comments... I'll continue to think on that point.

Sample Header

(0008,0050) : SH   Len: 10     AccessionNumber                Value: [999999999 ]
(0008,0070) : LO   Len: 8      Manufacturer                   Value: [SIEMENS ]
(0008,1090) : LO   Len: 22     ManufacturersModelName         Value: [SOMATOM Definition AS+]
(0009,0010) : LO   Len: 20     PrivateCreator10xx             Value: [SIEMENS CT VA1 DUMMY]
(0010,0010) : PN   Len: 14     PatientsName                   Value: [SIMPSON^HOMER^J^]
(0010,0020) : LO   Len: 12     PatientID                      Value: [000991991991 ]
(0010,1000) : LO   Len: 8      OtherPatientIDs                Value: [E123456]
(0010,1001) : PN   Len: 8      OtherPatientNames              Value: [E123456]
(0010,21B0) : LT   Len: 90     AdditionalPatientHistory       Value: [MR SIMPSON LIKES DUFF BEER]
(0019,1091) : DS   Len: 6      <Unknown Tag>                  Value: [E123456]
(0019,1092) : DS   Len: 6      <Unknown Tag>                  Value: [M123456]

For the sample project, the goal would be to remove patient identifiers, but there's also a need to replace any references to the scanner's vendor, make, and model. In the prior suggestion, I discussed a new section called %scandefinition. I think a more appropriate name for this section would be %tagvaluelist.

Proposed Recipe (new sections)

%tagvaluelist patient_info
SPLIT PatientsName splitval='^';minlength='4'
FIELD PatientID
FIELD OtherPatientIDs
REGEX (M\d+)

%tagvaluelist scanner_info
FIELD Manufacturer
SPLIT ManufacturersModelName splitval=' ',minlength='4'

The goal of these new sections would be to create lists of strings (processed tag values) that could be referenced by name in the header section. At this point, the tagvaluelist sections would simply be creating two lists of strings, but at runtime, these would be evaluated to image-specific lists of tags:

patient_info                    scanner_info
-------------                   --------------
HOMER                       SIEMENS
SIMPSON                     SOMATOM
000991991991                    Definition
E123456
(M\d+)

The existing header section pattern would remain unchanged ([ACTION] [FIELD] [VALUE]), except now, whereas the FIELD must either be a header named field or the keyword ALL, tagvaluelist names could be specified in actions.

Propsed Recipe (%header) I'm listing two options here... these are referenced below in the implementation/processing replacement section. Option A:

%header
REMOVE patient_info
REPLACE scanner_info var:myvar

Option B:

%header
REMOVE func:getTags(patient_info)
REPLACE func:getTags(scanner_info) var:myvar

Implementation/Processing Replacement (again, pushing off a refactor for performance for the time being)

Within replace identifiers, before processing actions, we would need to convert the tagvaluelists into specific tags to be acted on. This would need to occur before actions were applied to the dicom file, and given our sample would create the following tag lists using the tagvaluelists. The recipe option A and B that I listed above would just drive what we think is most clear in the recipe - would we want an implicit or explicit conversion from tagvaluelists to tag values. Regardless of the option chose, the tagvaluelists could be converted to actual tags:

patient_info tags
------------
(0010,0010) : PN   Len: 14     PatientsName
(0010,0020) : LO   Len: 12     PatientID
(0010,1000) : LO   Len: 8      OtherPatientIDs
(0010,1001) : PN   Len: 8      OtherPatientNames 
(0010,21B0) : LT   Len: 90     AdditionalPatientHistory 
(0019,1091) : DS   Len: 6      <Unknown Tag>
(0019,1092) : DS   Len: 6      <Unknown Tag>

scanner_info tags
------------
(0008,0070) : LO   Len: 8      Manufacturer 
(0008,1090) : LO   Len: 22     ManufacturersModelName 
(0009,0010) : LO   Len: 20     PrivateCreator10xx

At this point, standard the recipe actions could be performed in pretty much exactly the same way as a rule like REMOVE ALL or REPLACE ALL var:myvar works today. The only difference is that the ALL keyword would be a different value tying to the tagvaluelist which evaluates to a subset of the tags.

The Results Given these adjusments, the final deidentified header would look like this:

(0008,0050) : SH   Len: 10     AccessionNumber                Value: [999999999 ]
(0008,0070) : LO   Len: 8      Manufacturer                   Value: [Replacement Variable Value]
(0008,1090) : LO   Len: 22     ManufacturersModelName         Value: [Replacement Variable Value]
(0009,0010) : LO   Len: 20     PrivateCreator10xx             Value: [Replacement Variable Value]

vsoch commented 4 years ago

Okay first I'd like to clarify my understanding. This section:

%tagvaluelist patient_info
SPLIT PatientsName splitval='^';minlength='4'
FIELD PatientID
FIELD OtherPatientIDs
REGEX (M\d+)

%tagvaluelist scanner_info
FIELD Manufacturer
SPLIT ManufacturersModelName splitval=' ',minlength='4'

Says that for each named tagvaluelist, we can ultimately expect a list of strings. The commands relevant are:

SPLIT: is in the format SPLIT <field> <parameters> and the expected parameters can be minlength and splitval. So if the user doesn't require a minlength, if a string is split and one entry is an empty string, I assume this is removed - so is the implicit minlength 1? And if the user says SPLIT but provides no value, is the default an empty space? Do we clean up the split too, stripping of whitespace?
FIELD says just include the entire content of whatever is provided at the field (no splitting or custom parsing)
REGEX: this is confusing because you don't provide any field. It's not clear how this is applied, because so far one line has equaled one list to produce. Are you saying that we check ALL fields for this regular expression? What is this particular expression actually checking for? Is the idea that we are actually doing a re.search for every item in a %tagvaluelist - what if someone adds a REGEX but then uses the list in a context where it isn't appropriate? So if we wanted to do REGEX against a specific field would we do:

REGEX PatientID Sam

But how is that different from:

REMOVE ALL re:Sam

Other than being included in this list first?

I think what would be most intuitive is to have this set of key values (the name of the tag list and the list itself) to the item lookup, and then treat it as a variable. I'm taking your example and modifying because we need an identifier for a taglist. What is still missing is a way to say "This is going to remove based on checking values, and not field names" - do you have a suggestion?

%header
REMOVE tagvaluelist:patient_info
REPLACE tagvaluelist:scanner_info var:myvar

I was going to add VALUE after REMOVE, but I think there could be a case of a field called VALUE and then the recipe format would break because it indicates two different things. The above is also assuming it's called a "tagvaluelist" - maybe we can think of something better. Mixing python code in with the header doesn't directly tell the user where that function is from, which is why I didn't choose your second example. But for the example above, if we used patient_info, it would hit a regular expression.

vsoch commented 4 years ago

Also taking into account what a more full recipe looks like (https://github.com/pydicom/deid/blob/master/examples/deid/deid.dicom) with filters:

FORMAT dicom

%filter dangerouscookie

LABEL Criteria for Dangerous Cookie
contains PatientSex M
  + notequals OperatorsName bold bread
  coordinates 0,0,512,110

%filter bigimage

LABEL Image Size Good for Machine Learning
equals Rows 2048
  + equals Columns 1536
  coordinates 0,0,512,200

%header

ADD PatientIdentityRemoved Yes
REPLACE PatientID var:id
REPLACE SOPInstanceUID var:source_id

We are basically proposing to add a section that is general and lets the user define a list of things based on some parsing / criteria. This means that reading in the recipe would store the actions with the recipe object, but the actual derivation of the list would be run at the onset of reading each dicom. I'm wondering if we would want to make this ability to define groups of tags as something more general that can be potentially used in other functions for deid (e.g., filter groups). For example, what if instead of tagvaluelist we did:

%group patient_info
SPLIT PatientsName splitval='^';minlength='4'
FIELD PatientID

If we allow for expansion of field names, we should allow them here too.

%group patient_info
SPLIT PatientsName splitval='^';minlength='4'
FIELD startswith:Patient

The use of group implies a list, even if it's just one item. And then in the recipe, it's referenced as a group, and used in a way that is allowed.

%header
REMOVE VALUE group:patient_info
REPLACE group:scanner_info var:myvar

And then VALUE would be a special case, but again, it could be a field so that might not be the perfect solution. But since it's general, if/when it's requested, it could be used as a filter parameter as well:

%filter dangerouscookie

LABEL Criteria for Dangerous Cookie
contains group:patient_indo
  + notequals OperatorsName bold bread
  coordinates 0,0,512,110

This isn't something we'd develop now, but the idea is that we should be able to extend the usage if requested, and not have some variable name that is hard coded just for the header section. Technically, another section is not nested in header, and shouldn't just be associated with it.

Let me know your thoughts @wetzelj.

wetzelj commented 4 years ago

Your understanding of SPLIT and FIELD are spot on. I would think that for both of these we should trim leading/trailing whitespace and eliminate empty strings. As far as the default parameters go, I think that using an default minlength=1 and default split value of something would be acceptable. In my mind, there are really two values that could reasonably be the default split value - either empty space or ^. I'm making an assumption that the ^ character is used in the header fields because this is how the field would be received in an HL7v2 message. But honestly, I'd probably flip a coin on this decision.

Regarding REGEX... My thought was that in context of a %tagvaluelist the REGEX type would never allow a field. The format would always only be REGEX <pattern>.

Functionally, specifiying:

%tagvaluelist mpi_number
REGEX (M\d+)

%header
REMOVE tagvaluelist:mpi_number

is exactly the same as:

REMOVE ALL re:(M\d+)

The only benefit/reason to potentially allow the REGEX type in the tagvaluelist is for logical grouping. If we wanted to create a group to handle other patient information, it creates a nice grouping to encapsulate the REGEX pattern for the master patient identifier in with the other patient info rather than using a separate REMOVE ALL re:(M\d+) command.

I think what would be most intuitive is to have this set of key values (the name of the tag list and the list itself) to the item lookup, and then treat it as a variable. I'm taking your example and modifying because we need an identifier for a taglist. What is still missing is a way to say "This is going to remove based on checking values, and not field names" - do you have a suggestion?

If I understand this correctly, it's mainly about ensuring that we're visually clear in the action entries [ACTION] [FIELD] [VALUE] that the the [FIELD] in this instance is the conversion of a value list to a list of tags. What do you think of something like this:

%header
REMOVE TAGS_MATCHING(tagvaluelist:patient_info)
REPLACE TAGS_MATCHING(tagvaluelist:scanner_info) var:myvar

It doesn't follow any other pattern in deid, but the similarity to an excel-type function may be widely understood.

I think that expanding this functionality to a full recipe would be awesome.

The only thing that I was a little unclear on was the use of the VALUE keyword. Is my understanding below correct?

With the %group:patient_info pattern, the defined group would still be the list of strings (HOMER|SIMPSON|E1234) and so this list of strings. Since it's a list of strings, it could be used in the %filter sections. The VALUE keyword within REMOVE VALUE group:patient_info would then be the trigger that causes the conversion the list of value strings into a list of tags.

vsoch commented 4 years ago

The typical usage for a REMOVE is to target a FIELD as the second variable, e.g.,

REMOVE PatientID

However in the way you suggested it, you are providing a list of values, but they aren't for fields - they are for a list of values that could be found in any field. But if we wrote this:

REMOVE group:patient_info

intuitively that reads as "Remove the group of friends represented in patient_info" and not "remove the group of fields that include any values from patient_info. Thus, we would need a way to make it clear that we are searching over values and not fields. Even saying this:

REMOVE ALL group:patient_info

Would be better to say "Remove all fields that are in the group "patient_info". But that's probably not clear enough, because group still could be a list of values, no? Actually, now that I think of it, even saying:

%group patient_info
FIELD PatientID

Is confusing because it's not clear if we want the field name itself, or the value inside. We might actually want:

%group patient_info
VALUE PatientID

to explicitly say the value. And then for the action, deid doesn't know the difference between a group of values vs. fields (or even potentially both if the user does something weird). So I had suggested adding VALUES to indicate this:

REMOVE VALUES group:patient_info

But that's not very good because VALUES could be a field name, potentially. But what if we just simplified it and made the group type explitit?

%values patient_info
FIELD PatientID

and then saying

REMOVE values:patient_info

"Remove all values that include anything in the list patient_info" and would be different than

REMOVE fields:patient_info

"Remove all fields that are in the list patient info.

In summary:

A %values section defines a list of values, FIELD implies a get()
A %fields section defines a list of fields, FIELD refers to the name itslef
Either, when found as a prefix, indicates to remove / take action based on either a field or value found.

Ah, and I hate Microsoft products, so I am definitely not a fan of the Excel function look, haha. :)

I'm closing up shop soon, so likely I won't respond again until tomorrow. Let me know your thoughts on the above!

wetzelj commented 4 years ago

This approach is great! It keeps the recipe clear and explicit while providing the flexibility needed.

Let me know how you would like to proceed from here. In one of your prior responses, you mentioned that you would "likely want to take charge of the work", however, I'd like to assist in any way you think would be beneficial.

vsoch commented 4 years ago

I should be able to make time this weekend to get this underway! I will need huge help to test, possibly write a few new tests, and two issues I’m interested in about updating the version requirement for pydicom and lifting for matplotlib. I’m hugely busy this morning as I have meetings and a podcast recording, but hopefully might even be able to get started this evening. How would you like to proceed with #113 - I’d possibly like to include this feature before this next refactor and it would be good to get it tested and reviewed.

wetzelj commented 4 years ago

I've done some preliminary testing on #113 and will be performing some additional testing this afternoon. In general, I think it is good functionality to have at our disposal. We'll definitely be able to help with testing.

In general- are you okay with me opening other issues? I'd like to get an issue out there for the Private Tag inclusion.

vsoch commented 4 years ago

Yes 100%! I think here is how I see next steps:

finish up #113 since REMOVE with a function is useful to have for the current implementation
I'll work on adding the features we discussed here for the next PR
I think this recipe update will likely require a version bump, and once we have bump, I see two rounds of work - the first to optimize how the parsing is done, and then the second to make it applicable to private tags too.

In the last point, figuring out if we can update pydicom is important as well.

vsoch commented 4 years ago

Just a quick note - I'm thinking instead of splitval to just use by since a parameter for SPLIT implies split. So something like:

%values cookie_names
SPLIT PatientsName by='^'; minlength='4'

Is there any reason to not allow any number of parameters after SPLIT and the field (or expander) so instead we would just do:

%values cookie_names
SPLIT PatientsName by="^" minlength=4

But hmm I'm rethinking this now - it would get weird with a space within the comma. The ; is just so ugly, but it does strongly indicate usually the end of a line, so maybe it's not such a bad idea :P

vsoch commented 4 years ago

Another question - would a # ever be a split value? The reason is because I want to clean up any comments on the line. If splitvalue is #, then we would run into trouble!

wetzelj commented 4 years ago

Good thought. I like the way that reads. As for the # as a delimiter - personally, I think it’s an okay limitation to say it can’t be used as the delimiter.

vsoch commented 4 years ago

Groups were added in #120.

pydicom / deid

Discussion for tag list groups #115