vsoch commented 4 years ago

Description

This is a first shot to add parsing of private tag groups for deid. I'll definitely need help to do further testing, possibly add more tests for extracting the tags as fields, and tips based on dicom expertise / real world use cases that I might not be aware of. This PR addresses:

Related issues:

119: there isn't a global get/set strategy, so I wound up needing to walk through the code and add handling for these special cases. The other option is doing some much larger refactor and wrapping the dicom object with some general get() and set() functions to return fields and values, respectively, but the class already has a get(), and the logic to get or set a value doesn't always seem consistent between cases. I'm a bit unhappy with the fact that the implementation doesn't have a clean solution like this, but I'm thinking this should be addressed in another PR - this one is already quite big and will require a lot of work.
118 this is really the more important issue of the two - we want private tags to be included in the larger parsing, if they aren't removed.

@wetzelj here is my thinking for how we should go about this:

finish the PR here so that it works for your use case. I'm interested to know about spots that I miss, bugs that you run into, and (even better) some more tests that we can add to ensure that the newly added functionality works as expected! Note that I've added a test_dicom_fields.py file that does some basic parsing of dicom private tags. What other tests should we have there?
- when this PR is merged, I want to focus on the user interaction. Specifically, there should be a wrapper for a Dicom that exposes some get_field and get_value function that will work for any of the data types that might be found, and also a DeidDicom class that makes it much easier to run a get and replace action.

Other small changes

I don't require an ids field to be provided to replace_identifiers. It could be that there is a recipe, and values will be extracted for a dicom without running get_identifiers first.
dicom_dir() is a new function in fields.py that will essentially return dicom.dir() plus the tags keys. This allows us to easily iterate through all dicom fields (including private tags) if they weren't removed.
I added the cat.dcm as a testing dataset under animals, namely because it's the only example I have with private tags and is needed for testing.

That's about as far as my dinosaur brain can think through this today. Hopefully this is a good start, definitely fun to work on!

wetzelj commented 4 years ago

Sorry for my delay in getting back to this. I started a review this morning and have been trying to do some testing.

I believe that there's an issue that may require a change in approach. Currently, the dicom_dir function with fields.py returns a list of strings or tags. Since the type of the key value is variable, at present, this causes exceptions throughout because once we start operating on the Tag-keyed items, we're trying to compare a Tag object to a string.

fields.py - ln 222
- In this case, I question whether or not it makes sense to globally skip private tags. Since these are vendor, scanner, model specific, it may be better to force users to handle these in recipes.
actions.py - ln 105
header.py - ln 315

My gut reaction says that it may be better to convert everything to a string within dicom_dir() or somehow separate the string-key lists vs the tag-key lists - but this would then cause some other necessary conversion later. Otherwise, keeping it as is, we'll have to introduce conversions at the points listed above (which is probably not a exhaustive list - Knowing that you'd have a better perspective of impacts across the board, I stopped and decided to comment when I got to these three).

Overall, I'm starting to do some testing/review now and would be more than happy to add test cases as I work through the testing.

vsoch commented 4 years ago

@wetzelj I'm aware of the list of both tags and strings, we need to preserve the tag objects to correctly index the dicom still.

For https://github.com/pydicom/deid/blob/90bfbd6c0e25ac572701e0ee3aa24bcc1d545a41/deid/dicom/fields.py#L222 can you show a case of when skip would want to reference a private tag? As it currently is implemented, the contenders will include private tags if the requester has not removed them already and then return them in the list. Why is this an issue?

I'm also aware of https://github.com/pydicom/deid/blob/90bfbd6c0e25ac572701e0ee3aa24bcc1d545a41/deid/dicom/actions.py#L105, but since private tags are flat (or are they?) I didn't think this was an issue. I'm not sure how we would go about expanding a private tag index. Do you have an example? An actual example in data, and then how you would want it parsed, would be needed to figure this out. Right now, it would be essentially skipped because the expanded fields don't exist.

https://github.com/pydicom/deid/blob/90bfbd6c0e25ac572701e0ee3aa24bcc1d545a41/deid/dicom/header.py#L315 is definitely a bug, because the user can only provide strings and we are comparing against a list of strings and tags. But also note that the section you linked is for the put config, which is separate from the deid recipe (it's provided by the library).

The approach to convert everything to a string will lose the ability to index the dicom using the tag. Do you have a suggested approach then for being able to:

given a string, detect if it's a tag
generate the tag from it, and then do the interaction
convert back if necessary

That seems like a lot of work to have to do, because it would mean any time a tag isn't found as a string we'd need to try to test it being a tag. And then try to generate it on the fly. It seems like the approach that I took to try to maintain using the actual tag to the largest extent possible avoids that, but you are correct that we need to look out for these edge cases / errors where the tag would be missed.

vsoch commented 4 years ago

Overall, I'm starting to do some testing/review now and would be more than happy to add test cases as I work through the testing.

Great thank you! I greatly appreciate this - it's really been lovely to work with you and get your help and feedback after I do a large round of changes.

wetzelj commented 4 years ago

can you show a case of when skip would want to reference a private tag?

Honestly, no. I can't. :) The problem isn't that the private tag is in skip. The problem is that the private tag becomes a contender, since dicom_dir() is used to build the contenders list. If we agree that private tags could never be used as "skips", the following would be perfectly acceptable.

if  isinstance(contender, str) and contender in skip:
            continue

but since private tags are flat (or are they?)

Unfortunately, no, they're not guaranteed to be flat. The image below shows a sample from one of our images to be deidentified where there is a private sequence. Regardless, as with the skip value above, the breaking issue here is that "x" ends up being a Tag object and we're trying to use it in the second parameter of re.search().

PrivateSequence

I could potentially see a need to say something like REMOVE ALL (0029, 1043) and have it remove the items from both of the sequence items in the image above. But to be quite honest, I think that is the lesser use-case to the need to include the private tags when performing the REMOVE ALL values:phi scenario.

is definitely a bug, because the user can only provide strings and we are comparing against a list of strings and tags. But also note that the section you linked is for the put config, which is separate from the deid recipe (it's provided by the library).

Yes, this is in the put config, so it's not 100% relevant to my recipe-based solution...

As far as a suggestion goes - As I've been trying to come up with a suggestion that I like, one (probably naive) question keeps coming up (and yes, I realize that this would upend the design, so I'm not necessarily suggesting it): Is there some benefit that we get by building the list of fields vs using the pydicom Dataset object directly? It seems to me that, rather than dealing with the string identifiers and then using the get() to actually get the tag, we could be acting directly on the Dataset object. If we were to be able to act on the Dataset object itself, the entire question of name vs Tag goes away. Everything would be a Tag.

Thoughts? Am I crazy? :)

vsoch commented 4 years ago

The reason I decided to leave like this:

if contender in skip:
            continue

is because someone potentially could provide a tag in the list, in which case it would be skipped.

Is there some benefit that we get by building the list of fields vs using the pydicom Dataset object directly?

In that the dicom.dir() is just a list of strings, and that doing a dicom.get() works for a string or tag (on the Dataset object) we don't gain anything in not extracting some lookup (of strings) and tags first. What we lose, however, is being able to run a get_identifiers to preview the key value pairs, and then the replace_identifiers to move forward with it. You are correct that having the lookup is an extra layer, but I think it's useful for this purpose. Happy to consider another design if something more elegant is derived that allows for this functionality to persist.

vsoch commented 4 years ago

A potential refactor that I could see is to derive a recursive model of a Field, where it would expose get, set, and (inner) handle all of the discrepancies between tag and other. But that would just be cleaning up the current PR here and doing a major refactor - if the design here is fundamentally flawed that probably won't work either. But it would be an overall better design to provide a consistent interface to interact with "some field object from a dicom Dataset"

vsoch commented 4 years ago

And sharing ideas is never crazy - it's great and fun, and better to get them out for discussion than let them cause a ruckus in your frontal and parietal lobes!

wetzelj commented 4 years ago

There's still a bug with if contender in skip:: Given: contenders = ['PatientID', (0009,0001)] skip = ['Field1', 'Field2']

When performing this loop and on the iteration for (0009,0001), we get an exception "Cannot compare Tag with non-Tag item". If we want to enable Tags as skips, then I think we're going to need to include type comparisons as well.

I'm going to let the rest of it sit overnight and see what new ideas come to me tomorrow.

vsoch commented 4 years ago

I'm looking over get_fields, and the primary use of the list skip is to prevent the lookup from returning PixelData and other fields that normally aren't parsed. Now that contenders comes from dicom_dir, I can see that this loop would be triggered. So - I'm going to test providing a list of tags to the function:

# skip is a list of private tags from your animal image
[(0019, 0010),
 (0019, 1007),
 (0019, 1021),
 (0019, 1028),
 (0019, 1030),
 (0019, 10f5),
 (0019, 10fa),
 (0019, 10fb),
 (0019, 10fc),
 (0019, 10fd),
 (0019, 10fe)]

# this should trigger an error?
get_fields(dicom, skip=skip)

but there isn't an error, I get back the full data structure (private tags included!)

{'AcquisitionContextSequence': '[]',
 'AcquisitionDate': '20200209',
...
 'ViewCodeSequence__CodeMeaning': 'antero-posterior',
 'ViewPosition': 'AP',
 'WindowCenter': '2048',
 'WindowWidth': '4096',
 (0011, 0003): 'Agfa DR'}   # this is the only private tag that wasn't in the skip list

so I'm not able to reproduce this (I don't get an exception):

When performing this loop and on the iteration for (0009,0001), we get an exception "Cannot compare Tag with non-Tag item". If we want to enable Tags as skips, then I think we're going to need to include type comparisons as well.

Could there be some differences in our versions of pydicom, or something like that? Can you provide me with an exact reproduction of this error? Also note in case you missed it that we've tested expand_field_expression to use a string to return a list of tags (works!)

print("Testing that we can also search private tags based on numbers.")
fields = expand_field_expression(
    dicom=dicom, field="contains:0019", contenders=contenders
)
[(0019, 0010),
 (0019, 1007),
 (0019, 1021),
 (0019, 1028),
 (0019, 1030),
 (0019, 10f5),
 (0019, 10fa),
 (0019, 10fb),
 (0019, 10fc),
 (0019, 10fd),
 (0019, 10fe)]

# We should have a tag object in the list now!
assert isinstance(fields[0], BaseTag)

wetzelj commented 4 years ago

Take a look at this (recreating the issue with the cat image): https://github.com/wetzelj/deid/commit/85de34862c321813609da168f9439391f82446ca

The two new test cases (in progress - for discussion purposes only) demonstrate the issues that I've seen above. Both of these cases generate exceptions and are representative of the issues that I hit when trying to run replace_identifiers on a "real" run.

I've also added a new dicom test file. The file was a Head CT. The image has been replaced and the header has been fully deidentified, but otherwise all fields included in the header maintain the structure as it was in the original image.

vsoch commented 4 years ago

Thank you @wetzelj ! For the testing of fields, I've reproduced the issue with having expand_sequences parsing a list of fields that have private tags. The fix I added just now is to parse these as a string: https://github.com/pydicom/deid/pull/125/commits/90e322b5cd1c9bc1bf9e584acb6cb7d8beb13fe8

What this means is that the list of expanded sequences, if a string for a private tag matches (unlikely) is unlikely to actually be useful for a further action. However, I don't see it being reasonable to support private tags nested in other private tags, so I think this is okay for now.

I'm not able to trigger this error locally when I run the test, however - is this leading to an error for you?


    def test_skip_list(self):
        # Demonstrates the bug with deid/dicom/fields.py - ln 222
        print("Test exclusion of items from skip list")
        from deid.dicom.fields import dicom_dir

        skip = ['PixelData']

        dicom = get_dicom(self.dataset)
        contenders = dicom_dir(dicom)        
        newcontenders = []

        for contender in contenders:
            if contender in skip:
                continue
            else:
                newcontenders.append(contender)

        with self.assertRaises(ValueError) as exc:
            pixeldata = newcontenders.index('PixelData')

        self.assertEqual("'PixelData' is not in list", str(exc.exception))

If you are getting an error (and I'm not) it is possibly related to a version of pydicom so we should test these details. Also, any changes that you add (tests, new datasets, which are greatly appreciated!) when all is said and done, you can PR to the branch here so that they are included.

wetzelj commented 4 years ago

I agree that there shouldn't be a conflict between named and non-named fields.

However, I think at some point, we will have to handle private tag sequences. If you step through the new medical image through test_conditional_expansion, you'll see that fields = get_fields(dicom, expand_sequences=True) does not expand private tag sequences. The private tags that are sequences remain as sequence Tag objects.

It's probably okay to take a pass on this issue for now. Looking at the new medical dicom image I uploaded - we would just have to be aware that if the user was processing a rule like:

REMOVE CodeValue Would remove tags within sequences: 'CodeValue' 'CTDIPhantomTypeCodeSequenceCodeValue' 'ProcedureCodeSequenceCodeValue' 'RequestAttributesSequenceRequestedProcedureCodeSequenceCodeValue' 'RequestedProcedureCodeSequence__CodeValue'

Whereas something like: REMOVE (0029, 0010) Would remove only: (0029, 0010) and would not remove the 2 occurrences of: (0029, 1140)__(0029, 0010)

BTW... I'm about to hop into a few meetings, but am planning to first submit a PR to this branch for just the new medical test images.

wetzelj commented 4 years ago

Regarding test_skip_list - yes, this is triggering an error. On the iteration of for contender in contenders when the contender value is (0011,0003) an exception is thrown: TypeError("Cannot compare Tag with non-Tag item")

Following is the complete list of packages with version that I'm testing against. Pydicom is still using 1.2.1 as this exact version is required in version.py. Could this be an issue with the version of python?

# Name                    Version                   Build  Channel
astroid                   2.3.1                    py36_0  
blas                      1.0                         mkl  
certifi                   2019.9.11                py36_0  
colorama                  0.4.1                    py36_0  
cycler                    0.10.0                   pypi_0    pypi
deid                      0.1.41                    dev_0    <develop>
icc_rt                    2019.0.0             h0cc432a_1  
intel-openmp              2019.4                      245  
isort                     4.3.21                   py36_0  
lazy-object-proxy         1.4.2            py36he774522_0  
matplotlib                2.1.2                    pypi_0    pypi
mccabe                    0.6.1                    py36_1  
mkl                       2019.4                      245  
mkl-service               2.3.0            py36hb782905_0  
mkl_fft                   1.0.14           py36h14836fe_0  
mkl_random                1.1.0            py36h675688f_0  
numpy                     1.17.2                   pypi_0    pypi
pandas                    0.25.1           py36ha925a31_0  
pip                       19.2.3                   py36_0  
pydicom                   1.2.1                    pypi_0    pypi
pylint                    2.4.2                    py36_0  
pynetdicom                1.4.1                    pypi_0    pypi
pyparsing                 2.4.2                    pypi_0    pypi
python                    3.6.9                h5500b2f_0  
python-dateutil           2.8.0                    py36_0  
pytz                      2019.2                     py_0  
pyyaml                    5.1.2            py36he774522_0  
setuptools                41.2.0                   py36_0  
six                       1.12.0                   pypi_0    pypi
sqlite                    3.29.0               he774522_0  
typed-ast                 1.4.0            py36he774522_0  
vc                        14.1                 h0510ff6_4  
vs2015_runtime            14.16.27012          hf0eaf9b_0  
wheel                     0.33.6                   py36_0  
wincertstore              0.2              py36h7fe50ca_0  
wrapt                     1.11.2           py36he774522_0  
yaml                      0.1.7                hc54c509_2

vsoch commented 4 years ago

Weird, so the line for that particular error (https://github.com/pydicom/pydicom/blame/caf0db105ddf389ff5025937fd5f3aa1e61e85e7/pydicom/tag.py#L156) was added 8 years ago, so it should be triggering for me too. Based on the function that I see here it looks like it's a less than function, so it means we are asking if a tag is somehow less than another object? I'm wondering - if you open up a PR with this test, does the error reproduce? It didn't trigger for me locally when I ran the test file - only the second example did.

For the expansion of private tags - I see the issue. I think we could work on this, but do you have an example image that has a sequence of nested private tags that we could use to develop?

vsoch commented 4 years ago

Oops and one correction to the above - we are using the equals comparator here https://github.com/pydicom/pydicom/blob/master/pydicom/tag.py#L168 and not less than (of course!). But this is still 8 years old, so the conclusions (that I should see the error too!) are the same.

vsoch commented 4 years ago

Oh! And here is a difference - I actually have pydicom 1.4.1

pydicom==1.4.1

which is (almost) at the latest - I think I installed it a while back because I wanted to test if everything continued to work. Could perhaps this be something that would explain the differences?

vsoch commented 4 years ago

That did it! I downgraded to 1.2.1 and I reproduced your issue:

======================================================================
ERROR: test_conditional_expansion (test_dicom_fields.TestDicomFields)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/deid/deid/tests/test_dicom_fields.py", line 123, in test_conditional_expansion
    expanded_fields = [x for x in fields if re.search(expanded_regexp, x)]
  File "/home/vanessa/Desktop/Code/deid/deid/tests/test_dicom_fields.py", line 123, in <listcomp>
    expanded_fields = [x for x in fields if re.search(expanded_regexp, x)]
  File "/home/vanessa/anaconda3/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

======================================================================
ERROR: test_skip_list (test_dicom_fields.TestDicomFields)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/pydicom/tag.py", line 168, in __eq__
    other = Tag(other)
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/pydicom/tag.py", line 97, in Tag
    long_value = int(arg, 16)
ValueError: invalid literal for int() with base 16: 'PixelData'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/deid/deid/tests/test_dicom_fields.py", line 93, in test_skip_list
    if contender in skip:
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/pydicom/tag.py", line 170, in __eq__
    raise TypeError("Cannot compare Tag with non-Tag item")
TypeError: Cannot compare Tag with non-Tag item

----------------------------------------------------------------------
Ran 3 tests in 0.060s

FAILED (errors=2)

vsoch commented 4 years ago

We probably need to be able to still support both at this point, I'll dig into this now that I've reproduced! I love it when that happens :)

wetzelj commented 4 years ago

For the expansion of private tags - I see the issue. I think we could work on this, but do you have an example image that has a sequence of nested private tags that we could use to develop?

Yep! See PR #126. Sequence (0029, 0010) contains other private tags.

We probably need to be able to still support both at this point, I'll dig into this now that I've reproduced! I love it when that happens :)

I'm pretty sure the only reason I was back at 1.2.1 was for deid. I'm going to upgrade to 1.4.1 (or latest?) Either way... Yay! You were able to reproduce the issue! :)

vsoch commented 4 years ago

okay I just added a fix for the issue of parsing through tags (and not being able to do the comparison) https://github.com/pydicom/deid/pull/125/commits/87a110335c149e486aca0de104d64a1742389b87#diff-0c993b1a3139505b68bd0fe50eedc74eR223. What I'm looking at now is that dicom_dir() doesn't return private tag sequences, the reason being that get_private doesn't return sequences with private tags. I'm going to look at this next - my thinking is that if dicom.dir() can return a sequence as long as it's a string index, the get_private should also be able to return private tags that are of type Sequence (or a tag equivalent?) Haven't looked deeply into it yet.

vsoch commented 4 years ago

One small note - it's (0029, 1140) that is private that also contains other private (the one you mentioned isn't). And i check for sequence via data_element.VR == "SQ".

vsoch commented 4 years ago

Ah I see what is happening - so we actually are getting all private tags with get_private, however we are unwrapping them from their structure. For example, to start we have:

(0029, 1140)  [Application Header Sequence]   2 item(s) ---- 
   (0029, 0010) Private Creator                     LO: 'SIEMENS MEDCOM HEADER'
   (0029, 1041) [Application Header Type]           CS: 'VIA_CARE_KV_WIND'
   (0029, 1042) [Application Header ID]             LO: 'ORIG WINDOW VALUES FROM CAREKV'
   (0029, 1043) [Application Header Version]        LO: 'V1 20120914'
   ---------
   (0029, 0010) Private Creator                     LO: 'SIEMENS MEDCOM HEADER'
   (0029, 1041) [Application Header Type]           CS: 'SOM 5 TPOS'
   (0029, 1042) [Application Header ID]             LO: 'SOM 5 NULLPOSITION'
   (0029, 1043) [Application Header Version]        LO: 'VB10A 20030626'

And then the result returns a flattened:

[(0029, 0010) Private Creator                     LO: 'SIEMENS MEDCOM HEADER',
 (0029, 1041) [Application Header Type]           CS: 'VIA_CARE_KV_WIND',
 (0029, 1042) [Application Header ID]             LO: 'ORIG WINDOW VALUES FROM CAREKV',
 (0029, 1043) [Application Header Version]        LO: 'V1 20120914',
 (0029, 0010) Private Creator                     LO: 'SIEMENS MEDCOM HEADER',
 (0029, 1041) [Application Header Type]           CS: 'SOM 5 TPOS',
 (0029, 1042) [Application Header ID]             LO: 'SOM 5 NULLPOSITION',
 (0029, 1043) [Application Header Version]        LO: 'VB10A 20030626']

So this get_private might work as we want, because it's essentially providing a lookup for private tags.

vsoch commented 4 years ago

So your concern is with respect to:

Would remove only:
(0029, 0010)
and would not remove the 2 occurrences of:
(0029, 1140)__(0029, 0010)

meaning that we somehow need to represent these nested tags as full fledged citizens when this action is happening... :/

wetzelj commented 4 years ago

I think that approach should work.

However, in starting to look in more detail at sequence expansion, I've noticed that there's actually a bug here. As currently implemented, when extract_sequences() flattens the sequence, it simply flattens the fields within the sequence, ignoring the fact that sequences can be arrays.

Given a header of:

(0008, 1032)    Procedure Code Sequence
  >BEGIN ITEM 1
  >(0008,0100)      Code Value : CT0001
  >(0008,0102)      Coding Scheme Designator : SIEMENS_RIS
  >(0008,0104)      Code Meaning : CT BRAIN WO IVCON
  >END ITEM 1
(0032,1032)     Requesting Physician : HIBBARD^JULIUS^^

This get_fields()/extract_sequences() function converts this (correctly in this case) to:

ProcedureCodeSequence__CodeValue : CT0001
ProcedureCodeSequence__CodingSchemeDesignator : SIEMENS_RIS
ProcedureCodeSequence__CodeMeaning : CT BRAIN WO IVCON
Requesting_Physician :  HIBBARD^JULIUS^^

However, when there are multiple items in the sequence:

(0008, 1032)    Procedure Code Sequence
  >BEGIN ITEM 1
  >(0008,0100)      Code Value : CT0001
  >(0008,0102)      Coding Scheme Designator : SIEMENS_RIS
  >(0008,0104)      Code Meaning : CT BRAIN WO IVCON
  >END ITEM 1
  >BEGIN ITEM 2
  >(0008,0100)      Code Value : CT_6352
  >(0008,0102)      Coding Scheme Designator : AGFA_PACS
  >(0008,0104)      Code Meaning : CT BRAIN WO CONTRAST
  >END ITEM 2
(0032,1032)     Requesting Physician : HIBBARD^JULIUS^^

The get_fields()/extract_sequences() function incorrectly converts this to below (losing the first occurrence item:

ProcedureCodeSequence__CodeValue : CT_6352
ProcedureCodeSequence__CodingSchemeDesignator : AGFA_PACS
ProcedureCodeSequence__CodeMeaning : CT BRAIN WO CONTRAST
Requesting_Physician :  HIBBARD^JULIUS^^

Instead, it is really important to maintain the integrity of these occurrences - potentially building something like this:

ProcedureCodeSequence__0__CodeValue : CT0001
ProcedureCodeSequence__0__CodingSchemeDesignator : SIEMENS_RIS
ProcedureCodeSequence__0__CodeMeaning : CT BRAIN WO IVCON
ProcedureCodeSequence__1__CodeValue : CT_6352
ProcedureCodeSequence__1__CodingSchemeDesignator : AGFA_PACS
ProcedureCodeSequence__1__CodeMeaning : CT BRAIN WO CONTRAST
Requesting_Physician :  HIBBARD^JULIUS^^

The cthead1.dcm file that I added with #126 has an instance of a private tag sequence with multiple occurrences. In my 125/new-samples branch, I've added an additional sample file with multiple-occurrence public tags. As well as a test case for this situation. Before submitting a PR, I want to make the test case less tightly coupled to the humans dataset.

All that said, do you think it would be better to create an additional issue for sequence expansion, and keep this PR to supporting private tags in non-Sequences?

vsoch commented 4 years ago

I see what you mean about interacting with fields directly, because when we do dicom.dir() we don't get private tags, when we do dicom_dir(dicom) we get private tags, but then they are unwrapped. I'm not really sure what action to take at this point, because we are extracting flattened structures but deid doesn't have a way for the user to even reference sequences.

wetzelj commented 4 years ago

I just saw your most recent comments... (after I posted my last one).

I noticed that too about how private sequences are unwrapped. Ultimately, I think there is quite a bit outstanding with sequences in general.

vsoch commented 4 years ago

But @wetzelj please see the example that I just posted - https://github.com/pydicom/deid/pull/125#issuecomment-601426435 at least in the case of get_private we do preserve each of entries indexed with 0 and 1. In the case that they are converted to strings, however, I can see this being lost.

vsoch commented 4 years ago

I can help only to the extent there is a reasonable idea / discussion for how to implement a change or fix. If it requires a total refactor of the library that sort of sucks, but I'm open to the idea.

wetzelj commented 4 years ago

I understand what you're saying in #125 (comment), but with the multiple occurrences in the lookup, looking up (0029, 1041) would result in multiple values. Maybe this is okay?

Either way, do you think that it would be worthwhile to forego the sequence issue to a new issue and focus this PR on non-sequence private tags? From my perspective, being able to handle non-sequence private tags has huge benefit.

vsoch commented 4 years ago

@wetzelj this work is all being done on your behalf, so I will go with whatever you prefer. If the library is improved from where it was, I say we should move forward. But I'd still like to work with you (in likely another scoped issue) to address these other issues, because I won't sleep at night if I've realized that my software sucks. :P

vsoch commented 4 years ago

Here is something to think about - how important is it to have separate get and set identifiers functions? Because I can imagine implementing a version that removes the intermediate dictionary layer and just interacts with the pydicom Dataset. Instead of deriving values and looping through later, we figure out how to do it once. We’d still need to balance that with some ability to define custom functions and lists, however.

wetzelj commented 4 years ago

Your software is great! These are just some little tweaks to make it more awesome than it already is.

Regarding separate get/set identifiers functions - I'm formulating my thoughts on this, I want to make sure that I have a coherent, well thought out proposal before I respond.

vsoch commented 4 years ago

Sounds good! Let me know after you'd had some time to think:

what still remains for this PR, is the library overall improved? Worth a release?
what your thoughts are for next steps

I do think it would be worth a try to maintain the current get/set identifiers, but write a class that takes in dicom files, custom functions and a recipe, and can just be used to clean the headers. It would be akin to the DicomCleaner but for header (and not pixels).

vsoch commented 4 years ago

Ping! Hope you are doing ok :)

vsoch commented 4 years ago

This PR isn't perfect, but it improves the library quite a bit, so I'm going to merge. Please open a new issue @wetzelj when you are ready to discuss next steps.

pydicom / deid

Add support to parse private tags #125

Description

Related issues:

118 this is really the more important issue of the two - we want private tags to be included in the larger parsing, if they aren't removed.

Other small changes