pydicom / deid

best effort anonymization for medical images using python
https://pydicom.github.io/deid/
MIT License
140 stars 43 forks source link

Incorrect logic in detect.py module #142

Closed rmw362 closed 3 years ago

rmw362 commented 4 years ago

Hi,

I really appreciate all of your efforts on this project...super helpful. I came across some behavior of the detect.py module that I believe is counter to how you intended the module to work.

First, the logic of how the detect.py module flags images is incorrect, and this is due to improper indentation in the evaluate groups function starting with "#At the end of the group, evaluate the inner group" below. This section needs to be indented to be included with the preceeding for loop (as I have done below), otherwise it is evaluating outer groups rather than inner groups (this is the current functionality as written). The result is inappropriate behavior where if the header meets the final criteria in the outer groups, it will be flagged regardless of whether it met criteria for previous groups. For example, I have an US DICOM which has 768 rows in the pixel data, and the image was flagged with every filter with equals Rows 768 as the last criteria. Thankfully a simple tab solves the issue. Additionally, you need to rename the output as overall_result to avoid a namespace conflict with results so that this solution works.

def _has_burned_pixels_single(dicom_file, force, deid):

    """has burned pixels single will evaluate one dicom file for burned in
       pixels based on 'filter' criteria in a deid. If deid is not provided,
       will use application default. The method proceeds as follows:
       1. deid is loaded, with criteria groups ordered from specific --> general
       2. image is run down the criteria, stops when hits and reports FLAG
       3. passing through the entire list gives status of pass

       The default deid has a greylist, whitelist, then blacklist
       Parameters
       =========
       dicom_file: the fullpath to the file to evaluate
       force: force reading of a potentially erroneous file
       deid: the full path to a deid specification. if not defined, only default used
       deid['filter']['dangerouscookie'] <-- filter list "dangerouscookie"
       --> This is what an item in the criteria looks like
            [{'coordinates': ['0,0,512,110'],
              'filters': [{'InnerOperators': [],
              'action': ['notequals'],
              'field': ['OperatorsName'],
              'operator': 'and',
              'value': ['bold bread']}],
            'name': 'criteria for dangerous cookie'}]

       Returns
       =======
       --> This is what a clean image looks like:
           {'flagged': False, 'results': []}
       --> This is what a flagged image looks like:
          {'flagged': True,
           'results': [
                  {'reason': ' ImageType missing  or ImageType empty ',
                   'group': 'blacklist',
                   'coordinates': []}
               ]
           }
    """
    dicom = read_file(dicom_file, force=force)

    # Load criteria (actions) for flagging
    filters = deid.get_filters()
    if not filters:
        bot.exit("Deid provided does not have %filter, exiting.")

    # Return list with lookup as dicom_file
    results = []
    global_flagged = False
    for name, items in filters.items():  
        for item in items:
            # description for each group across items
            flags = []
            descriptions = []
            for group in item["filters"]:
                group_flags = []  # evaluation for a single line
                group_descriptions = []
                # You cannot pop from the list
                for a in range(len(group["action"])):
                    action = group["action"][a]
                    field = group["field"][a]
                    value = ""

                    if len(group["value"]) > a:
                        value = group["value"][a]

                    flag = apply_filter(
                        dicom=dicom,
                        field=field,
                        filter_name=action,
                        value=value or None,
                    )
                    group_flags.append(flag)
                    description = "%s %s %s" % (field, action, value)

                    if len(group["InnerOperators"]) > a:
                        inner_operator = group["InnerOperators"][a]
                        group_flags.append(inner_operator)
                        description = "%s %s" % (description, inner_operator)

                    group_descriptions.append(description)

                # At the end of a group, evaluate the inner group
                flag = evaluate_group(group_flags)

                # "Operator" is relevant for the outcome of the list of actions
                operator = ""
                if "operator" in group:
                    if group["operator"] is not None:
                        operator = group["operator"]
                        flags.append(operator)

                flags.append(flag)
                reason = ("%s %s" % (operator, " ".join(group_descriptions))).replace(
                    "\n", " "
                )
                descriptions.append(reason)

            # When we parse through a group, we evaluate based on all flags
            flagged = evaluate_group(flags)

            if flagged is True:
                global_flagged = True
                reason = " ".join(descriptions)

                # If coordinates are empty, we derive from dicom
                if item["coordinates"] and "from:" in item["coordinates"][0]:
                    item["coordinates"] = extract_coordinates(
                        dicom, item["coordinates"][0]
                    )

                result = {
                    "reason": reason,
                    "group": name,
                    "coordinates": item["coordinates"],
                }

                results.append(result)

        overall_result = {"flagged": global_flagged, "results": results}
    return overall_result

Second, in the default deid.dicom file, the coordinates (at least for US filters) are set to the same as those included in the PixelAnonymizerScript from CTP. The problem is in CTP these are the pixels that are meant to be removed, however, in deid these are the pixels that form a mask of what to KEEP in the images (coordinates are set to 1 in the mask, all else 0, then mask is multiplied by original image). The result is that the bounding boxes of PHI determined by the detect client are kept rather than discarded. For example if you look at the filter below, these are the coordinates x0=0, y0=0, x1 = 1024, y1=60 that represent the bounding box that contains PHI and should be removed, not kept:

LABEL SIEMMENS # (CTP) contains Modality US

Either the default recipe needs to be changed, or the behavior of the mask needs to be changed (i.e. mask represents pixels to be removed, though this would result in erratic behavior for the SequenceofUltrasoundRegions filter as it would remove the actual ultrasound image every time it is present).

vsoch commented 4 years ago

Thank you for pointing this out! I would definitely be open to a pull request if you'd like to make these needed changes (which you've already figured out and should be given credit for). It would also be good to have a test you think appropriate to ensure that the error does not re-occur.

rmw362 commented 4 years ago

I would be happy to submit a PR (when I have a little bit more time later this week :) ). How would you like to handle the second issue regarding coordinates of PHI? Certainly changing the behavior of the mask is a bigger change (ie mask becomes 0s, everything else 1s) and would require you to remove the SequenceofUltrasoundRegions filter from the default deid.dicom file. Otherwise, someone would need to meticulously go through the list of filters and change them to include the inverse of the coordinates currently provided (that is if all of the provided coordinates come from CTP's PixelAnonymizer script, I only confirmed this behavior for US). If the second approach is desired, not sure I have the time to go through each of these and update (sorry!).

vsoch commented 4 years ago

Great! Definitely no worries on a quick PR, I'll be here when you need me to review!

For the second issue, the change was a response to a user that found this coordinate for Ultrasound data and thought it would be useful to add. I think the original intention was to get a box (the values) and then set the values to 0 within those bounds (see my notes here https://github.com/pydicom/deid/blob/master/deid/dicom/pixels/detect.py#L275) but then you are right, we use the coordinate here https://github.com/pydicom/deid/blob/master/deid/dicom/pixels/clean.py#L164 and create a mask with 1s, and then multiply, so the region that we want blacked out is the one included. If it's the case that the CTP coordinates are the opposite of the UltrasoundRegions (e.g., CTP tells us what to black out, ultrasound regions tells us what to include) then I think a reasonable approach would be to allow for a special case of cleaning, a setting to indicate "clean the inverse of this" (or basically set the value of the mask to 1). Something like:

LABEL Clean Ultrasound Regions
    present SequenceOfUltrasoundRegions
    keepcoordinates from:SequenceOfUltrasoundRegions

So then here we would just do another check:

        elif member.lower().startswith("keepcoordinates"):
            coordinate = member.replace("keepcoordinates", "").strip()
            criteria["keepcoordinates"].append(coordinate)
            continue

And then when we are doing the clean, we would first create a tuple of coordinates (instead of a list) where the first entry is the coordinate, and the second is the value to fill with (0 or 1 depending on keep or not). That would be here. And then we would keep the current loop, but instead of setting the value of the mask to 1 always, we would set it based on the tuple value.

What do you think? If you like, I can do this first round of changes to add the keepcoordinates, and then I'll give you a branch to PR your first change to. That should run the tests across both of our changes. Would that work?

vsoch commented 4 years ago

okay here you go! https://github.com/pydicom/deid/pull/143 Let me know what you think! I can either merge this (and then you can open the other PR separately) or you can PR to this branch if you want to tweak the logic. I'm good with either :)

vsoch commented 4 years ago

hey @rmw362 I'd like to get the fix in asap for the coordinates being reversed - did you see https://github.com/pydicom/deid/pull/143, and would you care to test it? And I'd also appreciate a PR with the first change that you suggested. Thank you!

vsoch commented 4 years ago

I've merged the fix for the keepcoordinates, I don't see the tab that you are referencing in the above huge snippet so I'll wait for your PR (or just showing me so I can make the fix). Thanks!

wetzelj commented 3 years ago

Hey @vsoch & @rmw362 - I came across this bug in my testing as well. However, I believe that the root cause of the issue is where the group_flags and group_descriptions are being initialized to []. Currently this is occurring on line 146/147 in detect.py - within the loop in which we're looping through the conditions set for the filter. This causes only the last definition line of the filter to be used as the only definition for the filter. I believe that the correct fix for this issue is to simply move these two statements above the for group in item["filters"] on line 145 so that the filter rules are appended and evaluated as a group.

I've written several tests evaluating filter conditions with multiple rules as well as filter conditions with inner_operators - so I think the tests cover all of the basic use scenarios.

Please take a look at my branch - fix/compound_filters. If you think this looks okay, I'll submit as a PR.

vsoch commented 3 years ago

@wetzelj looks good! Yes please open the PR!

vsoch commented 3 years ago

Should be closed by #152. Thank you @wetzelj !