stanfordmlgroup / chexpert-labeler

CheXpert NLP tool to extract observations from radiology reports.
MIT License
328 stars 78 forks source link

Modified Loader to load findings and impression sections #36

Closed brentd313 closed 1 year ago

brentd313 commented 1 year ago

Thanks for the feedback! I'd be happy to implement that. I do have a quick question. When extracting multiple sections, I currently create a new BioCPassage and initialize it with the text concatenated from all the extracted sections (because the labeler expects a single passage), but I thought this solution seemed a bit patchy. Would it be preferable to modify the labeler to accept an arbitrary number of passages, or do you think concatenating the extracted sections into a single passage is sufficient?

jirvin16 commented 1 year ago

Thanks for the feedback! I'd be happy to implement that. I do have a quick question. When extracting multiple sections, I currently create a new BioCPassage and initialize it with the text concatenated from all the extracted sections (because the labeler expects a single passage), but I thought this solution seemed a bit patchy. Would it be preferable to modify the labeler to accept an arbitrary number of passages, or do you think concatenating the extracted sections into a single passage is sufficient?

Great to hear, thank you! Good question - given the labeler simply detects all mentions of the curated phrases before detecting negations/uncertainty, I think concatenating simplifies the implementation (if the labeler runs on separate passages, the outputs would have to be combined, but this aggregation happens "for free" when concatenated). Please let me know if that makes sense!

brentd313 commented 1 year ago

Great! That makes a lot of sense. Another case we should consider is when a particular document does not contain any of the sections specified by the --sections_to_extract argument. In our use case, we decided to run the labeler on the original document (i.e. without extracting any sections) if neither a findings nor impression section were present, but I could imagine that in some cases, one may want to return an empty document if none of the specified sections are present. How do you think we should handle this case? If we want to allow the user to determine the behavior, I can add another argument along the lines of --extract_sections_strict that when given, the labeler will run on an empty document if none of the specified sections are present. In this case, the default behavior would be to run the labeler on the original document if none of the specified sections are present, but if you think the default behavior should be to run on an empty document, I can implement that as well.

brentd313 commented 1 year ago

See commit 0f99137 for the labeler modified to take a list of sections to extract. Here are a few notes on the behavior implemented in this commit:

If I'm not mistaken, the current default behavior of the labeler (on the master branch) is to run on the original report documents without extracting any sections. In order to extract the impression section from each report, the --extract_impression flag must be passed. Therefore, the default behavior of the modified labeler implemented in this commit is not actually the same as the current default behavior. I just wanted to bring this to your attention in case you would like for the default behavior to remain consistent.

jirvin16 commented 1 year ago

See commit 0f99137 for the labeler modified to take a list of sections to extract. Here are a few notes on the behavior implemented in this commit:

  • The list of sections to extract can be given as command line arguments in the format --sections_to_extract title1 title2 title3, which will store a list ['title1', 'title2', 'title3']
  • If the --sections_to_extract argument is not given, the stored list will default to ['impression']
  • If none of the sections in sections_to_extract are present in a given report, then the loader will return the original report document without extracting any sections (see previous comment for more discussion on this behavior)
  • If the --sections_to_extract flag is given but no section titles are passed, then the labeler will be run on the original documents as the stored list will be empty

If I'm not mistaken, the current default behavior of the labeler (on the master branch) is to run on the original report documents without extracting any sections. In order to extract the impression section from each report, the --extract_impression flag must be passed. Therefore, the default behavior of the modified labeler implemented in this commit is not actually the same as the current default behavior. I just wanted to bring this to your attention in case you would like for the default behavior to remain consistent.

This is fantastic, thank you!

  1. Nice catch re: default behavior - it would be ideal if that was the same as before (running on the original document without extracting any sections).
  2. I'm on board with the functionality that if none of the requested sections are found (or none are passed) then the loader will return the original document, but it may be helpful to raise a warning to alert the user when this happens.

Besides that the changes look great to me.

brentd313 commented 1 year ago

Great! I just implemented those changes. Now, the labeler will behave as follows:

Before I finalize the PR, I wanted to revisit the behavior in the case that a document does not contain any of the provided sections for extraction. It looks like we agree that (by default) the loader should return the original document in this case, but do you think it is worth adding an argument to modify this behavior if desired? If you think that would be worthwhile, I can add a boolean argument --extract_strict, which would instruct the loader to return an empty document if none of the sections to extract are present. Let me know what you think.

jirvin16 commented 1 year ago

Great! I just implemented those changes. Now, the labeler will behave as follows:

  • By default (i.e. if no section titles are provided using --sections_to_extract), the loader will load the original documents without extracting any sections.
  • If section titles are provided using --sections_to_extract and a document that does not contain any of the given sections is encountered, a warning will be raised and the original document will be returned.

Before I finalize the PR, I wanted to revisit the behavior in the case that a document does not contain any of the provided sections for extraction. It looks like we agree that (by default) the loader should return the original document in this case, but do you think it is worth adding an argument to modify this behavior if desired? If you think that would be worthwhile, I can add a boolean argument --extract_strict, which would instruct the loader to return an empty document if none of the sections to extract are present. Let me know what you think.

That would be a nice addition in case users want that behavior instead! I'm supportive of either including or excluding it, so up to you.

brentd313 commented 1 year ago

I just implemented that functionality. Now, the user can provide the --extract_strict argument, which will instruct the labeler to return an empty document if none of the sections given by --sections_to_extract are present in a given document. I think that should finalize the PR! Let me know if there is anything else you would like me to do.