Closed brentd313 closed 1 year ago
Thanks for the feedback! I'd be happy to implement that. I do have a quick question. When extracting multiple sections, I currently create a new BioCPassage and initialize it with the text concatenated from all the extracted sections (because the labeler expects a single passage), but I thought this solution seemed a bit patchy. Would it be preferable to modify the labeler to accept an arbitrary number of passages, or do you think concatenating the extracted sections into a single passage is sufficient?
Great to hear, thank you! Good question - given the labeler simply detects all mentions of the curated phrases before detecting negations/uncertainty, I think concatenating simplifies the implementation (if the labeler runs on separate passages, the outputs would have to be combined, but this aggregation happens "for free" when concatenated). Please let me know if that makes sense!
Great! That makes a lot of sense. Another case we should consider is when a particular document does not contain any of the sections specified by the --sections_to_extract
argument. In our use case, we decided to run the labeler on the original document (i.e. without extracting any sections) if neither a findings nor impression section were present, but I could imagine that in some cases, one may want to return an empty document if none of the specified sections are present. How do you think we should handle this case? If we want to allow the user to determine the behavior, I can add another argument along the lines of --extract_sections_strict
that when given, the labeler will run on an empty document if none of the specified sections are present. In this case, the default behavior would be to run the labeler on the original document if none of the specified sections are present, but if you think the default behavior should be to run on an empty document, I can implement that as well.
See commit 0f99137
for the labeler modified to take a list of sections to extract. Here are a few notes on the behavior implemented in this commit:
--sections_to_extract title1 title2 title3
, which will store a list ['title1', 'title2', 'title3']
--sections_to_extract
argument is not given, the stored list will default to ['impression']
sections_to_extract
are present in a given report, then the loader will return the original report document without extracting any sections (see previous comment for more discussion on this behavior)--sections_to_extract
flag is given but no section titles are passed, then the labeler will be run on the original documents as the stored list will be emptyIf I'm not mistaken, the current default behavior of the labeler (on the master branch) is to run on the original report documents without extracting any sections. In order to extract the impression section from each report, the --extract_impression
flag must be passed. Therefore, the default behavior of the modified labeler implemented in this commit is not actually the same as the current default behavior. I just wanted to bring this to your attention in case you would like for the default behavior to remain consistent.
See commit
0f99137
for the labeler modified to take a list of sections to extract. Here are a few notes on the behavior implemented in this commit:
- The list of sections to extract can be given as command line arguments in the format
--sections_to_extract title1 title2 title3
, which will store a list['title1', 'title2', 'title3']
- If the
--sections_to_extract
argument is not given, the stored list will default to['impression']
- If none of the sections in
sections_to_extract
are present in a given report, then the loader will return the original report document without extracting any sections (see previous comment for more discussion on this behavior)- If the
--sections_to_extract
flag is given but no section titles are passed, then the labeler will be run on the original documents as the stored list will be emptyIf I'm not mistaken, the current default behavior of the labeler (on the master branch) is to run on the original report documents without extracting any sections. In order to extract the impression section from each report, the
--extract_impression
flag must be passed. Therefore, the default behavior of the modified labeler implemented in this commit is not actually the same as the current default behavior. I just wanted to bring this to your attention in case you would like for the default behavior to remain consistent.
This is fantastic, thank you!
Besides that the changes look great to me.
Great! I just implemented those changes. Now, the labeler will behave as follows:
--sections_to_extract
), the loader will load the original documents without extracting any sections.--sections_to_extract
and a document that does not contain any of the given sections is encountered, a warning will be raised and the original document will be returned.Before I finalize the PR, I wanted to revisit the behavior in the case that a document does not contain any of the provided sections for extraction. It looks like we agree that (by default) the loader should return the original document in this case, but do you think it is worth adding an argument to modify this behavior if desired? If you think that would be worthwhile, I can add a boolean argument --extract_strict
, which would instruct the loader to return an empty document if none of the sections to extract are present. Let me know what you think.
Great! I just implemented those changes. Now, the labeler will behave as follows:
- By default (i.e. if no section titles are provided using
--sections_to_extract
), the loader will load the original documents without extracting any sections.- If section titles are provided using
--sections_to_extract
and a document that does not contain any of the given sections is encountered, a warning will be raised and the original document will be returned.Before I finalize the PR, I wanted to revisit the behavior in the case that a document does not contain any of the provided sections for extraction. It looks like we agree that (by default) the loader should return the original document in this case, but do you think it is worth adding an argument to modify this behavior if desired? If you think that would be worthwhile, I can add a boolean argument
--extract_strict
, which would instruct the loader to return an empty document if none of the sections to extract are present. Let me know what you think.
That would be a nice addition in case users want that behavior instead! I'm supportive of either including or excluding it, so up to you.
I just implemented that functionality. Now, the user can provide the --extract_strict
argument, which will instruct the labeler to return an empty document if none of the sections given by --sections_to_extract
are present in a given document. I think that should finalize the PR! Let me know if there is anything else you would like me to do.
Thanks for the feedback! I'd be happy to implement that. I do have a quick question. When extracting multiple sections, I currently create a new BioCPassage and initialize it with the text concatenated from all the extracted sections (because the labeler expects a single passage), but I thought this solution seemed a bit patchy. Would it be preferable to modify the labeler to accept an arbitrary number of passages, or do you think concatenating the extracted sections into a single passage is sufficient?