stanfordmlgroup / chexpert-labeler

CheXpert NLP tool to extract observations from radiology reports.
MIT License

Reports require end punctuation #8

Open kl2532 opened 5 years ago

kl2532 commented 5 years ago

Thanks for open sourcing your labeler! I'm running into the following error with the sample reports:

$ python label.py --reports_path sample_reports.csv
ERROR:root:Cannot process sentence 62 in 0
Traceback (most recent call last):
  File "NegBio/negbio/pipeline/ptb2ud.py", line 109, in convert_doc
    self.add_lemmas)
  File "NegBio/negbio/pipeline/ptb2ud.py", line 183, in convert_dg
    ann = annotations[annotation_id_map[node.index]]
IndexError: list index out of range

I believe the issue is due to the lack of punctuation at the end of the first sample report.

For example, if the input is:

Heart size normal and lungs are clear. No edema or pneumonia. No effusion

then the labeled report output is:

Heart size normal and lungs are clear. No edema or pneumonia. No effusion,,,0.0,,,0.0,,0.0,,,1.0,,,

However, the example labeled_reports.csv has:

Heart size normal and lungs are clear. No edema or pneumonia. No effusion.,1.0,,0.0,,,0.0,,0.0,,,0.0,,,

We can reproduce the example labels by adding a period to the end of the report, so that the input is:

Heart size normal and lungs are clear. No edema or pneumonia. No effusion.

The output then matches the example:

Heart size normal and lungs are clear. No edema or pneumonia. No effusion.,1.0,,0.0,,,0.0,,0.0,,,0.0,,,
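The workaround described above (adding a period to the end of a report) could be applied to a whole reports file before running the labeler. The following is only a minimal sketch: the helper names `ensure_end_punctuation` and `fix_reports` are hypothetical, and it assumes the reports file is a header-less CSV with the report text in the first column, which may not match every setup.

```python
import csv

# Characters treated as valid sentence-ending punctuation (an assumption).
END_PUNCTUATION = (".", "!", "?")


def ensure_end_punctuation(report: str) -> str:
    """Return the report with a trailing period if it lacks end punctuation."""
    stripped = report.rstrip()
    # Leave empty reports and already-terminated reports unchanged.
    if not stripped or stripped.endswith(END_PUNCTUATION):
        return stripped
    return stripped + "."


def fix_reports(in_path: str, out_path: str) -> None:
    """Copy a reports CSV, terminating the first column of every row.

    Assumes no header row and the report text in column 0 (hypothetical layout).
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in csv.reader(src):
            if row:  # skip blank lines
                writer.writerow([ensure_end_punctuation(row[0])] + row[1:])
```

For instance, `fix_reports("sample_reports.csv", "sample_reports_fixed.csv")` would write a copy (the output file name is made up here) that could then be passed to `label.py` via `--reports_path`.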

To summarize, do the radiology reports require punctuation at the end of each sentence?

alistairewj commented 5 years ago

This is likely a bug in NegBio. You can check out my pull request and should find that it works (I just tested it).

See https://github.com/ncbi-nlp/NegBio/pull/20