petermr / docanalysis

Semantic analysis of text documents including sentence and paragraph splitting
Apache License 2.0

Anomalous entity entries in CSV output #33

Open EmanuelFaria opened 1 year ago

EmanuelFaria commented 1 year ago

ISSUE: When using sciSpacy to create the CSV, I noticed a number of problems with the entities it found and the labels attached to them.

NOTE: I suspect at least some of the problem is due to the output being comma-delimited, since entity text and sentences can themselves contain commas. I propose we try tab-delimited output; I'll re-run this test corpus and compare.
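To make the delimiter hypothesis concrete, here is a minimal sketch using only Python's standard `csv` module. It does not use docanalysis code; the column names (`entity`, `label`, `sentence`) and the sample row are made up for illustration. It shows that a field containing commas breaks a naive comma split, while a quoted CSV or a tab-delimited file both round-trip cleanly.

```python
import csv
import io

# Hypothetical row: both the entity text and the sentence contain commas.
FIELDS = ["entity", "label", "sentence"]
rows = [
    {
        "entity": "fever, chills",
        "label": "DISEASE",
        "sentence": "Patients reported fever, chills, and fatigue.",
    }
]


def write_rows(rows, delimiter):
    """Serialize rows to text, quoting only fields that need it."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=FIELDS,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text, delimiter):
    """Parse text back into a list of dicts with the csv module."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


comma_text = write_rows(rows, ",")
tab_text = write_rows(rows, "\t")

# A naive split on commas ignores quoting and yields too many fields.
naive_fields = comma_text.splitlines()[1].split(",")

print(len(naive_fields), "fields from naive split; expected", len(FIELDS))
print(read_rows(comma_text, ",") == rows)
print(read_rows(tab_text, "\t") == rows)
```

If the anomalies come from a tool that splits on commas without honoring quotes, switching to tabs should help. If they persist in the tab-delimited run, the cause is more likely the entity recognition itself.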

In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)

In this case I was focusing on which entities were mis-labeled as DISEASE.

The types of errors include the following being identified as DISEASE:

I also noticed some entities mis-labeled as CHEMICAL.

Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.
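A rough way to list the unexpanded abbreviations is sketched below. Only the `abbreviations_longform` column name comes from the output described above; the `entity` and `label` column names, the tab delimiter, the sample rows, and the uppercase heuristic are all assumptions for illustration.

```python
import csv
import io

# Hypothetical tab-delimited sample; real output columns may differ.
sample = (
    "entity\tlabel\tabbreviations_longform\n"
    "COPD\tDISEASE\t\n"
    "SLE\tDISEASE\tsystemic lupus erythematosus\n"
    "malaria\tDISEASE\t\n"
)


def looks_like_abbreviation(text):
    """Crude heuristic: a short token whose letters are all uppercase."""
    token = text.strip()
    letters = [c for c in token if c.isalpha()]
    return 2 <= len(token) <= 6 and bool(letters) and all(c.isupper() for c in letters)


def unexpanded_abbreviations(handle, delimiter="\t"):
    """Return entities that look like abbreviations but have no long form."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        row["entity"]
        for row in reader
        if looks_like_abbreviation(row["entity"])
        and not (row.get("abbreviations_longform") or "").strip()
    ]


missing = unexpanded_abbreviations(io.StringIO(sample))
print(missing)
```

The heuristic will miss mixed-case abbreviations and may flag gene symbols, so treat the result as a list of candidates to review rather than a definitive count.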