petermr / docanalysis

Semantic analysis of text documents including sentence and paragraph splitting
Apache License 2.0

scispacy entity results are hit and miss #32

Open EmanuelFaria opened 1 year ago

EmanuelFaria commented 1 year ago

I just noticed something strange...

I filtered the scispacy.csv to show only rows containing:

  1. ((sentences containing TNF) AND (entities containing TNF)) (see scispacy_match.pdf attached)
  2. ((sentences containing TNF) AND (entities **NOT** containing TNF)) (see scispacy_mismatch.pdf attached)

The latter turned up a bunch of results where TNF appears in the sentence but was not recognized as an entity. I don't see why it would detect the entity in some sentences and not in others.
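For anyone who wants to reproduce the two filters, here is a minimal sketch using only the standard library. The column names `sentence` and `entities` and the sample rows are assumptions made up for illustration; the real scispacy.csv may use different headers, so adjust them to match.

```python
import csv
import io

def split_by_entity_match(rows, term, sentence_col="sentence", entity_col="entities"):
    """Keep rows whose sentence contains `term`, split by whether the entity cell contains it too.

    The column names are hypothetical defaults, not taken from the real file.
    """
    match, mismatch = [], []
    for row in rows:
        if term not in row[sentence_col]:
            continue  # sentence does not mention the term at all
        (match if term in row[entity_col] else mismatch).append(row)
    return match, mismatch

# Made-up stand-in for scispacy.csv, only to show the shape of the filter.
SAMPLE_CSV = """sentence,entities
TNF levels rose after treatment.,"['TNF levels', 'treatment']"
TNF was measured in serum.,"['serum']"
IL-6 was unchanged.,"['IL-6']"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
match, mismatch = split_by_entity_match(rows, "TNF")
print(len(match), "matching rows;", len(mismatch), "mismatching rows")
```

To run it on the real file, replace the `io.StringIO(SAMPLE_CSV)` part with an open file handle, e.g. `open("scispacy.csv", newline="", encoding="utf-8")`.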


Another thing I noticed: I found a bunch of sentences with this typo: TNF-<space>𝛼 (TNF- 𝛼). scispacy caught the "TNF-" but left out the alpha because of the space after the dash. (See scispacy_TNF-space.pdf attached.) I don't know if there's anything we can do about that, but I thought it should be noted.
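One possible workaround, sketched here as an assumption rather than something docanalysis does today, would be to close the gap in the text before it is passed to scispacy. The helper name `close_hyphen_gap` is hypothetical. It only touches a hyphen that follows a word character and is followed by whitespace and a Greek letter, so ordinary spaced hyphens are left alone. Whether scispacy then tags "TNF-𝛼" as a single entity would still need to be checked.

```python
import re

# A hyphen that follows a word character and is followed by whitespace and a Greek
# letter. Covers the basic Greek block and the mathematical italic small letters
# (the "𝛼" in the PDFs is U+1D6FC, not the plain "α" U+03B1).
_HYPHEN_GAP = re.compile(r"(?<=\w)-\s+(?=[\u0370-\u03FF\U0001D6FC-\U0001D714])")

def close_hyphen_gap(text):
    """Remove the stray whitespace in patterns like 'TNF- α' (hypothetical helper)."""
    return _HYPHEN_GAP.sub("-", text)

print(close_hyphen_gap("Levels of TNF- \U0001D6FC were elevated."))
```

This is deliberately narrow: it does not try to repair other spacing typos, and it leaves cases like "a - b" or "well- known" untouched.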

scispacy_TNF-space.pdf scispacy_mismatch.pdf scispacy_match.pdf

scispacy.csv