howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Exclude from training set things that were thought might be software? #586

Open jameshowison opened 5 years ago

jameshowison commented 5 years ago

I just wondered what @kermitt2 is doing with the in-text mentions coded as "not software" (e.g., algorithms or databases). Are they included as negative examples, or are they dropped as confusing sentences (like sentences with coder disagreements)?

kermitt2 commented 5 years ago

Currently they are completely dropped; I only keep contexts with mention type software. It's a good idea to use them as negative examples. At the moment the only negative examples I use are random contexts without any annotations. The errors I currently see frequently from the recognizer are names of projects, algorithms, and datasets/databases being annotated as software.
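
To make the proposal concrete, here is a minimal sketch of how the "not software" contexts could be added as negative examples alongside the random unannotated ones. It is not the actual training code: the `Context` class, the `"not_software"` type string, and `build_training_set` are hypothetical names, and labels are assigned per context for simplicity (in a sequence labeler, a negative context would instead have no tokens labeled as software).

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Context:
    """A sentence-level context (hypothetical structure, for illustration only)."""
    text: str
    # "software", "not_software" (assumed name for the non-software code), or None
    mention_type: Optional[str]


def build_training_set(
    contexts: List[Context], n_random_negatives: int, seed: int = 0
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs: 1 = software mention, 0 = negative example."""
    # Current behavior: contexts coded as software are the positives.
    positives = [(c.text, 1) for c in contexts if c.mention_type == "software"]

    # Proposed change: keep contexts coded "not software" as negatives
    # instead of dropping them.
    hard_negatives = [(c.text, 0) for c in contexts if c.mention_type == "not_software"]

    # Current behavior: sample random contexts without any annotation as negatives.
    unannotated = [c for c in contexts if c.mention_type is None]
    rng = random.Random(seed)
    sampled = rng.sample(unannotated, min(n_random_negatives, len(unannotated)))
    random_negatives = [(c.text, 0) for c in sampled]

    return positives + hard_negatives + random_negatives


if __name__ == "__main__":
    demo = [
        Context("We analysed the data with SPSS.", "software"),
        Context("Clusters were found with the k-means algorithm.", "not_software"),
        Context("Samples were stored overnight.", None),
    ]
    for text, label in build_training_set(demo, n_random_negatives=1):
        print(label, text)
```

Since the "not software" contexts contain exactly the kinds of names (algorithms, databases) the recognizer currently confuses with software, they might be more informative negatives than random sentences.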