HMJiangGatech closed this issue 3 years ago
Thanks for the note. The PubTator format is not very strictly defined, so it's possible that some edge cases aren't covered yet.
Is this a public dataset, so that I can take a look at what exactly is violating bconv's expectations?
Yes, it is. But it seems very noisy, and has a lot of mismatched entities (even after I made bconv compatible).
ftp://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/Peng2016CID/CID.PubTator.txt.zip
Okay, many entries have an additional 7th field containing "Dictionary" or "Dictionary-Abb":
$ head -n 4 NCBI-pfizerCDPubMed.PubTator | tail -n 1
10023282 24 29 spasm Disease D013035 Dictionary
whereas the "specs" only mention 6 fields (with the last one being optional).
I'm not sure what to do about that. At the very least, the error message needs to be better; I'll add a patch for that.
Thanks! I am closing this issue, as I also don't see a proper way to process that file.
Thanks. I just pushed a couple of pending commits and bumped the version. The changes include a patch for an improved error message, but not a fix for the Pfizer-CTD format.
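For anyone who still wants to load the file, one possible stopgap is to cut the annotation rows down to the six documented fields before handing the file to a converter. The following is only a minimal sketch, not part of bconv: the helper name `strip_extra_fields` is hypothetical, and it assumes that annotation rows are tab-separated (the spaces in the excerpt above are presumably a rendering artifact) and that title/abstract lines contain no tabs. It does nothing about the mismatched entities mentioned earlier.

```python
def strip_extra_fields(lines, max_fields=6):
    """Yield PubTator lines with annotation rows cut down to max_fields columns.

    Hypothetical helper: assumes tab-separated annotation rows.
    """
    for line in lines:
        body = line.rstrip("\r\n")
        fields = body.split("\t")
        # Title/abstract lines ("PMID|t|...", "PMID|a|...") and blank separator
        # lines normally contain no tabs, so they pass through unchanged.
        if len(fields) > max_fields:
            # Drop everything after the sixth column, e.g. the trailing
            # "Dictionary" / "Dictionary-Abb" field.
            body = "\t".join(fields[:max_fields])
        yield body + "\n"


# Small demonstration; the title text is a placeholder, not real data.
sample = [
    "10023282|t|placeholder title\n",
    "10023282\t24\t29\tspasm\tDisease\tD013035\tDictionary\n",
    "\n",
]
cleaned = list(strip_extra_fields(sample))
```

In practice one would pass an open file object as `lines` and write the result to a new file. Rows with six or fewer fields are left as they are, so the sketch should not disturb entries that already follow the documented layout.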
I am trying to convert the CTD-Pfizer dataset to CoNLL format by
But I got the following error:
I am using the current version from GitHub.