lfurrer / bconv

Python library for converting between BioNLP formats
MIT License

pubtator to conll not working #1

Closed HMJiangGatech closed 3 years ago

HMJiangGatech commented 3 years ago

I am trying to convert the CTD-Pfizer dataset to CoNLL format with:

import bconv.bconv as bconv
coll = bconv.load('NCBI-pfizerCDPubMed.PubTator', fmt='pubtator')
with open("convert.conll", 'w', encoding='utf8') as fout:
    bconv.dump(coll, fout, fmt='conll', tagset='IOB', include_offsets=True)

But I got the following error:

  File "convert.py", line 2, in <module>
    coll = bconv.load('NCBI-pfizerCDPubMed.PubTator', fmt='pubtator')
  File "**/bconv/bconv/fmt/__init__.py", line 75, in load
    return _load(loader, mode, source, id_)
  File "**/bconv/bconv/fmt/__init__.py", line 82, in _load
    content = loader.load_one(source, id_)
  File "**/bconv/bconv/fmt/_load.py", line 52, in load_one
    return self.collection(source, id_)
  File "**/bconv/bconv/fmt/pubtator.py", line 36, in collection
    return Collection.from_iterable(docs, id_, basename(source))
  File "**/bconv/bconv/doc/document.py", line 364, in from_iterable
    for doc in documents:
  File "**/bconv/bconv/fmt/pubtator.py", line 44, in _iter_documents
    yield self._document(doc_lines, entity_counter)
  File "**/bconv/bconv/fmt/pubtator.py", line 61, in _document
    docid, sections, anno = self._parse(lines, entity_counter)
  File "**/bconv/bconv/fmt/pubtator.py", line 82, in _parse
    anno.append(self._entity(entity_counter, *fields))
TypeError: _entity() takes from 6 to 7 positional arguments but 8 were given

I am using the current version from GitHub.

lfurrer commented 3 years ago

Thanks for the note. The PubTator format is not very strictly defined, so it's possible that some edge cases aren't covered yet. Is this a public dataset, so that I can have a look at what exactly is violating bconv's expectations?

HMJiangGatech commented 3 years ago

Yes, it is. But it seems very noisy, and has a lot of mismatched entities (even after I made bconv compatible).

ftp://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/Peng2016CID/CID.PubTator.txt.zip

lfurrer commented 3 years ago

Okay, many entries have an additional 7th field containing "Dictionary" or "Dictionary-Abb":

$ head -n 4 NCBI-pfizerCDPubMed.PubTator | tail -n 1
10023282    24  29  spasm   Disease D013035 Dictionary

whereas the "specs" only mention 6 fields (with the last one being optional).

I'm not sure what to do about that. At the very least, the error message needs to be better; I'll add a patch for that.
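
One possible workaround, outside of bconv, is to drop the extra column before loading. The following is only a sketch under assumptions: annotation lines are tab-separated, title and abstract lines start with `<docid>|t|` or `<docid>|a|`, and the names `strip_extra_fields` and `clean_pubtator` are hypothetical helpers, not part of bconv. It discards the "Dictionary" / "Dictionary-Abb" information and does nothing about mismatched entity offsets.

```python
import re

# Title/abstract lines look like "<docid>|t|<text>" or "<docid>|a|<text>".
TEXT_LINE = re.compile(r"^\d+\|[ta]\|")
# Assumed layout: docid, start, end, mention, type, optional concept ID.
MAX_FIELDS = 6


def strip_extra_fields(line: str, max_fields: int = MAX_FIELDS) -> str:
    """Truncate a tab-separated annotation line to at most max_fields columns."""
    body = line.rstrip("\n")
    if not body or TEXT_LINE.match(body):
        # Blank separator lines and text lines pass through unchanged.
        return line
    fields = body.split("\t")
    if len(fields) <= max_fields:
        return line
    newline = line[len(body):]  # preserve the original line ending, if any
    return "\t".join(fields[:max_fields]) + newline


def clean_pubtator(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with surplus annotation columns removed."""
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            dst.write(strip_extra_fields(line))
```

The cleaned file could then be passed to `bconv.load(..., fmt='pubtator')` as in the original report; whether the result is usable still depends on the quality of the annotations themselves.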

HMJiangGatech commented 3 years ago

Thanks! I am closing this issue, as I also don't see a proper way to process that file.

lfurrer commented 3 years ago

Thanks. I just pushed a couple of pending commits and bumped the version. The changes include a patch for an improved error message, but no fix for the Pfizer-CTD format.
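
For illustration only, a field-count check that yields a readable message instead of the `TypeError` above might look like the sketch below. This is not the actual patch in bconv; `parse_annotation` is a hypothetical function, and it assumes entity lines are tab-separated with five required fields plus an optional concept ID.

```python
def parse_annotation(line: str, lineno: int) -> tuple:
    """Split a tab-separated entity line, validating the number of fields."""
    fields = line.rstrip("\n").split("\t")
    if not 5 <= len(fields) <= 6:
        # Report the location and the offending line, not an arity mismatch.
        raise ValueError(
            f"line {lineno}: expected 5 or 6 tab-separated fields, "
            f"got {len(fields)}: {line!r}"
        )
    docid, start, end, text, type_ = fields[:5]
    concept_id = fields[5] if len(fields) == 6 else None
    return docid, int(start), int(end), text, type_, concept_id
```

With the line quoted earlier in this thread, such a check would report seven fields together with the line number, which points directly at the unexpected "Dictionary" column.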