lfurrer / bconv

Python library for converting between BioNLP formats
MIT License

pubtator bioc json error - fix bioc json reader? or add new format "pubtator bioc json"? #5

Open joelduerksen opened 3 years ago

joelduerksen commented 3 years ago

I'm attempting to use bconv to convert BioC JSON to pubtator/TXT, but it throws an error (during span validation?). At a glance the format appears compliant, but maybe we need a new format called "pubtator bioc json"?

Files I'm attempting to convert can be found here

ftp://ftp.ncbi.nlm.nih.gov/pub/lu/CORD19/cord19-pubtator.json.tar

first few lines from output/1.json seem to align with the BioC json format.


{ "source": "PubTator", "date": "", "key": "BioC.key", "infons": {}, "documents": [ { "id": "xqhn0vbp", "infons": {}, "passages": [ { "offset": 0, "infons": { .....

lfurrer commented 3 years ago

Hi Joel, is this PubTator Central? Their BioC JSON looks funny. AFAIK there are no specs for BioC JSON besides the converter code by Don Comeau (https://github.com/ncbi-nlp/BioC-JSON), which I've been sticking to. If they provide BioC XML, you should give that a try; it looked fine when I last checked.

lfurrer commented 3 years ago

... unless this is a simple offset problem that can be fixed with the byte_offsets option (https://github.com/lfurrer/bconv/wiki/BioC#options-1); have you checked that?

joelduerksen commented 3 years ago

Hi Lenz, yes, I believe these are generated directly by PubTator (and yes, most likely Central). More on that at https://github.com/ncbi-nlp/PubTator-Covid19/ where they say "Pubtator annotations are provided for six entity types (gene/protein, drug/chemical, disease, cell type, species and genomic variants) in two formats (BioC JSON and BioC XML)."

joelduerksen commented 3 years ago

Hi Lenz,

I have not looked into that option. Having mostly used the simple pubtator/TXT format, I'm not familiar with the inner workings of the JSON/XML format(s). If you have any hints on how that option could help, let me know.

Here are example full errors I see

JSON

>>> with open('/home/plastic/d2/downloads_other/cord19/output.json/1.json', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_json')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
    return _load(loader, mode, source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
    content = loader.load_one(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
    return self.collection(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 62, in collection
    collection.add_document(self._document(doc))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 79, in _document
    doc.add_section(sec_type, text, offset, anno)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 426, in add_section
    section = Section(section_type, text, self, offset, entities)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 295, in __init__
    self.add_entities(entities)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 154, in add_entities
    sent.add_entities((entity,))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 194, in add_entities
    self._validate_spans(entity)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 211, in _validate_spans
    assert extracted[0] == entity.text, _mismatch()
AssertionError: entity mention mismatch: rhinovirus vs. [', rhinovir']

I tried xml as well,

XML

>>> with open('/home/plastic/d2/downloads_other/cord19/output/1.xml', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_xml')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
    return _load(loader, mode, source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
    content = loader.load_one(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
    return self.collection(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 58, in collection
    coll_node, docs = self._parse_collection(source)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 158, in _parse_collection
    first, docs = peek(self._iterparse(source))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/util/iterate.py", line 36, in peek
    first = next(iterator)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 170, in _iterparse
    for _, node in etree.iterparse(source, tag='document'):
  File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 222, in lxml.etree.iterparse._read_more_events
TypeError: reading file objects must return bytes objects

I'm guessing we might need two new formats, pubtator/json and pubtator/xml, since you said it looked weird inside? The CORD-19 site publishes regular updates but does not provide a pubtator/TXT download, hence the desire to convert.

lfurrer commented 3 years ago

I'm having a look at the files right now. It seems that setting byte_offsets=False helps, but there are cases where it still breaks. Concerning the error you see for XML: You need to pass a binary file handle here. XML is, technically, a binary format, not plain text (at least that's what lxml's author claims).

lfurrer commented 3 years ago

Here's my quick analysis of the problem: We definitely don't need a new format; the documents appear to be well-formed (my above suspicion about "funny" BioC-JSON does not apply). Rather, there is a mismatch in the interpretation of the BioC specs between bconv and PubTator. Also, the data contain some errors.

First, as I said before, you should turn off the byte_offsets option. The BioC specs dictate that offsets are calculated in bytes, but many disregard this detail and simply count Unicode codepoints, which arguably makes more sense, and that's why there's an option for this in bconv.
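
To make the byte/codepoint difference concrete, here is a minimal sketch in plain Python. It is not bconv's code: the helper names and the sample sentence are made up for illustration, and UTF-8 is assumed as the encoding.

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Return the UTF-8 byte offset of the character at codepoint offset cp_offset."""
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Return the codepoint offset that corresponds to a UTF-8 byte offset.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


passage = "TNF-α and IL-1ß levels rose."  # made-up sentence
mention = "IL-1ß"

cp_start = passage.index(mention)                         # counted in codepoints
byte_start = codepoint_to_byte_offset(passage, cp_start)  # counted in UTF-8 bytes

# A reader that expects byte offsets but is handed a codepoint offset
# converts it "back" and lands too far to the left.
shifted = byte_to_codepoint_offset(passage, cp_start)
wrong = passage[shifted:shifted + len(mention)]
right = passage[cp_start:cp_start + len(mention)]
```

The two offsets differ here because "α" occupies two bytes in UTF-8. Every non-ASCII character before a mention widens the gap, which is the same kind of left shift as in the "rhinovirus" vs. ", rhinovir" mismatch from the traceback.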

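Second, bconv is pretty strict in its span validation, because that's how you notice that you should be turning the byte_offsets option on or off. The BioC annotations have a text and a location field, which is a bit redundant, so we can use it to do a sanity check by looking up the substring and comparing it to the text value. Now it turns out that PubTator is doing some normalisation and stores the normalised version in text rather than the original one, so bconv barks at you. Examples:

  • 3321.json: PubTator: "SARS CoV 2", original: "SARS‐CoV‐2"
  • 2025.json: PubTator: "TNF-a", original: "TNF-α"
  • 3097.json: PubTator: "IL-1b", original: "IL-1ß" (with German sharp-s for beta 🙈)
  • 785.json: PubTator: "off-camp", original: "analysis" (I'm not sure if this is a valid synonym or an error)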

The BioC DTD says that the text field of annotations is "Typically the annotated text", so bconv's interpretation is possibly a bit too strict. I could add an option to skip validation, so these cases would pass, but then actual errors wouldn't be detected either.
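
The trade-off between strict and lenient validation can be sketched roughly as follows. This is an illustration only, not how bconv implements `_validate_spans`; the function name, the `strict` flag and the sample sentence are assumptions made for this example.

```python
def find_mismatches(passage_text, annotations, strict=False):
    """Compare each annotation's text with the substring at its location.

    annotations: iterable of (start, length, text) tuples, with start counted
    in codepoints relative to the beginning of passage_text.
    In strict mode the first mismatch raises; otherwise mismatches are collected.
    """
    mismatches = []
    for start, length, text in annotations:
        extracted = passage_text[start:start + length]
        if extracted != text:
            if strict:
                raise ValueError(
                    f"entity mention mismatch: {text!r} vs. {extracted!r}")
            mismatches.append((start, text, extracted))
    return mismatches


passage = "Elevated TNF-α was observed."  # made-up sentence
annotations = [
    (0, 8, "Elevated"),
    (9, 5, "TNF-a"),  # normalised text, as in the 2025.json example
]
report = find_mismatches(passage, annotations)
```

Collecting mismatches instead of raising lets normalised mentions such as "TNF-a" pass while still producing a report that can be scanned for real offset errors.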

Third, there are errors in the data. A questionable case is the "off-camp"/"analysis" one above. A clear instance is the following: 1992.json contains two occurrences of "mefloquine", the second of which (at offset 20015) is annotated twice: once with the correct location and once with offset 18070 (the first occurrence), which is outside the paragraph at which it is anchored (starting at offset 19247). The same pattern can be seen for "fatty acid" in 2952.json. It seems like both cases appear in duplicate paragraphs or documents, which might be responsible for the spurious annotations.
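
A check for annotations that fall outside their paragraph could look like the sketch below. The offsets 19247, 20015 and 18070 are the ones quoted above for 1992.json; the paragraph length is a made-up placeholder, and the function name is hypothetical.

```python
def outside_passage(passage_offset, passage_length, annotations):
    """Return the (start, length) pairs that do not lie fully inside the passage.

    All offsets are document-wide, counted in the same unit as passage_offset.
    """
    passage_end = passage_offset + passage_length
    return [
        (start, length)
        for start, length in annotations
        if start < passage_offset or start + length > passage_end
    ]


passage_offset = 19247   # paragraph start, from the 1992.json example
passage_length = 1000    # placeholder, the real length is not given here
annotations = [
    (20015, 10),  # "mefloquine", second occurrence
    (18070, 10),  # "mefloquine", first occurrence, anchored at the wrong paragraph
]
flagged = outside_passage(passage_offset, passage_length, annotations)
```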

In conclusion, I'm not so sure what to do. I'm not convinced that all of these problems should be fixed at bconv's end. You may want to reach out to the authors of CORD-19-PubTator. Chances are they want to fix problems like the last one in their pipeline.

joelduerksen commented 3 years ago

I believe these files are generated by the creators of the pubtator/txt, pubtator/json, and pubtator/xml formats, so it would be an interesting discussion to argue that they are creating their own formats/files wrong: ftp://ftp.ncbi.nlm.nih.gov/pub/lu/

joelduerksen commented 3 years ago

I can't make sense of these offsets (they do seem to be correct for the few I checked in the title field but every entry I checked in the text wasn't at that offset), I guess it is some kind of programmatic approach that can't be checked in a text viewer. (e.g. vi)
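
One likely reason the offsets cannot be checked in a text viewer: BioC offsets are positions within the document text (each passage carries its own start offset), not positions in the JSON file. A rough sketch for looking up what each annotation points at, assuming the usual BioC JSON layout ("documents", "passages", "annotations", "locations") and codepoint offsets; the tiny sample collection is invented for illustration.

```python
def iter_mentions(collection):
    """Yield (doc_id, annotated_text, extracted_text) for every annotation location.

    extracted_text is None when the location starts outside its passage.
    """
    for doc in collection["documents"]:
        for passage in doc["passages"]:
            base = passage["offset"]
            text = passage.get("text", "")
            for anno in passage.get("annotations", []):
                for loc in anno["locations"]:
                    local = loc["offset"] - base  # position within this passage
                    if 0 <= local <= len(text):
                        extracted = text[local:local + loc["length"]]
                    else:
                        extracted = None
                    yield doc["id"], anno["text"], extracted


# Invented two-passage document; only the field names follow BioC JSON.
sample = {
    "documents": [{
        "id": "xqhn0vbp",
        "passages": [
            {"offset": 0, "text": "Human rhinovirus infections",
             "annotations": [{"text": "rhinovirus",
                              "locations": [{"offset": 6, "length": 10}]}]},
            {"offset": 28, "text": "We studied rhinovirus in children.",
             "annotations": [{"text": "rhinovirus",
                              "locations": [{"offset": 39, "length": 10}]}]},
        ],
    }]
}
mentions = list(iter_mentions(sample))
```

For a real file, the collection would come from `json.load` on the opened file; rows where the second and third fields differ are the ones worth inspecting.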

However, the problems run deeper, and I'll probably write to them about that first: comparing these files with the output of their own service shows that the files are missing annotation content as well. https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=19672853 https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=19672853

I wrote them about issues with another dataset, and while they didn't respond, it was fixed in the next update. (coincidence, maybe, but...)
