jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Citation annotation offset outside passage #9

Closed: jakelever closed this issue 1 year ago

jakelever commented 2 years ago

Hey @creisle, I've come across a citation annotation whose offset is outside the associated passage. One of my scripts runs some checks on BioC files and flagged it. It doesn't seem right to me. What do you think?

Below is an example where the passage offset is 56733, but the zero-length citation is at offset 56732, which is just before the passage starts.

    <passage>
      <infon key="section">floating</infon>
      <infon key="subsection">None</infon>
      <infon key="xml_path">floats-group/table-wrap/table/thead</infon>
      <offset>56733</offset>
      <text> (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)</text>
      <annotation id="ANN_c6f8f533-0764-4484-8663-c18655ca06f3">
        <infon key="citation_text">1</infon>
        <infon key="type">citation</infon>
        <location offset="56732" length="0"/>
        <text/>
      </annotation>
    </passage>
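A check along these lines would flag the annotation above. This is only a minimal sketch, not the script mentioned in this issue: the function name `find_out_of_range_annotations` is hypothetical, and it assumes offsets count characters (not bytes) and that a passage spans `[offset, offset + len(text)]`.

```python
import xml.etree.ElementTree as ET

# The passage from this issue, copied verbatim.
EXAMPLE = """<passage>
  <infon key="section">floating</infon>
  <infon key="subsection">None</infon>
  <infon key="xml_path">floats-group/table-wrap/table/thead</infon>
  <offset>56733</offset>
  <text> (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)</text>
  <annotation id="ANN_c6f8f533-0764-4484-8663-c18655ca06f3">
    <infon key="citation_text">1</infon>
    <infon key="type">citation</infon>
    <location offset="56732" length="0"/>
    <text/>
  </annotation>
</passage>"""


def find_out_of_range_annotations(passage_xml):
    """Return (annotation id, offset, length) for each annotation location
    that is not fully inside the passage span.

    Assumption: offsets are counted in characters, so the passage covers
    [offset, offset + len(text)].
    """
    passage = ET.fromstring(passage_xml)
    start = int(passage.findtext("offset"))
    # findtext("text") only looks at direct children, so this is the
    # passage text, not the (empty) text element inside the annotation.
    end = start + len(passage.findtext("text") or "")

    problems = []
    for ann in passage.iter("annotation"):
        for loc in ann.iter("location"):
            loc_start = int(loc.get("offset"))
            loc_length = int(loc.get("length"))
            if loc_start < start or loc_start + loc_length > end:
                problems.append((ann.get("id"), loc_start, loc_length))
    return problems


if __name__ == "__main__":
    for ann_id, offset, length in find_out_of_range_annotations(EXAMPLE):
        print(f"{ann_id}: offset={offset} length={length} is outside the passage")
```

For the example passage, the single citation annotation is reported because 56732 is less than the passage offset 56733.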

To reproduce, I've included the source PMC XML file (PMC8466798.xml.gz), which I converted with the command below.

    python src/convert.py --i PMC008xxxxxx/PMC8466798.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml

creisle commented 2 years ago

Ya that seems odd. I'll take a look!

creisle commented 1 year ago

Sorry for the long delay, this completely fell off my radar. I've been debugging this, and it looks like it happens when the citation is the first thing in a passage. Since we remove the in-text citation and attribute it to the text that precedes it, the annotation's offset ends up in the preceding text, outside the passage it is attached to. It should be an easy fix, but first I'd like to see why those passages are being split. It seems like maybe they shouldn't be.
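The mechanism described above can be reproduced with a toy model. This is a hedged sketch and not biotext's actual conversion code: the function `build_passages`, the `("text", ...)` / `("cite", ...)` item representation, and the one-character gap between passages are all assumptions made for illustration.

```python
def build_passages(chunks, separator_len=1):
    """Toy model: flatten chunks into passages with document-wide offsets.

    `chunks` is a list of passages, each a list of ("text", str) or
    ("cite", str) items. Citation markers are dropped from the text and
    recorded as zero-length annotations attached to the end of the text
    that precedes them.
    Assumption: consecutive passages are separated by `separator_len`
    characters in the document-wide offset space.
    """
    passages = []
    cursor = 0          # offset of the next character to be emitted
    last_text_end = 0   # offset just past the most recent emitted text
    for items in chunks:
        start = cursor
        text = ""
        annotations = []
        for kind, value in items:
            if kind == "text":
                text += value
                cursor += len(value)
                last_text_end = cursor
            else:
                # The citation is removed and anchored to the preceding text.
                # If nothing has been emitted in this passage yet, that
                # preceding text belongs to the previous passage.
                annotations.append(
                    {"citation_text": value, "offset": last_text_end, "length": 0}
                )
        passages.append({"offset": start, "text": text, "annotations": annotations})
        cursor += separator_len
    return passages


passages = build_passages([
    [("text", "Table 1 caption")],
    [("cite", "1"), ("text", " (µg/mL)")],  # citation is the first item
])
header = passages[1]
print(header["offset"], header["annotations"][0]["offset"])
```

In this model the second passage starts at offset 16 while its citation is anchored at 15, the end of the previous passage, which mirrors the 56733 versus 56732 mismatch reported in this issue.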

creisle commented 1 year ago

OK, this one is really weird... the citation is actually in a strange position in the original text. Would it make sense to adjust the position so the reference sits at the start of the table header passage, or should we just append it to the previous passage, after the table description?

jakelever commented 1 year ago

Wow, what a weird one. Looks like some bug in the publisher's code for converting to PMC XML. It's probably better to work with the data we've got than to try fixes that may only sometimes work. So I guess insert it into the table header? What do you think?

creisle commented 1 year ago

Ya, that's probably the simplest solution.
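One way to express the approach agreed on above (keeping the citation in the table header passage) is to clamp each annotation into its passage's span. This is a rough sketch under assumptions, not the change that was actually made in biotext: the helper name `clamp_to_passage` and the dict-based passage layout are hypothetical, and offsets are assumed to count characters.

```python
def clamp_to_passage(passage):
    """Move annotations that fall outside the passage back inside it.

    `passage` is a dict with "offset", "text" and "annotations" keys
    (a hypothetical layout used only for this sketch). The passage is
    modified in place and also returned for convenience.
    """
    start = passage["offset"]
    end = start + len(passage["text"])
    for ann in passage["annotations"]:
        ann_start = ann["offset"]
        ann_end = ann_start + ann["length"]
        # Clamp both ends of the annotation into [start, end].
        new_start = min(max(ann_start, start), end)
        new_end = min(max(ann_end, new_start), end)
        ann["offset"] = new_start
        ann["length"] = new_end - new_start
    return passage


table_header = {
    "offset": 56733,
    "text": " (µg/mL)    S-2366 K M (mM) V MAX (mAU/min)",
    "annotations": [{"citation_text": "1", "offset": 56732, "length": 0}],
}
clamp_to_passage(table_header)
print(table_header["annotations"][0])
```

With the values from this issue, the zero-length citation moves from 56732 to 56733, the start of the table header passage. Annotations that are already inside the passage are left unchanged.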