EPMC text does not correspond to the original publication

ireneisdoomed commented 2 years ago

Describe the bug

The text provided by the EPMC pipeline is made up of sentences from different parts of the articles, which means that the resulting text does not correspond to the original.

Observed behaviour

EPMC's pipeline should isolate individual sentences while staying faithful to the original text. Instead, it concatenates sentences from different parts of the text. We haven't measured the extent of this bug, but we believe it could be widespread.

An example is PMID 11178262. This is the reported text where cooccurrences are tagged:

This is of interest in relation to the loss of imprinting of IGF2 that occurs in the human genetic disorder Beckwith Wiedemann syndrome (BWS), which is associated with fetal overgrowth and predisposition to childhood tumors., Enlargement of the tongue is the most consistent feature of BWS, a feature that might correspond to the strong reactivation of Igf2, a potent growth factor, in the mouse tongue following deletion of the muscle-specific silencer.,
Brown Brown KW KW Villar Villar AJ AJ Bickmore Bickmore W W Clayton-Smith Clayton-Smith J J Catchpoole Catchpoole D D Maher Maher ER ER Reik Reik W W Imprinting mutation in the Beckwith-Wiedemann syndrome leads to biallelic  IGF2 IGF2  expression through an  H19 H19  independent pathway., The situation in mice may be pertinent to that in humans, where loss of imprinting of IGF2 can occur without altering the imprinting of H19 in hepatoblastoma and in many patients with BWS [20,21,22]., The Igf2 silencers identified in mice add to the ever-increasing number of elements controlling the imprinting of Igf2; they provide additional targets for mutations that can lead to disruption of imprinting, and to diseases including cancer., Joyce Joyce JA JA Lam Lam WK WK Catchpoole Catchpoole DJ DJ Jenks Jenks P P Reik Reik W W Maher Maher ER ER Schofield Schofield PN PN Imprinting of  IGF2 IGF2  and  H19 H19 : lack of reciprocity in sporadic Beckwith-Wiedemann syndrome.

This piece of text, which the pipeline analyses as a single unit, consists of chunks from different parts of the article:

"This is of interest in relation to ... following deletion of the muscle-specific silencer." -> In the article, this is the final sentence of the section Silencers.
"Brown Brown KW ... independent pathway" -> In the article, this is the citation of reference # 21. More details in #2570
"The situation in mice may be pertinent ... and in many patients with BWS [20,21,22]" -> In the article, this is the third last sentence of the Phenotypes section.
"The Igf2 silencers identified in mice add ... and to diseases including cancer." -> In the article, this is the last sentence of the Phenotypes section.
"Joyce Joyce JA JA ... Beckwith-Wiedemann syndrome." -> In the article, this is the citation of reference # 22. More details in #2570
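
For anyone who wants to reproduce the breakdown above, here is a minimal Python sketch that splits a reported text back into its pieces. It assumes the pieces are joined by "., " (period, comma, space), which is only what the example above suggests and not confirmed pipeline behaviour. The function name `split_concatenated` is hypothetical.

```python
from typing import List


def split_concatenated(text: str, sep: str = "., ") -> List[str]:
    """Split a reported text into its candidate sentences.

    Assumption: sentences were joined with "., ", as seen in the example
    for PMID 11178262. A sentence that itself contains "., " would be
    split incorrectly, so treat the output as candidates only.
    """
    # Drop a dangling separator comma at the very end, if present.
    cleaned = text.strip().rstrip(",")
    parts = cleaned.split(sep)
    sentences = []
    for i, part in enumerate(parts):
        part = part.strip()
        if not part:
            continue
        # Every piece except the last lost its final period to the separator.
        if i < len(parts) - 1:
            part += "."
        sentences.append(part)
    return sentences


if __name__ == "__main__":
    reported = "First sentence., Second sentence., Third sentence."
    for piece in split_concatenated(reported):
        print(piece)
```

Each piece can then be searched for in the article to see which section it comes from.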

Expected behaviour

EPMC's pipeline should only analyse text made up of sentences in the order in which they appear in the manuscript.

@saha-shyamasree We need to investigate whether these Frankensteinian sentences are the result of the sentencizer algorithm or whether the problem comes from the source XML files.

Additional context

I've used the latest cooccurrences dataset (gs://open-targets-pre-data-releases/22.04/output/literature/parquet/cooccurrences), which is the result of grounding EPMC's latest submission (gs://otar025-epmc/22.01).

saha-shyamasree commented 2 years ago

I think I know the source of it. It looks like the reference section. I will still check in the morning; if it does come from the reference section, I have already modified the code to exclude references.

ireneisdoomed commented 2 years ago

@saha-shyamasree I believe this is a different problem from the insertion of references (#2570).

This is the same piece of text reported in the September 21 submission:

This mechanism plays an important role in the early phase of acute HBV infection because downregulation of HBV replication precedes massive infiltration of CD8+ T cells and manifestation of liver disease 7., Conceivably, therefore, virus-specific T cells may persist not only in the peripheral blood, but also in the liver, as recently reported for intrahepatic hepatitis C virus (HCV)-specific CD8+ T cells after recovery from acute, self-limited HCV infection 14., Individuals with acute, self-limited HBV infection characteristically mount a vigorous, polyclonal, and multispecific Th and CTL response to epitopes within the HBV envelope (HBe), nucleocapsid, and polymerase proteins that is readily detectable in the peripheral blood., Thus, the results that Maini et al. obtained in persistently infected, HBeAg− patients with low viral load 6, and additional studies by other investigators 7 8 are leading to a new appreciation of the function of HBV-specific CD8+ T cells favoring a protective rather than a pathogenic role in HBV infection.

As you can see, the text is composed of sentences from parts 1, 3, and 4 as described above. So this issue was already there prior to the changes introduced in Jan 22.

saha-shyamasree commented 2 years ago

The JSON file I have contains the following. As you can see, those sentences are separate, so I am rather confused about how you are ending up with that sentence.

"matches": [{"label": "Igf2", "type": "GP", "startInSentence": 53, "endInSentence": 57}]}, {"text": "This is of interest in relation to the loss of imprinting of IGF2 that occurs in the human genetic disorder Beckwith Wiedemann syndrome (BWS), which is associated with fetal overgrowth and predisposition to childhood tumors.", "section": "Other", "matches": [{"label": "IGF2", "type": "GP", "startInSentence": 61, "endInSentence": 65}, {"label": "genetic disorder", "type": "DS", "startInSentence": 91, "endInSentence": 107}, {"label": "Beckwith Wiedemann syndrome", "type": "DS", "startInSentence": 108, "endInSentence": 135}, {"label": "BWS", "type": "DS", "startInSentence": 137, "endInSentence": 140}, {"label": "childhood tumors", "type": "DS", "startInSentence": 207, "endInSentence": 223}], "co-occurrence": [{"start1": 61, "end1": 65, "label1": "IGF2", "start2": 91, "end2": 107, "label2": "genetic disorder", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 108, "end2": 135, "label2": "Beckwith Wiedemann syndrome", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 137, "end2": 140, "label2": "BWS", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 207, "end2": 223, "label2": "childhood tumors", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}]}, {"text": "Enlargement of the tongue is the most consistent feature of BWS, a feature that might correspond to the strong reactivation of Igf2, a potent growth factor, in the mouse tongue following deletion of the muscle-specific silencer.", "section": "Other",
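
One way to narrow down where the concatenation happens is to look up each piece of a reported text in the per-sentence JSON and check whether the pieces sit next to each other. The Python sketch below is only an illustration: it assumes the sentences of one article are available as a list of objects with a "text" field (as in the fragment above), and the helper names `locate_pieces` and `is_contiguous` are hypothetical.

```python
from typing import Dict, List, Optional


def locate_pieces(sentences: List[dict], pieces: List[str]) -> List[Optional[int]]:
    """Return the position of each piece in the article's sentence list.

    A piece that is not found as a standalone sentence maps to None.
    If the same sentence text occurs twice, the first position is used.
    """
    index: Dict[str, int] = {}
    for i, sentence in enumerate(sentences):
        index.setdefault(sentence["text"].strip(), i)
    return [index.get(piece.strip()) for piece in pieces]


def is_contiguous(positions: List[Optional[int]]) -> bool:
    """True only if every piece was found and the positions are consecutive."""
    if any(p is None for p in positions):
        return False
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


if __name__ == "__main__":
    # Toy data standing in for the real per-article JSON.
    article = [
        {"text": "Sentence one.", "section": "Other"},
        {"text": "Sentence two.", "section": "Other"},
        {"text": "A reference entry.", "section": "Other"},
        {"text": "Sentence three.", "section": "Other"},
    ]
    positions = locate_pieces(article, ["Sentence one.", "Sentence three."])
    print(positions, is_contiguous(positions))
```

If every piece is found as a separate sentence but the positions are not consecutive, that would suggest the sentences are joined after this JSON is produced rather than by the sentence splitter.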

prashantuniyal02 commented 1 year ago

@ireneisdoomed has this been resolved?
