Open ireneisdoomed opened 2 years ago
I think I know the source of it. It looks like the reference section. I will still check in the morning; if it does come from the reference region, I have already modified the code to exclude references.
@saha-shyamasree I believe this is a different problem than the insertion of references #2570
This is the same piece of text reported in the September 21 submission:
This mechanism plays an important role in the early phase of acute HBV infection because downregulation of HBV replication precedes massive infiltration of CD8+ T cells and manifestation of liver disease 7., Conceivably, therefore, virus-specific T cells may persist not only in the peripheral blood, but also in the liver, as recently reported for intrahepatic hepatitis C virus (HCV)-specific CD8+ T cells after recovery from acute, self-limited HCV infection 14., Individuals with acute, self-limited HBV infection characteristically mount a vigorous, polyclonal, and multispecific Th and CTL response to epitopes within the HBV envelope (HBe), nucleocapsid, and polymerase proteins that is readily detectable in the peripheral blood., Thus, the results that Maini et al. obtained in persistently infected, HBeAg− patients with low viral load 6, and additional studies by other investigators 7 8 are leading to a new appreciation of the function of HBV-specific CD8+ T cells favoring a protective rather than a pathogenic role in HBV infection.
As you can see, the text is composed of sentences from parts 1, 3, and 4 as described above. So this issue was already there prior to the changes introduced in Jan 22.
The JSON file I have contains the following. As you can see, those sentences are separate, so I am rather confused about how you are ending up with that sentence.
"matches": [{"label": "Igf2", "type": "GP", "startInSentence": 53, "endInSentence": 57}]}, {"text": "This is of interest in relation to the loss of imprinting of IGF2 that occurs in the human genetic disorder Beckwith Wiedemann syndrome (BWS), which is associated with fetal overgrowth and predisposition to childhood tumors.", "section": "Other", "matches": [{"label": "IGF2", "type": "GP", "startInSentence": 61, "endInSentence": 65}, {"label": "genetic disorder", "type": "DS", "startInSentence": 91, "endInSentence": 107}, {"label": "Beckwith Wiedemann syndrome", "type": "DS", "startInSentence": 108, "endInSentence": 135}, {"label": "BWS", "type": "DS", "startInSentence": 137, "endInSentence": 140}, {"label": "childhood tumors", "type": "DS", "startInSentence": 207, "endInSentence": 223}], "co-occurrence": [{"start1": 61, "end1": 65, "label1": "IGF2", "start2": 91, "end2": 107, "label2": "genetic disorder", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 108, "end2": 135, "label2": "Beckwith Wiedemann syndrome", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 137, "end2": 140, "label2": "BWS", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}, {"start1": 61, "end1": 65, "label1": "IGF2", "start2": 207, "end2": 223, "label2": "childhood tumors", "type": "GP-DS", "sentEvidenceScore": 1, "association": 0}]}, {"text": "Enlargement of the tongue is the most consistent feature of BWS, a feature that might correspond to the strong reactivation of Igf2, a potent growth factor, in the mouse tongue following deletion of the muscle-specific silencer.", "section": "Other",
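To check this on both sides, one option is to test whether a reported text contains several of the separate sentence records from the JSON. The snippet below is only a minimal sketch: it assumes the records have already been loaded into a list of dicts with the "text" and "section" keys shown above, and the function name is hypothetical, not part of the pipeline.

```python
def find_constituent_sentences(reported_text, sentences):
    """Return (offset, index, section) for every sentence record whose text
    appears verbatim inside reported_text, ordered by offset."""
    hits = []
    for index, record in enumerate(sentences):
        # str.find returns -1 when the sentence is not contained in the text
        offset = reported_text.find(record["text"])
        if offset != -1:
            hits.append((offset, index, record.get("section")))
    return sorted(hits)
```

If more than one hit comes back for a single reported text, that text spans several sentence records, and the indices show whether the records were adjacent in the JSON or came from different parts of the article.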
@ireneisdoomed has this been resolved?
Describe the bug
The text provided by the EPMC pipeline is made up of sentences from different parts of the articles, which means that the resulting text does not correspond to the original.
Observed behaviour
EPMC's pipeline should isolate individual sentences while staying faithful to the original text. Instead, it is concatenating sentences from different parts of the text. We haven't measured the extent of this bug, but we believe it could be widespread.
An example is PMID 11178262. This is the reported text in which cooccurrences are tagged:
This piece of text, analysed as a whole in the pipeline, consists of chunks of text from different parts:
Silencers.
Phenotypes section.
Phenotypes section.
Expected behaviour
EPMC's pipeline should only analyse text made up of sentences in the order in which they appear in the manuscript.
@saha-shyamasree We need to investigate whether these Frankensteinian sentences are the result of the sentencizer algorithm or whether the problem comes from the source XML files.
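One rough way to size the problem before digging into the sentencizer or the XML: in the example text reported in this thread, the fragments are joined by a full stop followed by a comma (for instance "liver disease 7., Conceivably"). The sketch below splits a text on that pattern so suspect rows can be counted. It is a heuristic under that assumption only; the pattern and function name are hypothetical, and it can both miss joins and flag legitimate text.

```python
import re

# A comma that directly follows a full stop and precedes a capitalised word,
# skipping the common "et al., Name" case. Heuristic, not a definitive rule.
JOIN_PATTERN = re.compile(r"(?<!et al\.)(?<=\.),\s+(?=[A-Z])")


def split_suspect_joins(text):
    """Split text at suspected sentence joins; one piece means no join found."""
    return JOIN_PATTERN.split(text)
```

Counting the rows where this returns more than one piece would give a first estimate of how widespread the concatenation is, which can then be checked by hand against the source articles.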
Additional context
I've used the latest cooccurrences dataset (gs://open-targets-pre-data-releases/22.04/output/literature/parquet/cooccurrences), which is the result of grounding EPMC's latest submission (gs://otar025-epmc/22.01).