kermitt2 closed 3 years ago
error case via https://github.com/DataSeer/dataseer-web/issues/441
It seems that the PDF is not reachable 😺
Sorry poor internet connection :(
Normally the text to be segmented includes the references (all text including descendant elements):
and we only keep track of the positions of the references to pass the "forbidden positions" to the segmenter:
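As a rough illustration of the "forbidden positions" idea (this is not the GROBID code; the function name, the regex-based splitter, and the sample text are assumptions made for the example), a segmenter can simply skip any candidate sentence boundary that falls inside a reference span:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def split_sentences(text: str, forbidden: List[Span]) -> List[Span]:
    """Split after '.', '!' or '?' followed by whitespace, ignoring any
    candidate boundary that falls inside a forbidden span."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        boundary = match.start() + 1  # offset just after the punctuation
        if any(f_start < boundary <= f_end for f_start, f_end in forbidden):
            continue  # boundary lies inside a reference callout: skip it
        sentences.append((start, boundary))
        start = match.end()
    if start < len(text):
        sentences.append((start, len(text)))  # trailing sentence
    return sentences


# Hypothetical sample: the reference callout is the forbidden span.
text = "It was described (Safran et al. 2006). The marker is stable."
ref_start = text.index("(Safran")
ref_end = text.index("2006") + len("2006")
forbidden = [(ref_start, ref_end)]

for s, e in split_sentences(text, forbidden):
    print(repr(text[s:e]))
```

Without the forbidden span, the "al." inside the callout would be taken as a sentence end and the reference would be cut in two.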
Up to that step everything seems to work fine: the texts of the sentences look good.
The problem probably arises when we try to group the LayoutToken objects corresponding to each sentence in segmentedParagraphTokens. The segmented text comes from the XML and is a bit different from the text of the LayoutToken objects (de-hyphenation, some spaces removed), so the alignment can be challenging.
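To make the alignment difficulty concrete, here is a minimal sketch of one possible strategy, not the one GROBID actually uses: compare both sides after dropping whitespace and hyphens, and hand tokens to each sentence until its normalized length is consumed. The token list, the helper names `normalize` and `group_tokens`, and the sample sentences are all hypothetical.

```python
from typing import List


def normalize(s: str) -> str:
    """Drop whitespace and hyphens so the XML text and the token text
    can be compared despite de-hyphenation and removed spaces."""
    return "".join(ch for ch in s if not ch.isspace() and ch != "-")


def group_tokens(tokens: List[str], sentences: List[str]) -> List[List[str]]:
    """Assign consecutive tokens to each sentence by consuming as many
    normalized characters as the sentence contains."""
    groups = []
    pos = 0
    for sentence in sentences:
        remaining = len(normalize(sentence))
        group = []
        while pos < len(tokens) and remaining > 0:
            token = tokens[pos]
            remaining -= len(normalize(token))
            group.append(token)
            pos += 1
        groups.append(group)
    return groups


# Layout tokens keep the line-break hyphen; the XML sentence does not.
tokens = ["To", " ", "this", " ", "end", ",", " ", "we", " ", "used", " ",
          "ODD", "-", "\n", "luciferase", ".", " ", "These", " ", "cells", "."]
sentences = ["To this end, we used ODDluciferase.", "These cells."]

for group in group_tokens(tokens, sentences):
    print(repr("".join(group)))
```

This sketch assumes that no token straddles a sentence boundary and that the two texts differ only in whitespace and hyphens. Any other difference, such as a character present on one side only, shifts every following group, which is the kind of misalignment that produces incomplete sentence coordinates.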
PR #821 fixes the problem, which was due to a leftover in the reference pattern (year pattern) missing in the XML.
All the coordinates for sentence elements now look good:
<div xmlns="http://www.tei-c.org/ns/1.0">
<head coords="23,54.00,212.69,163.38,11.14">ODD-luciferase activity assaay</head>
<p>
<s coords="23,54.00,240.29,504.00,11.14;23,54.00,267.89,158.47,11.14">The ODD-luciferase construct with pcDNA3.1 plasmid vector was constructed as previously described
<ref type="bibr" coords="23,108.63,267.89,92.50,11.14" target="#b42">(Safran et al. 2006</ref>).
</s>
<s coords="23,215.06,267.89,342.94,11.14;23,54.00,295.49,504.01,11.14;23,54.00,323.09,222.14,11.14">The proline p402 and p564 present within the oxygen degradation domain (ODD) of HIF1α, when hydroxylated by HIF-PHDs, allow its binding to the VHL protein that target it for proteasomal degradation.</s>
<s coords="23,279.87,323.09,278.13,11.14;23,54.00,350.69,366.05,11.14">In this way, the stabilization of ODD can be used as a marker of HIF1α stability
<ref type="bibr" coords="23,197.48,350.69,95.96,11.14" target="#b42">(Safran et al. 2006</ref>
<ref type="bibr" coords="23,293.44,350.69,120.93,11.14" target="#b48">, Smirnova et al. 2010)</ref>.
</s>
<s coords="23,423.35,350.69,134.65,11.14;23,54.00,378.29,504.01,11.14;23,54.00,405.89,504.01,11.14;23,54.00,433.49,185.14,11.14">Because of the luciferase tagged with ODD, the increase in ODD stability leads to a proportional increase in the luciferase activity and this provides a very good way of measuring the HIF1α stability in a quantitative manner with a wide dynamic range.</s>
<s coords="23,241.74,433.49,316.26,11.14;23,54.00,461.09,54.70,11.14">To this end, we used SH-SY5Y cells stably expressing ODDluciferase.</s>
<s coords="23,114.49,461.09,443.52,11.14;23,54.00,488.69,504.01,11.14;23,54.00,516.29,245.80,11.14">These cells were made by co-transfecting ODD-luciferase plasmid along with a puromycin resistance plasmid in SH-SY5Y cells and stably transfected cells were positively selected in presence of 4μg/ml of puromycin.</s>
<s coords="23,304.37,516.29,253.64,11.14;23,54.00,543.89,205.57,11.14;23,259.57,542.76,11.58,7.46;23,276.88,543.89,244.71,11.14">Luciferase activity was measured by luciferase assay kit (promega) using an LMaxII TM microplate luminometer (molecular Devices).</s>
<s coords="23,527.35,543.89,30.65,11.14;23,54.00,571.49,492.45,11.14;23,546.45,570.36,11.58,7.46;23,54.00,599.09,90.05,11.14">ODDluciferase activity was normalized to the protein content of each well measured by Bio-Rad DC TM protein assay kit.</s>
</p>
</div>
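To check such output programmatically, the `coords` attributes above can be split into individual bounding boxes. The sketch below assumes each box is serialized as `page,x,y,width,height` and that boxes are separated by `;`, which matches the values shown here; the `Box` type and `parse_coords` helper are names made up for this example.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Parse a coords attribute into one Box per ';'-separated entry."""
    boxes = []
    for part in coords.split(";"):
        page, x, y, width, height = part.split(",")
        boxes.append(Box(int(page), float(x), float(y),
                         float(width), float(height)))
    return boxes


# coords of the sentence "To this end, we used SH-SY5Y cells ..." above
coords = "23,241.74,433.49,316.26,11.14;23,54.00,461.09,54.70,11.14"
boxes = parse_coords(coords)
for box in boxes:
    print(box)
```

A sentence that wraps over two lines should yield two boxes, as here, so comparing the number of boxes with the number of lines a sentence spans is a quick way to spot incomplete coordinates.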
Also good for https://github.com/DataSeer/dataseer-web/issues/461
For this example (preprint):
[Uploading document_sentence_segmentation_issues.pdf…]()
we have some incomplete bounding boxes in the sentence-level coordinates; see the last 5 sentences of this paragraph: