Open lfoppiano opened 3 months ago
Good question. Maybe we could ask Tim about sentence-level granularity in general in any section. It does come at some cost. Maybe we should support 2 modes.
I don't know what the significance of splitting into sentences is in the system. I know it doesn't play a role in deliverables to customers (unless it influences the rules - e.g. number of sentences with a specific value). It may be primarily for debugging purposes.
I also found that the acknowledgement is not split into sentences. I'm assuming it is the same case.
Digging deeper, I noticed that the funding statement is correctly split into sentences; however, the sentences are lost when it is passed through the acknowledgment/funding parser:
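    // Build the funding section as TEI (already split into sentences at this point)
    fundingStmt = getSectionAsTEI("funding",
        "\t\t\t",
        doc,
        SegmentationLabels.FUNDING,
        teiFormatter,
        resCitations,
        config);
    if (fundingStmt.length() > 0) {
        // Re-process the serialized fragment with the funding-acknowledgement parser;
        // this is the step where the sentence segmentation is lost
        MutablePair<Element, MutableTriple<List<Funding>,List<Person>,List<Affiliation>>> localResult =
            parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config);
        if (localResult != null && localResult.getLeft() != null) {
            String local_tei = localResult.getLeft().toXML();
            // Drop the TEI namespace declaration added by the XML serialization
            local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", "");
            annexStatements.add(local_tei);
        } else {
            // Fall back to the unprocessed funding statement
            annexStatements.add(fundingStmt.toString());
        }
    }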
Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).
As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation that works directly on the final TEI XML and, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194
One idea would be to move to this simple generic sentence segmentation, instead of extending and complicating the existing one.
(As visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)
> Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).

Understood. It became clearer once I saw the TEIFormatter part related to funding and acknowledgments.
> As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation that works directly on the final TEI XML and, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194
> One idea would be to move to this simple generic sentence segmentation, instead of extending and complicating the existing one.

Sure. At the moment, the current segmentation was just extended to avoid URLs being split between sentences (#1097), because once the offset positions are collected it is just a matter of extending the list of forbidden positions. Since the work to output the URL into the TEI might take some time and substantially more effort, I made two separate PRs (the changes in the segmenter might eventually be reverted in this PR).
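To illustrate the "forbidden positions" idea mentioned above, here is a minimal sketch. It is not the GROBID segmenter: the class `BoundaryFilter`, its `Span` type and the `filter` method are hypothetical names introduced only for this example. It assumes that candidate sentence boundaries and protected spans (for instance URLs) are both expressed as character offsets in the same text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): drops sentence boundaries that cut a protected span. */
final class BoundaryFilter {

    /** Half-open character span [start, end), e.g. the offsets of a URL. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Keeps only the candidate boundaries that do not fall strictly inside a forbidden span.
     * A boundary exactly at the start or end of a span is kept, since it does not cut it.
     */
    static List<Integer> filter(List<Integer> candidates, List<Span> forbidden) {
        List<Integer> kept = new ArrayList<>();
        for (int boundary : candidates) {
            boolean cutsSpan = false;
            for (Span span : forbidden) {
                if (span.start < boundary && boundary < span.end) {
                    cutsSpan = true;
                    break;
                }
            }
            if (!cutsSpan) {
                kept.add(boundary);
            }
        }
        return kept;
    }
}
```

With this shape, supporting a new kind of protected annotation only means adding its offsets to the forbidden list, which is why extending the list is cheap once the offsets are available.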
> (As visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)

I think that with this approach (segmenting after the "final" markup is built) we won't be able to generate coordinates for each sentence, because the layout token information is lost after the transformation to XML.
One solution that comes to mind would be to work on the layout tokens before the TEI transformation, collect all the items in a list, and apply them in order, given that they do not overlap, the same as I did here: https://github.com/kermitt2/grobid/blob/0b5e2321737e6c2f9675f10661832b338b58cf54/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1570
This would require removing any TEI dependency from the funding/acknowledgment parser and handling the transformation to TEI outside the parser, instead of processing the Element/Node XML. I'm planning to cement it with a battery of tests. 😅
@kermitt2 please let me know if you have any comment.
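As a rough illustration of "collect all the items in a list and apply them in order, given that they do not overlap", here is a minimal sketch that works on plain text offsets rather than on layout tokens. It is not the code linked above: `InlineAnnotator`, `Annotation` and `apply` are hypothetical names, and the `<rs type="...">` output only mirrors the examples in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch (not GROBID code): wraps non-overlapping annotations in <rs> elements. */
final class InlineAnnotator {

    /** Annotation over the half-open character range [start, end). */
    static final class Annotation {
        final int start;
        final int end;
        final String type;

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Applies the annotations in offset order; rejects overlapping or out-of-range ones. */
    static String apply(String text, List<Annotation> annotations) {
        List<Annotation> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator.comparingInt((Annotation a) -> a.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Annotation a : sorted) {
            if (a.start < cursor || a.end < a.start || a.end > text.length()) {
                throw new IllegalArgumentException("Overlapping or invalid annotation at " + a.start);
            }
            // Copy the text before the annotation, then the wrapped annotated text
            out.append(escape(text.substring(cursor, a.start)));
            out.append("<rs type=\"").append(a.type).append("\">")
               .append(escape(text.substring(a.start, a.end)))
               .append("</rs>");
            cursor = a.end;
        }
        out.append(escape(text.substring(cursor)));
        return out.toString();
    }

    /** Minimal escaping for text content; the type value is assumed to be a safe label. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because the markup is only produced at the end, the same sorted list could in principle be applied to any output format, which is the point of keeping the parser free of TEI dependencies.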
After a few days trying different solutions, I implemented it by modifying processXMLfragment. In this way the sentences are simply reused and the funding-acknowledgment entities are applied to them, rather than to the text stripped from the paragraph.
This approach also preserves the sentence coordinates and the reference markers, which were previously lost as well.
I've started testing and noticed that in rare (but possible) cases, the sentence segmentation, which is performed before the funding-acknowledgment model, produces sentence boundaries that fall inside funding-acknowledgment annotations.
For example, this is the original version, without sentence segmentation:
<div type="acknowledgement">
<div>
<head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
<p coords="31,191.82,493.44,347.12,9.57;31,72.00,522.72,81.26,9.57">We thank
<rs type="person">Drs. Carsten Korth</rs> and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</p>
</div>
</div>
Here the first sentence ends inside the annotation "Drs. Carsten Korth", so that annotation is lost:
<div type="acknowledgement">
<div>
<head>Acknowledgments:</head>
<p>
<s>We thank Drs.</s>
<s>Carsten Korth and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</s>
</p>
</div>
</div>
I then worked out a solution that allows merging and updating the sentences that are in this situation, including their coordinates.
Here is the result:
<div type="acknowledgement">
<div>
<head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
<p>
<s coords="31,191.82,493.44,63.87,9.57;31,258.46,493.44,280.48,9.57;31,72.00,522.72,81.26,9.57">We thank
<rs type="person">Drs.Carsten Korth</rs> and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</s>
</p>
</div>
</div>
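The merge step described above can be sketched roughly as follows. This is not the actual GROBID implementation: `SentenceMerger`, `Sentence` and `Span` are hypothetical names, sentences are assumed to be sorted by offset, and coordinates are assumed to be a simple list of box strings that gets concatenated when two sentences are merged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not GROBID code): merges sentences split inside an annotation. */
final class SentenceMerger {

    /** Half-open character range [start, end) of an annotation. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Sentence offsets plus its coordinate boxes, e.g. "31,191.82,493.44,63.87,9.57". */
    static final class Sentence {
        final int start;
        final int end;
        final List<String> coords;

        Sentence(int start, int end, List<String> coords) {
            this.start = start;
            this.end = end;
            this.coords = coords;
        }
    }

    /** Merges consecutive sentences whenever an annotation crosses the boundary between them. */
    static List<Sentence> merge(List<Sentence> sentences, List<Span> annotations) {
        List<Sentence> merged = new ArrayList<>();
        for (Sentence next : sentences) {
            if (!merged.isEmpty()) {
                Sentence last = merged.get(merged.size() - 1);
                if (crossesBoundary(last.end, next.start, annotations)) {
                    // Extend the previous sentence and append the coordinate boxes
                    List<String> coords = new ArrayList<>(last.coords);
                    coords.addAll(next.coords);
                    merged.set(merged.size() - 1, new Sentence(last.start, next.end, coords));
                    continue;
                }
            }
            merged.add(next);
        }
        return merged;
    }

    /** True if some annotation starts before the left sentence ends and ends after the right one starts. */
    private static boolean crossesBoundary(int leftEnd, int rightStart, List<Span> annotations) {
        for (Span a : annotations) {
            if (a.start < leftEnd && a.end > rightStart) {
                return true;
            }
        }
        return false;
    }
}
```

Since each merged sentence is compared again with the following one, an annotation that spans more than two sentences is also handled by this loop.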
I've noticed that while the data availability statement is split into sentences, the funding statement is not. Is this by design, or should it be implemented?
Example:
energies-14-08509.pdf