Funding, acknowledgement statements are not split into sentences

lfoppiano commented 3 months ago

I've noticed that while the data availability statement is split into sentences, the funding statement is not. Is this by design, or should it be implemented?

Example:

        </body>
        <back>

            <div type="funding">
<div xml:id="_ERHBmGS"><p xml:id="_WYZCd2J">Funding: This work was supported by the <rs type="funder">National Natural Science Foundation of China</rs> (<rs type="grantNumber">51561009</rs>), the <rs type="funder">Natural Science Foundation of Jiangxi Province</rs> (<rs type="grantNumber">20192BAB206004</rs> and <rs type="grantNumber">20202BAB214003</rs>), the <rs type="funder">Key Research and Development Program of Jiangxi Province</rs> (<rs type="grantNumber">20202BBE53014</rs>), the <rs type="funder">Open Foundation of Guo Rui Scientific Innovation Rare Earth Functional Materials Co</rs>., Ltd.(<rs type="grantNumber">KFJJ-2019-0004</rs>), the <rs type="funder">Doctoral Start-up Foundation of Jiangxi University of Science and Technology (205200100110)</rs>, and the <rs type="funder">Foundation of Jiangxi Educational Department</rs> (<rs type="grantNumber">GJJ200832</rs> and <rs type="grantNumber">GJJ190478</rs>).Institutional Review Board Statement: Not applicable.Informed Consent Statement: Not applicable.</p></div>
            </div>
            <listOrg type="funding">
                <org type="funding" xml:id="_3wHAxav">
                    <idno type="grant-number">51561009</idno>
                </org>
                <org type="funding" xml:id="_NRNDwrU">
                    <idno type="grant-number">20192BAB206004</idno>
                </org>
                <org type="funding" xml:id="_uNWJMnb">
                    <idno type="grant-number">20202BAB214003</idno>
                </org>
                <org type="funding" xml:id="_2MPuZAy">
                    <idno type="grant-number">20202BBE53014</idno>
                </org>
                <org type="funding" xml:id="_B7kBgef">
                    <idno type="grant-number">KFJJ-2019-0004</idno>
                </org>
                <org type="funding" xml:id="_tk7RJ29">
                    <idno type="grant-number">GJJ200832</idno>
                </org>
                <org type="funding" xml:id="_mCAyMcx">
                    <idno type="grant-number">GJJ190478</idno>
                </org>
            </listOrg>

            <div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="_Y8sCy4Q"><p xml:id="_8VCfdSN"><s xml:id="_9cHCbev" coords="11,167.27,420.46,292.63,8.63">Data Availability Statement: Data sharing is not applicable to this article.</s></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head xml:id="_N3EAdDh">Conflicts of Interest:</head><p xml:id="_Gha9GTZ"><s xml:id="_uN9vzJZ" coords="11,252.96,438.15,165.99,8.63">The authors declare no conflict of interest.</s></p></div>
            </div>

energies-14-08509.pdf

scottkerr-dataseer commented 3 months ago

Good question. Maybe we could ask Tim about sentence-level granularity in general in any section. It does come at some cost. Maybe we should support 2 modes.

I don't know what the significance of splitting into sentences is in the system. I know it doesn't play a role in deliverables to customers (unless it influences the rules - e.g. number of sentences with a specific value). It may be primarily for debugging purposes.

lfoppiano commented 2 months ago

I also found that the acknowledgement is not split into sentences. I assume it is the same case.

lfoppiano commented 2 months ago

Digging deeper, I noticed that the funding statement is correctly split into sentences; however, the sentences are lost when the statement is passed through the acknowledgment/funding parser:


// Build the funding section as TEI; at this point it is already split into sentences.
fundingStmt = getSectionAsTEI("funding",
    "\t\t\t",
    doc,
    SegmentationLabels.FUNDING,
    teiFormatter,
    resCitations,
    config);
if (fundingStmt.length() > 0) {
    // The funding-acknowledgement parser re-processes the fragment; the sentence markup is lost here.
    MutablePair<Element, MutableTriple<List<Funding>,List<Person>,List<Affiliation>>> localResult =
        parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config);

    if (localResult != null && localResult.getLeft() != null) {
        String local_tei = localResult.getLeft().toXML();
        local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", "");
        annexStatements.add(local_tei);
    } else {
        // Fall back to the original (sentence-segmented) statement.
        annexStatements.add(fundingStmt.toString());
    }
}

kermitt2 commented 2 months ago

Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that it would require taking into account the (numerous) annotations produced by this model when re-segmenting into sentences, which was not supported by the sentence segmentation (it only supports reference marker annotations).

As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation, which I developed to work directly on the final TEI XML and which, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

One idea would be to move to this simple generic sentence segmentation, instead of extending and complexifying the existing one.

(as visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected/new inline markup in the future)

lfoppiano commented 2 months ago

> Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that it would require taking into account the (numerous) annotations produced by this model when re-segmenting into sentences, which was not supported by the sentence segmentation (it only supports reference marker annotations).

Understood. It became clearer once I saw the TEIFormatter part related to funding and acknowledgments.

> As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation, which I developed to work directly on the final TEI XML and which, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

> One idea would be to move to this simple generic sentence segmentation, instead of extending and complexifying the existing one.

Sure. At the moment the current segmentation has only been extended to avoid URLs being split between sentences (#1097), because once the offset positions are collected it is just a matter of extending the list of forbidden positions. Since the work to output the URLs into the TEI might take some time and substantially more effort, I made two separate PRs (the changes in the segmenter might eventually be reverted in this PR).
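A minimal sketch of the "forbidden positions" idea mentioned above, assuming the segmenter produces candidate boundary offsets and that protected regions such as URLs are known as character spans. The names `ForbiddenBoundaryFilter`, `Span`, and `filterBoundaries` are hypothetical and are not GROBID's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForbiddenBoundaryFilter {

    /** Half-open character span [start, end) that must not be split (hypothetical helper type). */
    public static final class Span {
        public final int start;
        public final int end;

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Keeps only the candidate sentence boundaries that do not fall strictly
     * inside a forbidden span. A boundary exactly at a span's start or end is allowed.
     */
    public static List<Integer> filterBoundaries(List<Integer> candidates, List<Span> forbidden) {
        List<Integer> kept = new ArrayList<>();
        for (int boundary : candidates) {
            boolean insideForbidden = false;
            for (Span span : forbidden) {
                if (boundary > span.start && boundary < span.end) {
                    insideForbidden = true;
                    break;
                }
            }
            if (!insideForbidden) {
                kept.add(boundary);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative offsets only: a URL occupies [8, 20) and the segmenter proposed three boundaries.
        List<Integer> candidates = Arrays.asList(10, 25, 40);
        List<Span> forbidden = Arrays.asList(new Span(8, 20));
        System.out.println(filterBoundaries(candidates, forbidden)); // prints [25, 40]
    }
}
```

Supporting a new kind of protected region then only means adding its spans to the `forbidden` list, which is why this extension is cheap compared with emitting the URLs in the TEI.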

> (as visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected/new inline markup in the future)

lfoppiano commented 2 months ago

I think that with this approach (segmenting after the "final" markup is built) we won't be able to generate coordinates for each sentence, because the layout token information is lost after the transformation to XML.

One solution that comes to mind would be to work on the layout tokens before the TEI transformation, collect all the items in a list, and apply them in order, given that they do not overlap, the same way I did here: https://github.com/kermitt2/grobid/blob/0b5e2321737e6c2f9675f10661832b338b58cf54/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1570

This would require removing any TEI dependency from the funding/acknowledgment parser and dealing with the transformation to TEI outside the parser, instead of processing the XML Element/Node. I'm planning to cement it with a battery of tests. 😅
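A rough sketch of "collect all the items in a list and apply them in order", simplified to plain text offsets rather than layout tokens. The class `InlineAnnotationWriter` and its `Annotation` type are hypothetical names for illustration and do not correspond to the code at the link above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class InlineAnnotationWriter {

    /** Annotation over the half-open character range [start, end) (hypothetical helper type). */
    public static final class Annotation {
        public final int start;
        public final int end;
        public final String type;

        public Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Wraps each annotated range in an rs element; annotations must not overlap. */
    public static String apply(String text, List<Annotation> annotations) {
        List<Annotation> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator.comparingInt((Annotation a) -> a.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Annotation a : sorted) {
            if (a.start < cursor) {
                throw new IllegalArgumentException("Overlapping annotation at offset " + a.start);
            }
            out.append(escape(text.substring(cursor, a.start)));
            out.append("<rs type=\"").append(a.type).append("\">")
               .append(escape(text.substring(a.start, a.end)))
               .append("</rs>");
            cursor = a.end;
        }
        out.append(escape(text.substring(cursor)));
        return out.toString();
    }

    /** Minimal XML escaping for text content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String text = "We thank Drs. Carsten Korth and Nick Brandon.";
        List<Annotation> annotations = Arrays.asList(
            new Annotation(32, 44, "person"),
            new Annotation(9, 27, "person"));
        System.out.println(apply(text, annotations));
    }
}
```

Because the markup is produced only at the end, the offsets (and, in a token-based version, the coordinates) stay available until the TEI is written.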

@kermitt2 please let me know if you have any comment.

lfoppiano commented 2 months ago

After a few days of trying different solutions, I implemented it by modifying processingXmlFragment. This way the sentences are simply reused and the funding-acknowledgment entities are applied to them, rather than to the text stripped from the paragraph.

This approach also preserves the sentence coordinates and the reference markers, which were lost as well.

lfoppiano commented 2 months ago

I've started testing and noticed that in rare (but possible) cases the sentence segmentation, which is performed before the funding-acknowledgment model runs, results in sentence boundaries that fall inside funding-acknowledgment annotations.

For example, this is the original version, without sentence segmentation:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p coords="31,191.82,493.44,347.12,9.57;31,72.00,522.72,81.26,9.57">We thank
            <rs type="person">Drs. Carsten Korth</rs> and
            <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
        </p>
    </div>
</div>

Here the first sentence boundary falls inside the annotation "Drs. Carsten Korth":

<div type="acknowledgement">
    <div>
        <head>Acknowledgments:</head>
        <p>
            <s>We thank Drs.</s>
            <s>Carsten Korth and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>

I've then worked out a solution that allows merging and updating the sentences that are in this situation, including their coordinates.

Here is the result:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p>
            <s coords="31,191.82,493.44,63.87,9.57;31,258.46,493.44,280.48,9.57;31,72.00,522.72,81.26,9.57">We thank
                <rs type="person">Drs.Carsten Korth</rs> and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>
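
The merging step can be sketched roughly as follows, assuming sentences and annotations are both available as character offset ranges and each sentence carries a list of coordinate boxes. `SentenceMerger`, `Sentence`, and `Annotation` are hypothetical names, the coordinate values are placeholders, and the actual implementation may combine coordinates differently (for example by fusing adjacent boxes on the same line).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceMerger {

    /** Sentence over [start, end) with its coordinate boxes (hypothetical helper type). */
    public static final class Sentence {
        public final int start;
        public final int end;
        public final List<String> coords;

        public Sentence(int start, int end, List<String> coords) {
            this.start = start;
            this.end = end;
            this.coords = new ArrayList<>(coords);
        }
    }

    /** Annotation over [start, end) (hypothetical helper type). */
    public static final class Annotation {
        public final int start;
        public final int end;

        public Annotation(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Merges consecutive sentences whenever an annotation crosses the boundary between them. */
    public static List<Sentence> mergeSplitSentences(List<Sentence> sentences, List<Annotation> annotations) {
        List<Sentence> merged = new ArrayList<>();
        Sentence current = null;
        for (Sentence next : sentences) {
            if (current == null) {
                current = next;
            } else if (crossesBoundary(current.end, annotations)) {
                // Extend the current sentence and concatenate the coordinate boxes.
                List<String> coords = new ArrayList<>(current.coords);
                coords.addAll(next.coords);
                current = new Sentence(current.start, next.end, coords);
            } else {
                merged.add(current);
                current = next;
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    /** True if some annotation starts before the boundary and ends after it. */
    private static boolean crossesBoundary(int boundary, List<Annotation> annotations) {
        for (Annotation a : annotations) {
            if (a.start < boundary && a.end > boundary) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Offsets follow the acknowledgment example: "We thank Drs." is [0, 13)
        // and the annotation "Drs. Carsten Korth" is [9, 27).
        List<Sentence> sentences = Arrays.asList(
            new Sentence(0, 13, Arrays.asList("box1")),
            new Sentence(14, 93, Arrays.asList("box2", "box3")));
        List<Annotation> annotations = Arrays.asList(new Annotation(9, 27), new Annotation(32, 44));
        List<Sentence> result = mergeSplitSentences(sentences, annotations);
        System.out.println(result.size() + " sentence(s), coords=" + result.get(0).coords);
    }
}
```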

kermitt2 / grobid

Funding, acknowledgement statements are not split into sentences #1090