kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Sentence segmentation correction with reference markers - Figure ref markers do not seem to work #1043

Open lfoppiano opened 1 year ago

lfoppiano commented 1 year ago

I think I found something interesting. It might be due to the change in the ref marker annotation for figures, when we changed from <ref>Fig. 3</ref> to Fig. <ref>3</ref>. I'm not sure it's actually something that can be fixed.

Example of document: https://pubs.rsc.org/en/content/articlepdf/2017/nr/c6nr09464c

I have a nice example between pages 3684 and 3685: Moreover, a bimodal grain size distribution was observed in the magnet with a large loading rate [Fig. 19(a)]. On the contrary, it was observed that grains were well aligned with the c-axis parallel to the loading direction in the slowly deformed magnet [Fig. 19(b)].

They appear in the output as this:

                    <s>Moreover, a bimodal grain size   
                        <ref type="bibr" target="#b78">79</ref> distribution was observed in the magnet with a large loading rate [Fig.
                    </s>
                    <s>
                        <ref type="figure" target="#fig_16">19(a)</ref>].
                    </s>
                    <s>On the contrary, it was observed that grains were well aligned with the c-axis parallel to the loading direction in the slowly deformed magnet [Fig.</s>
                    <s>
                        <ref type="figure" target="#fig_16">19(b)</ref>].
                    </s>
                </p>

Since the sentence is split more than one character before the reference marker (because of the . after Fig), it seems we are not able to correct the sentence range in the first part:

https://github.com/kermitt2/grobid/blob/2c720dd55344f48edab83f636f9bff640a92fbdc/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java#L144

but also, since the marker is not a superscript, we cannot fix it when synchronizing with the layout tokens in the second part:

https://github.com/kermitt2/grobid/blob/2c720dd55344f48edab83f636f9bff640a92fbdc/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java#L198C14-L198C14

Maybe we can add a special check for the case where the marker is a figure marker and the previous sentence ends with Fig. (Figure should not be a problem; it seems the issue is that the sentence segmentation gets tricked by the .)
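
To make the idea concrete, here is a minimal sketch of such a check. It is written in Python for brevity and is not GROBID code: the function name merge_figure_splits, the regular expression, and the offset-based inputs are all hypothetical. It assumes we have the sentence offsets from the segmenter and the offsets of the figure reference markers, and it merges two sentences only when the first ends with "Fig." (or "Figs.") and the second starts exactly on a figure marker.

```python
import re

# Hypothetical pattern: "Fig." or "Figs." at the very end of a sentence,
# at the start of the text or after whitespace, "[" or "(".
FIG_ABBREV_AT_END = re.compile(r"(?:^|[\s\[(])Figs?\.\s*$")


def merge_figure_splits(text, sentences, figure_refs):
    """Merge sentence spans that were split between 'Fig.' and a figure marker.

    text        -- the paragraph text
    sentences   -- list of (start, end) offsets produced by the segmenter
    figure_refs -- list of (start, end) offsets of figure reference markers
    """
    ref_starts = {ref_start for ref_start, _ in figure_refs}
    merged = []
    for start, end in sentences:
        if merged:
            prev_start, prev_end = merged[-1]
            ends_with_fig = FIG_ABBREV_AT_END.search(text[prev_start:prev_end])
            # Position of the first non-whitespace character of this sentence.
            chunk = text[start:end]
            first_char = start + (len(chunk) - len(chunk.lstrip()))
            if ends_with_fig and first_char in ref_starts:
                # Extend the previous sentence instead of opening a new one.
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged
```

With the example above, the spans for "... [Fig." and "19(a)]." would be merged back into a single sentence, while a sentence that merely ends with "Fig." and is not followed by a figure marker is left alone.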

kermitt2 commented 1 year ago

Hi Luca !

Indeed, only the content of the reference marker prevents a segmentation, and here the Fig. is outside the figure reference marker.

You can also try the alternative sentence segmenter, in the config:

sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"

It is much slower but a bit better.

By default, only the out-of-the-box sentence segmenters are used, so they don't have extra rules for the fig. pattern (like we did in another project). One solution would be to use those improved versions with extra rules (switching to blingfire would be nice too). That would fix the problem without ad hoc rules, which might depend on the selected segmenter.
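
As an illustration only, one way to add a fig. rule that does not depend on the selected segmenter is to mask the period of "Fig." before segmenting and map the resulting offsets back to the original text. The sketch below is in Python and is an assumption, not the rules used in the other project: naive_segmenter is a stand-in for a real segmenter, and FIG_BEFORE_NUMBER and PLACEHOLDER are hypothetical names.

```python
import re

# Hypothetical rule: "Fig." / "Figs." followed by optional whitespace and a digit.
FIG_BEFORE_NUMBER = re.compile(r"\b(Figs?)\.(?=\s*\d)")
# One-character placeholder, so masked and original offsets stay aligned.
PLACEHOLDER = "\u00b7"


def naive_segmenter(text):
    """Stand-in segmenter: split after '.', '!' or '?' followed by whitespace."""
    spans, start = [], 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def segment_with_fig_rule(text, segmenter=naive_segmenter):
    """Run any offset-based segmenter on a masked copy of the text."""
    masked = FIG_BEFORE_NUMBER.sub(lambda m: m.group(1) + PLACEHOLDER, text)
    spans = segmenter(masked)
    # Offsets are unchanged by the masking, so slice the original text.
    return [text[start:end] for start, end in spans]
```

Because the placeholder has the same length as the period it replaces, the spans returned by the segmenter can be applied directly to the unmasked text, whichever segmenter is plugged in.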

lfoppiano commented 1 year ago

Yes, indeed the pragmatic segmenter gives much better results, but it's 6 times slower or so 😅