Open lfoppiano opened 1 year ago
Hi Luca!
Indeed, only the content of the reference marker prevents segmentation, and here the `Fig.` is outside the figure reference marker.
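To make the point concrete, here is a minimal Python sketch (not GROBID's actual code) of a segmenter that drops candidate boundaries falling inside a protected reference marker span. The function names, the sample text, and the span offsets are all hypothetical and only illustrate why the period after `Fig` is protected when the marker covers `Fig. 19(a)`, but not when it covers only `19(a)`.

```python
import re

def candidate_boundaries(text):
    """Naive candidates: the offset just after a '.' that is followed by whitespace."""
    return [m.end() for m in re.finditer(r"\.(?=\s)", text)]

def filter_boundaries(text, protected_spans):
    """Keep only the candidates whose period is outside every protected (start, end) span."""
    kept = []
    for boundary in candidate_boundaries(text):
        period = boundary - 1  # offset of the '.' itself
        if not any(start <= period < end for start, end in protected_spans):
            kept.append(boundary)
    return kept

# Hypothetical sample, loosely based on the sentence reported in this issue.
text = "A bimodal distribution was observed [Fig. 19(a)]. On the contrary, grains were aligned."

fig = text.index("Fig.")
num = text.index("19(a)")
old_style = [(fig, num + len("19(a)"))]  # marker covers "Fig. 19(a)"
new_style = [(num, num + len("19(a)"))]  # marker covers only "19(a)"

print(filter_boundaries(text, old_style))  # the period after "Fig" is protected
print(filter_boundaries(text, new_style))  # the period after "Fig" is a boundary
```

With the old-style span only the real sentence end survives; with the new-style span the segmenter is free to split right after `Fig.`.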
You can also try the alternative sentence segmenter, in the config:
sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
It is much slower but a bit better.
By default, only the out-of-the-box sentence segmenters are used, so they don't have extra rules for the `fig.` pattern (like we did in another project). One solution would be to use those improved versions with extra rules (switching to blingfire would be nice too). That would fix the problem without ad hoc rules, which might depend on the selected segmenter.
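As a rough illustration of what such an extra rule could look like at the segmenter level, here is a toy Python splitter with an abbreviation exception list. It is only a sketch: the abbreviation list and the function name are made up for the example, and it is not how OpenNLP, the pragmatic segmenter, or blingfire are implemented.

```python
import re

# Hypothetical abbreviation list; a real segmenter would ship its own rules.
ABBREVIATIONS = frozenset({"fig", "figs", "eq", "eqs", "tab"})

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on '.' followed by whitespace, unless the word before the period is an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+", text):
        # Word immediately before the period, if any.
        before = re.search(r"(\w+)$", text[start:m.start()])
        if before and before.group(1).lower() in abbreviations:
            continue  # "Fig." and friends do not end a sentence
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences

text = ("The distribution was bimodal [Fig. 19(a)]. "
        "On the contrary, grains were aligned [Fig. 19(b)].")

for sentence in split_sentences(text):
    print(sentence)
```

Calling `split_sentences(text, abbreviations=frozenset())` shows the unpatched behaviour, where each `Fig.` ends a sentence.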
Yes, the pragmatic segmenter gives much better results, but it's indeed about 6 times slower 😅
I think I found something interesting. It might be due to the change to the ref marker annotations for figures, when we changed from `<ref>Fig. 3</ref>` to `Fig. <ref>3</ref>`. I'm not sure it's actually something that can be fixed.
Example document: https://pubs.rsc.org/en/content/articlepdf/2017/nr/c6nr09464c
I have a nice example between pages 3684 and 3685:
Moreover, a bimodal grain size distribution was observed in the magnet with a large loading rate [Fig. 19(a)]. On the contrary, it was observed that grains were well aligned with the c-axis parallel to the loading direction in the slowly deformed magnet [Fig. 19(b)].
They appear in the output like this:
Since the sentence is segmented more than one character before the reference marker (due to the `.` after `Fig`), it seems we are not able to correct the sentence range in the first part:
https://github.com/kermitt2/grobid/blob/2c720dd55344f48edab83f636f9bff640a92fbdc/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java#L144
Also, since it's not a superscript, we cannot fix it when synchronizing with the layout tokens in the second part:
https://github.com/kermitt2/grobid/blob/2c720dd55344f48edab83f636f9bff640a92fbdc/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java#L198C14-L198C14
Maybe we can add a special check for when it's a figure marker and the previous sentence ends with `Fig.` (`Figure` should not be a problem; it seems the issue is that the sentence segmentation gets tricked by the `.`).
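A possible shape for that check, sketched in Python rather than in the Java of `SentenceUtilities`: after segmentation, merge a sentence into the previous one when the previous one ends with `Fig.` (or `Figs.`) and the current one starts exactly at a figure marker. Everything here (function names, the offset representation, the sample text) is an assumption for illustration, not the existing GROBID logic.

```python
import re

FIG_ABBREV = re.compile(r"\bFigs?\.$")

def naive_offsets(text):
    """Stand-in for a tricked segmenter: every '.' followed by whitespace ends a sentence."""
    offsets, start = [], 0
    for m in re.finditer(r"\.\s+", text):
        offsets.append((start, m.start() + 1))
        start = m.end()
    if start < len(text):
        offsets.append((start, len(text)))
    return offsets

def merge_fig_splits(text, sentences, figure_markers):
    """Merge (start, end) sentence spans that were split between 'Fig.' and a figure marker."""
    marker_starts = {start for start, _ in figure_markers}
    merged = []
    for start, end in sentences:
        if merged:
            prev_start, prev_end = merged[-1]
            prev_text = text[prev_start:prev_end].rstrip()
            if FIG_ABBREV.search(prev_text) and start in marker_starts:
                merged[-1] = (prev_start, end)  # extend the previous sentence
                continue
        merged.append((start, end))
    return merged

text = "It was bimodal [Fig. 19(a)]. Grains were aligned [Fig. 19(b)]."
# Hypothetical figure marker spans covering only "19(a)" and "19(b)".
markers = [(m.start(), m.end()) for m in re.finditer(r"19\([ab]\)", text)]

fixed = merge_fig_splits(text, naive_offsets(text), markers)
print([text[s:e] for s, e in fixed])
```

Because the check only looks at offsets and marker positions, it would not depend on which segmenter produced the split, although it is still an ad hoc rule.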