kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Annotation of footnoted references for training custom Grobid models #1171

Open cboulanger opened 1 week ago

cboulanger commented 1 week ago

I want to extract reference data from articles that use footnotes instead of a bibliography. The footnotes contain the references, mixed with additional commentary. Since Grobid was not trained on this kind of messy data, it does not perform very well when confronted with this type of source material.

I have a dataset of high-quality annotations in a different TEI format which I want to convert into the schema that Grobid uses for training. I have used the 'createTraining' command to produce the training files for the reference model and am currently trying to convert my data into something that fits the structure of the XML that I see there.

One major problem is of course that the bibl nodes in my data are not placed in the TEI/text/back/listBibl node, as they are in the existing training data, but in individual TEI/text/body/note nodes. The question is what Grobid expects to be trained with. Should a bibliographic section be "faked" from the given material or can the structure be kept? See an example.

The references.referenceSegmenter model actually recognizes footnotes very well but classifies them as bibl, i.e. <bibl><label>39</label> ...</bibl> instead of the "TEI"-ish way <note n="39" type="footnote" place="bottom"><bibl>... </bibl></note>. Since Grobid is not meant for TEI annotation but for metadata extraction, I don't care too much, but the question remains whether my material needs to be transformed or whether Grobid can (and should) be trained to differentiate bibliographies from footnotes this way.
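If the material does have to be transformed, the conversion from the `<note>`-based encoding to the `<bibl><label>` form could look roughly like the sketch below. It is only an illustration under several assumptions: the input has no XML namespace, the function name `notes_to_listbibl` is made up, and commentary that follows a `<bibl>` inside the note is simply dropped. It is not a description of what Grobid's training tools do.

```python
import xml.etree.ElementTree as ET


def notes_to_listbibl(tei_xml: str) -> str:
    """Collect <bibl> children of footnote <note> elements into one <listBibl>.

    Hypothetical helper: assumes un-namespaced input where each footnote is
    encoded as <note n="..." type="footnote"> with direct <bibl> children.
    """
    root = ET.fromstring(tei_xml)
    list_bibl = ET.Element("listBibl")
    for note in root.iter("note"):
        if note.get("type") != "footnote":
            continue
        for bibl in note.findall("bibl"):
            new_bibl = ET.SubElement(list_bibl, "bibl")
            # The footnote number becomes the reference label.
            label = ET.SubElement(new_bibl, "label")
            label.text = note.get("n", "")
            # Leading text of the original <bibl> follows the label.
            label.tail = " " + (bibl.text or "")
            # Child elements (and the text after each of them) are kept as-is.
            new_bibl.extend(list(bibl))
    return ET.tostring(list_bibl, encoding="unicode")
```

The footnote number is reused as the label so that the link between the marker in the body and the reference is not lost entirely.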

Happy to hear your thoughts.

cboulanger commented 1 week ago

To add to this: I assume that teaching Grobid the concept of footnotes that contain full bibliographic references would require retraining the page segmentation model...

lfoppiano commented 1 week ago

I think footnotes are already recognised by the segmentation model. Do you have one or two example PDFs that I can have a look at?

IMHO the challenge is more to understand when the references are in the footnotes and when they are not, and to pass the right piece to the bibliographic reference parser.

kermitt2 commented 1 week ago

Hello @cboulanger !

A superficial way to support bibliographical references in footnotes is to encode them as independent reference blocks in the segmentation training file, not as footnotes:

bla bla^5 bla
</body>

<listBibl>5. See for example Boulanger, C. 2024. "The problem of encoding references in footnote". 
Journal of complex issues, Volume 1</listBibl>. 

<page>2</page>

<body>
bla bla

When processing a file, the reference areas, instead of appearing as one block (the bibliographical section), appear as multiple blocks (several footnotes) that are combined to form a list of bibliographical references, which is then processed as usual. These reference segments are further segmented and parsed, and they appear together in the final TEI, no longer as footnotes. Nothing needs to change in Grobid for this to work.

There are two problems: this does not work well when the reference is mixed with other text, as is often the case in the Humanities and Law, and we lose the fact that it was a footnote and that the footnote marker in the text body was in fact a reference marker. To go beyond this superficial way of dealing with references in footnotes, we might need to introduce two types of footnotes in the segmentation model, the normal one and a "reference footnote" as a new label, which would trigger a different process.

cboulanger commented 1 week ago

Hi, thank you for your responses and feedback!

Unfortunately, the PDFs that I need to work with are not Open Access, but hopefully you have access to them through your institutions. Let me know if not, I could send them to you by email.

Let's take one example, with the DOI https://doi.org/10.1111/1467-6478.00057. This is an English-language article with endnotes. My annotations were originally made for AnyStyle, so for every PDF I have an XML file and a text file where each line is tagged (for copyright reasons, I can only publish a truncated version).

AnyStyle uses a very simple format for its training files compared to Grobid, which makes it easy to annotate and is the reason I originally chose it. It has no real information on the page layout and could only be translated into a page segmentation GT using some clever heuristics. However, the bibliographic segmentation data can be translated relatively easily - I tried to make it as TEI-conformant as possible, using <bibl> elements (example). According to what I found in the TEI specs, footnotes or endnotes are to be encoded as <note> elements (there are some problems with whitespace, which is inconsistently encoded, but let's ignore that for now). The <bibl> elements can then be further translated into <biblStruct> using existing tools (example).

In addition to this article (English, endnotes), I have prepared and manually checked three other annotations which are representative of the larger training corpus:

I do not have the resources to re-annotate Grobid training files from scratch. Instead, I want to be able to use my existing data. The pragmatic way would be to simply mimic the structure Grobid has already learned, which would be enough to get the reference data out of new documents. It would be nice, however, to be able to teach Grobid some new tricks and have the result be more TEI-conformant, so that a more fine-grained analysis of scholarly articles becomes possible in the future, for example by keeping the citation context (analysis of whether a footnote is supportive, contradicting, etc.). That's why I am unsure how to proceed at this moment.

cboulanger commented 1 week ago

A similar question of "forward-compatibility" for ML-based TEI annotation concerns footnotes containing back-references such as "id, p. 56" or "See Doe, op. cit (n. 5), p 45", which carry no new bibliographic information as such, but annotating them would provide rich information that could be harvested in later analyses. I know that is out of the scope of what Grobid is made for, but if Grobid could be trained to recognize these patterns, that would really open up some new research avenues.
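To make the kind of pattern concrete, here is a rough rule-based sketch that flags such back-references. It is only a heuristic baseline under my own assumptions (the marker list, the `(n. 5)` pointer format and the function name `find_back_reference` are all illustrative), not something Grobid provides, and it will produce false positives on ordinary uses of words like "id".

```python
import re

# Common back-reference markers; the list is an assumption and far from complete.
BACKREF = re.compile(
    r"\b(?:id\.?|ibid\.?|op\.\s*cit\.?|loc\.\s*cit\.?)(?=[\s,;(]|$)",
    re.IGNORECASE,
)
# Pointer to an earlier footnote, e.g. "(n. 5)".
NOTE_POINTER = re.compile(r"\(n\.\s*(\d+)\)")


def find_back_reference(text):
    """Return the matched marker and the referenced note number, or None."""
    marker = BACKREF.search(text)
    if marker is None:
        return None
    pointer = NOTE_POINTER.search(text)
    return {
        "marker": marker.group(0),
        "target_note": int(pointer.group(1)) if pointer else None,
    }
```

A trained model would obviously generalize better than this, but even such a crude filter could help to separate footnotes that introduce new references from those that only point back to earlier ones.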

cboulanger commented 1 day ago

Hi, maybe it makes more sense to start small: instead of thinking about the footnotes in the page context, I should probably first focus on the main problem, namely whether Grobid can be trained to deal with messy strings that contain more than one reference. That has nothing to do with the footnote itself, which might just as well contain a well-formatted reference that Grobid has no problem parsing.

Here are two examples of extremely messy reference strings, which I have passed to Grobid's "processCitation" service.
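For anyone who wants to reproduce this, a call to the service can be sketched as follows. I am assuming a Grobid server running locally on the default port 8070 and consolidation switched off; the helper names `build_citation_request` and `parse_citation` are only illustrative.

```python
import urllib.parse
import urllib.request

# Assumption: a local Grobid server on its default port.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(raw_reference, base_url=GROBID_URL):
    """Build the form-encoded POST request for one raw reference string."""
    body = urllib.parse.urlencode(
        {"citations": raw_reference, "consolidateCitations": "0"}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Accept": "application/xml"},
        method="POST",
    )


def parse_citation(raw_reference, base_url=GROBID_URL):
    """Send the reference string to the service and return the TEI XML as text."""
    request = build_citation_request(raw_reference, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `parse_citation(...)` with the full footnote text as a single string is what produces results like the ones below.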

Example 1: English footnote

3 R. Goff, ‘The Search for Principle’ (1983) Proceeedings of the British Academy 169, at 171. This is an amplification of Dicey’s remark that ‘[b]y adequate study and careful thought whole departments of law can . . . be reduced to order and exhibited under the form of a few principles which sum up the effect of a hundred cases . . .’. A. Dicey, Can English Law be taught at the Universities? (1883) 20.

Result:

<biblStruct>
    <monogr>
        <title level="m" type="main">Proceeedings of the British Academy 169, at 171. This is an amplification of Dicey&#8217;s remark that &#8216;[b]y adequate study and careful thought whole departments of law can . . . be reduced</title>
        <author>
            <persName><forename type="first">R</forename><surname>Goff</surname></persName>
        </author>
        <imprint>
            <date type="published" when="1983">1983</date>
        </imprint>
    </monogr>
    <note>The Search for Principle. to order and exhibited under the form of a few principles which sum up the effect of a hundred cases . . .&#8217;. A. Dicey, Can English Law be taught at the Universities? (1883) 20</note>
</biblStruct>

It recognizes the first one ok-ish but does not know how to deal with the comment on Dicey. The second one is just put into a note.
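To check a larger batch of results for this failure mode, a small script could summarize each returned `<biblStruct>` and test whether an author who should have a reference of their own only shows up inside the `<note>`. This is a sketch that assumes un-namespaced output like the snippet above; both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_biblstruct(xml_text):
    """Pull titles, surnames and notes out of one <biblStruct> result."""
    root = ET.fromstring(xml_text)
    return {
        "titles": [t.text for t in root.iter("title")],
        "surnames": [s.text for s in root.iter("surname")],
        "notes": [n.text for n in root.iter("note")],
    }


def leaked_into_note(summary, expected_surname):
    """True if the expected author only appears inside a <note>, not as an author."""
    in_note = any(expected_surname in (note or "") for note in summary["notes"])
    return in_note and expected_surname not in summary["surnames"]
```

For Example 1, `leaked_into_note(summary, "Dicey")` would flag that the second reference was swallowed by the note instead of being parsed separately.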

Example 2: German footnote

11 Dazu Grimshau, Comparative Sociology - In What Ways Different From Other Sociologies?, in: Armer/Grimshaw 3 (18). Auch der Oberbegriff „cross System comparison" wird vorgeschlagen, Tomasson, Introduction; Comparative Sociology — The State of the Art, in: Tomasson (Hrsg.), Comparative Studies in Sociology Vol. 1 (Greenwich, Conn. 1978) 1. — Über die Methoden interkultureller und internationaler Vergleiche ist inzwischen so viel geschrieben worden, daß nicht nur die Fülle des Materials schon wieder abschreckend wirkt, sondern daß es auch genügt, im Rahmen dieses Aufsatzes nur einige wichtige Aspekte anzusprechen. Bibliographien finden sich etwa bei Rokkan/Verba/Viet/Almasy 117 ff.; Vallier 423 ff.; Almasy/Balandier/Delatte, Comparative Survey Analysis — An Annotated Bibliography 1967 — 1973 (Beverly Hills, London 1976) sowie bei Marsh, Comparative Sociology (New York, Chicago, San Francisco, Atlanta 1967) 375 ff.

Grobid does not know what to do with that kind of footnote from hell:

<biblStruct>
    <analytic>
        <title level="a" type="main">&#220;ber die Methoden interkultureller und internationaler Vergleiche ist inzwischen so viel geschrieben worden, da&#223; nicht nur die F&#252;lle des Materials schon wieder abschreckend wirkt, sondern da&#223; es auch gen&#252;gt, im Rahmen dieses Aufsatzes nur einige wichtige Aspekte anzusprechen. Bibliographien finden sich etwa bei Rokkan/Verba/Viet/Almasy 117</title>
        <author>
            <persName><forename type="first">Dazu</forename><surname>Grimshau</surname></persName>
        </author>
    </analytic>
    <monogr>
        <title level="m">Comparative Survey Analysis - An Annotated Bibliography 1967 - 1973</title>
        <title level="s">Comparative Studies in Sociology</title>
        <editor>
            <persName><forename type="first">/</forename><surname>Almasy</surname></persName>
        </editor>
        <editor>
            <persName><forename type="first">/</forename><surname>Balandier</surname></persName>
        </editor>
        <editor>
            <persName><surname>Delatte</surname></persName>
        </editor>
        <meeting><address><addrLine>Greenwich, Conn; Beverly Hills, London; New York, Chicago, San Francisco, Atlanta</addrLine></address></meeting>
        <imprint>
            <date type="published" when="1967">1978. 1976. 1967</date>
            <biblScope unit="volume">1</biblScope>
            <biblScope unit="page">375</biblScope>
        </imprint>
    </monogr>
    <note>sowie bei Marsh, Comparative Sociology</note>
</biblStruct>

Here is some gold standard data that shows (more or less) what the result should look like, generated from this hand-annotated <bibl>.

Now the question is: do you think Grobid, as it is, can be trained to recognize this kind of messy pattern (requiring only the right kind of training data), or would that require changing the code?