kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Feedback from a Spanish Dissertations Reference Extraction Run #809

Open silviaegt opened 3 years ago

silviaegt commented 3 years ago

Hey @lfoppiano, @kermitt2!

First of all: thank you for the amazing-and-freely-available tool you've created.

It is the backbone of my dissertation: a history of recent research in Humanities PhD dissertations from seven different countries (Mexico, Chile, Argentina, Spain, Germany, UK, US).

I am fully aware your tool was not developed for this type of document (nor specifically trained on non-English datasets), but I'm writing to see whether my results could be used to improve the tool and contribute to the community you have built around it.

With this goal in mind, my colleague @rodyoukai and I created a Python script (90% Rodrigo's writing) to parse the TEI references (biblStruct) from a GROBID run on 1,139 dissertations from Mexico. We got 95,382 distinct references for 766 of them.
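
For context, this is roughly the kind of parsing involved. A simplified sketch with lxml follows (not our exact script; the file name and the fields kept are illustrative only):

```python
# Simplified sketch: pull <biblStruct> references out of a GROBID TEI file.
from lxml import etree

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_references(tei_path):
    tree = etree.parse(tei_path)
    refs = []
    for bibl in tree.findall(".//tei:listBibl/tei:biblStruct", namespaces=TEI):
        date = bibl.find(".//tei:date[@type='published']", namespaces=TEI)
        refs.append({
            "title": bibl.findtext(".//tei:title", namespaces=TEI),
            "year": date.get("when") if date is not None else None,
            "authors": [
                " ".join(pers.itertext()).strip()
                for pers in bibl.findall(".//tei:author/tei:persName", namespaces=TEI)
            ],
        })
    return refs

refs = extract_references("dissertation.grobid.tei.xml")  # hypothetical file name
```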

In order to understand what types of "errors" we got, I created an R script to identify mistakes in the field I believe has the most consistent format: year (YYYY). I used this as a proxy for references with errors; the resulting table looks like this: (table screenshot attached to the issue)
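
The exact rule in that R script is not reproduced here, but the proxy amounts to checking whether the published date starts with a plausible four-digit year, along the lines of the following sketch (the year bounds are illustrative assumptions):

```python
import re

def year_is_plausible(when_value, earliest=1500, latest=2022):
    # GROBID emits dates as @when="YYYY" or "YYYY-MM-DD"; anything else
    # (e.g. "0200", or a page number picked up as a date) is flagged as an error.
    if not when_value:
        return False
    m = re.match(r"(\d{4})", when_value)
    return bool(m) and earliest <= int(m.group(1)) <= latest
```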

I then used that dataframe to train a simple logistic regression classifier, in order to better understand which tokens are best at "predicting" that a parsed reference has a mistake, and this is what I got:

(screenshot of the classifier's top predictive tokens, attached to the issue)
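
Conceptually, the classifier is just a bag-of-words logistic regression over the reference text, with the error flag from the previous step as the label. A sketch (the file and column names are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical file: one row per parsed reference, with its raw text and the error flag.
df = pd.read_csv("references_with_error_flag.csv")

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["reference_text"])
y = df["has_year_error"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Tokens with the largest positive coefficients are the strongest predictors
# of a badly parsed reference (e.g. ibid/ibíd/ibidem showing up as authors).
coefs = pd.Series(clf.coef_[0], index=vectorizer.get_feature_names_out())
print(coefs.sort_values(ascending=False).head(20))
```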

As you can see, words within the author tag like ibid/ibíd/ibidem are very common in wrongly parsed references, and I was wondering what the best way would be to train GROBID's models to get better at this type of issue.

caifand commented 3 years ago

Hi @kermitt2, I have been working with @silviaegt a bit to look into the issues she and her colleagues encounter. She gets some abnormal values in the date field and also in other fields (title, for instance; correct me if I am wrong @silviaegt). These specifically come from the reference segmentation results. I am wondering if it's an issue that needs more training data (e.g., more training in Spanish, or in other languages altogether?), though dates should be more language-independent than other fields, from my understanding?

kermitt2 commented 3 years ago

Hi @silviaegt and @caifand !

Thanks a lot for the issue and the kind words on Grobid and this feedback.

This work is very challenging for Grobid in its current state, for 3 reasons: 1) Grobid does not support dissertations, 2) the Grobid training data do not include "Humanities"-style references, 3) there are almost no references in Spanish in the training data (and maybe none at all, I don't remember).

1) The first thing I have to stress is that unfortunately Grobid currently does not really support dissertations, theses, etc., and in general any monographs, including books and full conference proceedings. It is trained only on "article"-type documents, which also covers individual chapters, short papers, short reports, ... this kind of document. It will "see" a dissertation as a standalone article: it has no modelling of chapters, tables of contents/figures/tables, etc., and does not really expect to find references where they actually appear in a dissertation.

I have tried over the last years to get an additional "monograph" model to cover all these document types, as an additional starting model in the Grobid cascade, but unfortunately it didn't take off, and I think it would require a structured effort - it's not something easy to do in a project maintained/improved in spare time. Training data in particular is hard to get without proper funding. But technically, there is no reason it would not work very well.

So this explains why you get references in only 766 dissertations out of 1,139: it's luck that these 766 dissertations look enough like a "standalone article" for references to be found. The citation contexts for these references are also probably very bad.

2) The second limitation is the style of references in the Humanities, which is also not really covered by Grobid so far. So there are no training examples with ibid, ibidem, etc. However, it should be very easy to add them to the training data. A few examples should be enough to avoid confusion with author names. Your logistic regression classifier looks excellent for spotting this kind of error!

How to add examples to the training data: see https://grobid.readthedocs.io/en/latest/training/General-principles/ Basically: process the dissertation PDFs with the createTraining batch, then edit the *.training.references.tei.xml files for the examples containing ibid/ibíd/ibidem. The annotation guidelines are there.
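
If it helps, the batch can be driven from a small script along these lines (a sketch only; the jar name/version and the paths are placeholders for a local build):

```python
import subprocess

# Sketch of running GROBID's createTraining batch over a folder of PDFs;
# the jar name/version and the paths are placeholders.
subprocess.run(
    [
        "java", "-Xmx4G",
        "-jar", "grobid-core-0.7.2-onejar.jar",
        "-gH", "grobid-home",
        "-dIn", "/path/to/dissertation-pdfs",
        "-dOut", "/path/to/training-output",
        "-exe", "createTraining",
    ],
    check=True,
)
```

This produces, among other files, the *.training.references.tei.xml pre-annotations that can then be corrected by hand.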

3) Adding a bit of training data in Spanish to the reference-segmenter and citation models would probably help (given we have around 0 now), but I can't really say anything more precise without looking at the results. There will certainly always be issues with the overall segmentation and with finding the reference areas in dissertations, because of 1). When the date "chunk" is available, it is certainly strongly language-independent. I suppose the problem is the context used to identify the date chunk in a reference string, which could lead to errors because it mostly relies on English wording for now.

silviaegt commented 3 years ago

Wow @kermitt2, thank you so much for your answer and all the tips to train the data!

I completely understand GROBID does not support this format (dissertations), nor the language (Spanish), nor the subject (Humanities), and I know it is a whole lot of work too! As you said, it would need funding and a structured effort. However, since it actually did work to some extent, I was just wondering if there might be any workarounds that could capitalize on an approach mindful of the time/resources we have.

These are my two ideas for it:

  1. Certain problems seem to be consistent, so I was hoping I could tag only references with those patterns and create some artificial examples that contain those problems?
  2. Also, I have seen some mistakes that do not seem to be language/discipline dependent but rather seem to come from an error in the original training data? For instance, the most common "year" error was `<date type="published" when="0200" />`, and looking at my sources I saw that, for the most part, page numbers were interpreted as dates, so perhaps (only perhaps) there are some papers in the training data that have this mistake as well?

Anyways, thank you so much for taking the time!

kermitt2 commented 2 years ago

Hi @silviaegt !

For info, I've added the examples from your error page to the training data with commit 2e30c274c93575a481d26c8ee771a3cd3fa743a7 (file citations.xml).

If you have more error cases ("real" cases only), this would be very welcome because it would allow us to extend the coverage of the models.