kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Generate training data for `fulltext` in areas other than the `body` #873

Open de-code opened 2 years ago

de-code commented 2 years ago

As I was replicating the training data generation, I realised that it currently only generates fulltext training data for the body area (as identified by the segmentation model).

As the fulltext model is used to parse other parts of the document as well, it would make sense to generate training data for those too. Areas that come to mind are:

- annex sections
- acknowledgements
- supplementary materials

Related code:

https://github.com/kermitt2/grobid/blob/0.7.0/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java#L1161-L1230
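For illustration, a rough sketch of what this could look like (assuming the `Document#getDocumentPart` accessor and the `SegmentationLabels` constants from 0.7.0; the helper method itself is hypothetical, not the actual GROBID API):

```java
import java.util.SortedSet;

import org.grobid.core.document.Document;
import org.grobid.core.document.DocumentPiece;
import org.grobid.core.engines.label.SegmentationLabels;
import org.grobid.core.engines.label.TaggingLabel;

public class FullTextTrainingSketch {

    // Hypothetical helper: generate fulltext training data for one
    // segmented area instead of hard-coding the body.
    void createTrainingForArea(Document doc, TaggingLabel areaLabel) {
        // DocumentPieces assigned to this label by the segmentation model
        SortedSet<DocumentPiece> parts = doc.getDocumentPart(areaLabel);
        if (parts == null || parts.isEmpty()) {
            return; // nothing was segmented under this label
        }
        // ... same feature generation and pre-annotation as for the body ...
    }

    void createTraining(Document doc) {
        createTrainingForArea(doc, SegmentationLabels.BODY);
        // areas that can be parsed with the fulltext model at runtime
        // but are currently skipped when generating training data:
        createTrainingForArea(doc, SegmentationLabels.ANNEX);
        createTrainingForArea(doc, SegmentationLabels.ACKNOWLEDGEMENT);
    }
}
```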

kermitt2 commented 2 years ago

Hi @de-code !

So far, using the fulltext model for these structures has been a fallback.

For annex sections, supplementary materials, etc., a dedicated model or process might be relevant too, but I was unsure. That is why these structures are not used for generating pre-annotated training data for the fulltext model. My fear in particular is introducing specific structures that would degrade how normal full text is processed - especially given that there are so few training examples at the moment even for the normal text body of articles.

de-code commented 2 years ago

Thank you for the quick feedback.

Personally I find that every additional model adds a certain amount of overhead: not just in terms of code, but more so in preparing the training data, training and managing the trained models, and explaining it all. The structure seems so similar across those areas that using a single model seems like a good idea. Especially with a small amount of training data, you would benefit from having data available from all of those areas.

You could make it multi-modal to allow the model to learn the specifics of each area.
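For example (purely hypothetical, not the current feature generation code), the area could simply be appended as one more categorical feature on each token's feature line, so that a single model can still condition on where the text comes from:

```java
// Hypothetical sketch: add the segmentation area ("body", "annex",
// "acknowledgement", ...) as an extra feature on each token line,
// letting a single fulltext model learn area-specific behaviour.
String withAreaFeature(String tokenFeatureLine, String area) {
    return tokenFeatureLine + " " + area;
}
```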

Maybe I am biased, as reducing the number of models is a direction I was already planning to look into.

In any case, do you think the issue should be closed?

kermitt2 commented 2 years ago

The acknowledgement model is really different because it's about finding contract numbers, funding agencies, acknowledged persons, etc.

Yes, I totally agree about reducing the number of models. So far they have been introduced by necessity, due to lack of training data, imbalanced training data, or both. For instance, I started with only one model for figure and table structuring, because the structures are not so different (caption, content, title, ...), but separating it into two models led to significantly better accuracy.

About multi-modal models, so mainly using transformers: I started an experiment agenda (#666 - I called it multi-task there, which I think is the same idea), but no progress so far... it's complicated to get good results and requires lots of experiments/time. From what I have seen, there is no guarantee that a multi-task model will be better than individual models - this was a surprise to me :). It does not really simplify training and data management upstream either, because there is still a need to segment the "tasks" or "modalities", and the hard problem of consistent manual annotation for fine-tuning remains, I think.

Other issues are joining layout information to text in transformers, which is, as you know well, the subject of ongoing works like VILA or LayoutLM, and the input sequence length, which is hard to adapt to full documents.

If at some point in the future we could have a single pre-trained model for scientific content with joined layout information, applied to a range of structuring tasks and still competitive, that would be the achievement of the decade :)