kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.47k stars, 448 forks

Annotation Questions on Header #491

Open sampanriver opened 5 years ago

sampanriver commented 5 years ago

I don't see any annotation guidelines for the header model. Could anyone give me a hint on how to annotate headers? In addition, every Chinese scientific paper has its title, authors, abstract and affiliations in both Chinese and English; an example is attached (down.pdf). Could anyone tell me how to handle this situation?

kermitt2 commented 5 years ago

Hi @sampanriver! Indeed, there are no annotation guidelines for the header, because we want to update the header model. It is the oldest model and has many design flaws; in particular, its annotations are not consistent with the principles of the other models.

The plan to update the header model is now quite old (see #136), but there has been some progress (PR #457), and an improved reading order has been added to pdfalto (it still needs more testing) to address the current problems with the header.

For this reason, there are no annotation guidelines for the header, and I would not advise producing new training data for the header before we move to the new design, because the training data might need revision (different token order, different labels, different labelling principles).

In the new approach (the one used for all the other models), only the minimum useful information is annotated and everything else is ignored. In the current, outdated header model, all the header content is annotated, including all the useless syntactic sugar and noise. This complicates the annotation, the learning and the cleaning of ML results all at once, for no benefit (the approach comes from the CORA dataset, which was used when this project started).
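To make the contrast between the two labelling principles concrete, here is a toy sketch in Python. The token sequence, the label names (`<title>`, `<author>`, `<date>`, `<other>`), the `NOISE` set and the helper `minimal_labels` are all hypothetical and chosen for illustration only; they are not the actual GROBID label set or training format.

```python
# Toy illustration only: tokens and labels are invented, not GROBID's real label set.

# Exhaustive style: every header token carries a field label, including
# punctuation and filler words that have no metadata value.
exhaustive = [
    ("Deep", "<title>"),
    ("Parsing", "<title>"),
    ("by", "<author>"),
    ("Jane", "<author>"),
    ("Doe", "<author>"),
    (",", "<author>"),
    ("Received", "<date>"),
    (":", "<date>"),
    ("2019", "<date>"),
]

# Hypothetical list of tokens considered noise in this toy header.
NOISE = {"by", ",", ":", "Received"}


def minimal_labels(tokens):
    """Keep field labels only on useful tokens; mark the rest as ignored."""
    return [(tok, "<other>" if tok in NOISE else label) for tok, label in tokens]


minimal = minimal_labels(exhaustive)

# Only the tokens that remain labelled would need cleaning downstream.
useful = [tok for tok, label in minimal if label != "<other>"]
print(useful)
```

In the minimal style, an annotator only has to mark the field content itself, and the post-processing no longer has to strip words such as "by" or "Received" from the extracted values.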

In addition, we want to support the new cases that you mention: multiple language versions of all the fields (very frequent for titles and abstracts but, as you show, present for other fields too), and additional cases such as a discontinuous header whose parts are distributed across the document (thanks a lot for the test case!).

My answer is probably not very satisfying for you, because it will certainly take a few months to make real progress on the new header model. The cases you mention are simply not supported at the moment, and the best path is a new iteration with a more complete redesign rather than simply adding new labels.

kermitt2 commented 4 years ago

The header model has been rewritten, with new labels, new features, and so on. I removed all the old training data (including old CORA stuff).

Annotation guidelines for the header model are now available: https://grobid.readthedocs.io/en/latest/training/header/