kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.54k stars, 452 forks

Documentation: FullText annotation guidelines, ref types, figure type `box` and other #664

Open de-code opened 4 years ago

de-code commented 4 years ago

Hi,

I am in the process of generating fulltext training data.

I noticed that the annotation guidelines mention the following types for ref:

The TEIFulltextSaxParser appears to support:

That is, `box` doesn't seem to be supported, whereas `section` is supported but not documented. I couldn't find examples of `section` or `box` references in the corpus, although bioRxiv documents have both.


Perhaps related: the figure type `box` doesn't seem to be supported either (I had also raised #655).


While other models seem to use <note type="other">, the fulltext model appears to use <other>. I couldn't find any documentation for it, and the corpus doesn't contain any examples. I did find examples in my generated training data, covering text that doesn't really belong in the body, which is probably a fault of the segmentation model.

EDIT: it seems that the training data generator actually generates <note type="other"> instead of <other>.

kermitt2 commented 4 years ago

There are a few boxes annotated in the ISTEX annotated fulltext (the box and the reference to the box), but they are so few that it is not usable at all at this stage. The idea was just to have an annotation scheme to at least annotate them. For really supporting "box", we would probably need 100 times more training data or even more (and likely some specific layout features for the box).

Reference to a section is a bit similar, too few at this stage to be able to do anything, but that would be easier to support than boxes because much more frequent and simpler. However, I was thinking to address this only after some effort to better extract and parse the section title numbers (but for this too I was waiting for more training data).

About the <other>, it should be <note type="other"> indeed for consistency, but normally we just need to leave the part not annotated under <text> to get it as "other" label, which is more readable than adding a <note> mark-up. In general, the training fulltext TEI is not really stable or finished (for instance, the <body> element is missing, although it would be needed for the files to be valid TEI).

de-code commented 4 years ago

> There are a few boxes annotated in the ISTEX annotated fulltext (the box and the reference to the box), but they are so few that it is not usable at all at this stage. The idea was just to have an annotation scheme to at least annotate them. For really supporting "box", we would probably need 100 times more training data or even more (and likely some specific layout features for the box).
>
> Reference to a section is a bit similar, too few at this stage to be able to do anything, but that would be easier to support than boxes because much more frequent and simpler. However, I was thinking to address this only after some effort to better extract and parse the section title numbers (but for this too I was waiting for more training data).

I understand that properly supporting it will take more work (and data). My suggestion is only that parsing XML annotated according to the guidelines shouldn't log errors. For now, the parser could perhaps map the unsupported annotations to other sensible tags, e.g. box title and paragraphs to regular section title and paragraphs (which seems to be how they were extracted in at least one of the few examples).
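A minimal sketch of that kind of mapping, done as a preprocessing step on a training file before it reaches the parser. This is not GROBID code: the function name `unwrap_box_figures` is hypothetical, and it assumes that a box is annotated as `<figure type="box">` whose children are already regular `<head>` and `<p>` elements, in a file without XML namespaces.

```python
import xml.etree.ElementTree as ET


def unwrap_box_figures(root):
    """Replace each <figure type="box"> by its child elements, in place.

    Assumptions (not taken from GROBID): no XML namespace, no nested
    boxes, and only whitespace directly inside the <figure> element
    (that direct text is dropped).
    """
    for parent in list(root.iter()):
        # Walk the children backwards so insertions don't shift
        # the indices that are still to be visited.
        for index in reversed(range(len(parent))):
            child = parent[index]
            if child.tag != "figure" or child.get("type") != "box":
                continue
            inner = list(child)
            if not inner:
                continue  # nothing to promote, leave the element alone
            # Keep any text that followed the closing </figure> tag.
            inner[-1].tail = (inner[-1].tail or "") + (child.tail or "")
            parent.remove(child)
            for offset, element in enumerate(inner):
                parent.insert(index + offset, element)
    return root


if __name__ == "__main__":
    sample = (
        '<text><head>Intro</head>'
        '<figure type="box"><head>Box 1</head><p>Boxed text.</p></figure>'
        '<p>After.</p></text>'
    )
    print(ET.tostring(unwrap_box_figures(ET.fromstring(sample)), encoding="unicode"))
```

With this, the box title and paragraph end up as siblings of the surrounding section content, so a parser that only knows `<head>` and `<p>` would have nothing unknown to complain about.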

(BTW the 6000 bioRxiv training samples seem to contain just over 2000 section references and just under 100 references to boxed text.)

> About the <other>, it should be <note type="other"> indeed for consistency, but normally we just need to leave the part not annotated under <text> to get it as "other" label, which is more readable than adding a <note> mark-up.

That bit me before, though: the text then got the label of the surrounding tag instead of other. But I agree that leaving it un-annotated would be more intuitive.
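One way to see which text would depend on this implicit labelling is to list everything that sits directly under <text> without any mark-up. The following is only a sketch: `unannotated_fragments` is a hypothetical helper, not part of GROBID, and it assumes a non-namespaced <text> element that has already been parsed with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def unannotated_fragments(text_element):
    """Return the non-blank text found directly under the given element.

    Text inside child elements is ignored. Only the element's own
    leading text and the text following each direct child are reported.
    """
    fragments = []
    if text_element.text and text_element.text.strip():
        fragments.append(text_element.text.strip())
    for child in text_element:
        if child.tail and child.tail.strip():
            fragments.append(child.tail.strip())
    return fragments


if __name__ == "__main__":
    sample = (
        '<text>Running header<p>A paragraph.</p> stray footer '
        '<note type="other">explicit other</note></text>'
    )
    for fragment in unannotated_fragments(ET.fromstring(sample)):
        print(fragment)
```

Running such a check over a training file gives a short list of fragments to review, which makes it easier to verify whether they end up labelled as other or inherit a neighbouring label.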