kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Inconsistent training data tei namespace (no namespace vs TEI namespace) #623

Open de-code opened 3 years ago

de-code commented 3 years ago

Hi,

The namespace used in the training data seems to be inconsistent.

For example, for the segmentation, header, reference-segmenter, fulltext and date models, the TEI files don't use any namespace.

Whereas the training files for the citation and affiliation-address models use the TEI namespace http://www.tei-c.org/ns/1.0.

This makes it a bit more difficult to work with the training data in a more generic way. Is there any reason for the difference?

kermitt2 commented 3 years ago

Hello Daniel,

Yes, the original reason is that for some models the files are indeed valid TEI, so they have the namespace specified, and for others the files are more pseudo-TEI than real TEI, so they don't have a namespace.

Possible reasons for the lack of TEI validity: I didn't spend time finding a way to encode some fields properly, or there was no way to encode this kind of information, or the TEI encoding was too heavy/unreadable...

It doesn't really matter in Java, but I guess in Python lxml is a bit more tedious to use with namespaces.
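To illustrate the point about namespaces in Python, here is a minimal sketch of a namespace-agnostic lookup. It uses the standard library `xml.etree.ElementTree` rather than lxml, and the two XML snippets are hypothetical miniature samples, not actual GROBID training files. The `{*}` wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical miniature samples, not actual GROBID training files.
PLAIN_XML = "<tei><text><listBibl><bibl>Ref A</bibl></listBibl></text></tei>"
NAMESPACED_XML = (
    f'<TEI xmlns="{TEI_NS}"><text><back><listBibl>'
    "<bibl>Ref B</bibl></listBibl></back></text></TEI>"
)


def find_bibl_texts(xml_text):
    """Return the text of every bibl element, whatever its namespace."""
    root = ET.fromstring(xml_text)
    # "{*}bibl" matches bibl in any namespace or in none (Python 3.8+).
    return [bibl.text for bibl in root.findall(".//{*}bibl")]


# A plain, unqualified query only sees elements without a namespace.
plain_query_on_namespaced = ET.fromstring(NAMESPACED_XML).findall(".//bibl")

print(find_bibl_texts(PLAIN_XML))
print(find_bibl_texts(NAMESPACED_XML))
print(len(plain_query_on_namespaced))
```

The unqualified `.//bibl` query finds nothing in the namespaced sample, which is the kind of friction described here; the wildcard form lets one function handle both kinds of file.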

The easiest would be simply to remove the TEI namespaces from all the training files, nobody really cares I think :)

TEI validity is more relevant for the TEI results, because it's important for interoperability and for proposing an encoding where one type of information is encoded in a unique way (thus the TEI customization). The training data is more for internal usage.

de-code commented 3 years ago

Hi Patrice,

Thank you for the quick response and for explaining the reason.

For evaluation I had actually stripped the namespaces to make the files easier to use. But for my auto-annotation I still need to produce the correct output XML. It is a bit more verbose, although I can't really argue against using namespaces.

I guess the confusing bit is that they follow the same structure. The training data for the reference-segmenter and reference models both use text, listBibl and bibl elements (but differ in namespace usage), and listBibl is inside the back element only for the reference model.
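Stripping namespaces for evaluation can be sketched roughly as follows. This is an illustrative helper written with the standard library `xml.etree.ElementTree`, not the code actually used here; the function name `strip_namespaces` and the sample document are hypothetical. It only rewrites element tags, so namespaced attributes such as `xml:id` are left untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample, not an actual GROBID training file.
SAMPLE_XML = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back><listBibl>'
    "<bibl>Ref A</bibl><bibl>Ref B</bibl>"
    "</listBibl></back></text></TEI>"
)


def strip_namespaces(root):
    """Remove the namespace part from every element tag, in place."""
    for element in root.iter():
        # Namespaced tags are stored as "{uri}local"; comments and
        # processing instructions have a non-string tag, so skip them.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE_XML))
# After stripping, unqualified paths work on formerly namespaced files.
bibl_texts = [bibl.text for bibl in root.findall("./text/back/listBibl/bibl")]
print(root.tag, bibl_texts)
```

This is convenient for read-only use such as evaluation, but as noted, anything that has to write the training files back out still needs to emit the namespace expected for the model in question.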

Maybe it is not a big enough issue. Should I close it?

kermitt2 commented 3 years ago

It would be good at some point to review all the training data and make it consistent (and TEI valid, to complete the exercise!). Let's keep the issue open :)