Hello Daniel,
Yes, the original reason is that for some models the files are indeed valid TEI - they then have the namespace specified - and for others the files are more pseudo-TEI than real TEI, so they don't have a namespace.
Possible reasons for non-TEI validity: I didn't spend time finding a way to encode some fields properly, or there was no way to encode this kind of information, or the TEI encoding was too heavy/unreadable...
It doesn't really matter in Java, but I guess lxml in Python is a bit more tedious to use with namespaces.
The easiest would be simply to remove the TEI namespaces from all the training files; nobody really cares, I think :)
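Something along these lines with lxml should be enough - just a sketch, the file names below are placeholders:

```python
from lxml import etree

def strip_namespaces(path_in, path_out):
    """Rewrite a training file with all namespaces removed from element tags."""
    tree = etree.parse(path_in)
    for el in tree.iter():
        # comments and processing instructions have a non-string tag, skip them
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = etree.QName(el).localname
    # drop the now-unused namespace declarations
    etree.cleanup_namespaces(tree.getroot())
    tree.write(path_out, encoding="utf-8", xml_declaration=True)

strip_namespaces("sample.training.references.tei.xml",
                 "sample.training.references.no-ns.tei.xml")
```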
TEI validity is more relevant for the TEI results, because it's important for interoperability and for proposing an encoding where one type of information is encoded in a unique way (thus the TEI customization). For training data, it's more for internal usage.
Hi Patrice,
Thank you for the quick response and for explaining the reason.

For evaluation I had actually stripped the namespaces to make things easier. But for my auto-annotation I still need to produce the correct output XML, which is a bit more verbose, although I can't really argue against using namespaces.
I guess the confusing bit is that they follow the same structure. The training data for the `reference-segmenter` and `reference` models both use `text`, `listBibl` and `bibl` elements (but differ in namespace usage). (And `listBibl` is inside the `back` element only for the `reference` model.)
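For what it's worth, a namespace-agnostic XPath makes it possible to read both variants the same way; here is a minimal sketch with lxml (the file name is just a placeholder):

```python
from lxml import etree

# parse one training file, with or without the TEI namespace declared
tree = etree.parse("sample.training.references.tei.xml")

# local-name() matches <bibl> elements regardless of namespace
for bibl in tree.xpath("//*[local-name()='bibl']"):
    print("".join(bibl.itertext()).strip())
```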
Maybe it is not a big enough issue. Should we close it?
It would be good at some point to review and make all the training data consistent (and TEI-valid, to complete the exercise!), so let's keep the issue open :)
Hi,
The namespaces in the training data seem to be inconsistent.

For example, for the `segmentation`, `header`, `reference-segmenter`, `fulltext` and `date` models, the TEI files don't use any namespace, whereas the TEI namespace `http://www.tei-c.org/ns/1.0` is used for the `citation` and `affiliation-address` models.

This makes it a bit more difficult to work with the training data in a more generic way. Is there any reason for the difference?
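For example, a quick check along these lines shows which files declare the namespace (the dataset path is just an assumption about the local layout):

```python
from pathlib import Path
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# report, for each training file, whether its root element is in the TEI namespace
for path in Path("grobid-trainer/resources/dataset").rglob("*.xml"):
    root = etree.parse(str(path)).getroot()
    has_ns = isinstance(root.tag, str) and root.tag.startswith("{" + TEI_NS + "}")
    print(f"{'TEI ns' if has_ns else 'no ns '}  {path}")
```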