proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

tei2folia failure on CLARIAH files VOC General Missives #35

Closed martinreynaert closed 3 years ago

martinreynaert commented 3 years ago

Hi proycon,

I would very much like to convert all 589 TEI files that Dirk Roorda (DANS) produced from the OCR-ed VOC 'Generale Missiven' (13 volumes, http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane) to FoLiA.

I get the following errors on this one:

https://github.com/Dans-labs/clariah-gm/blob/master/xml/01/p0099.xml

as well as on others from the same source.

I have no idea what is wrong, hope you can help!

Error:

(LMdev) reynaert@violet:FOLIA$ tei2folia p0099.xml
Instantiating XML parser
Converting p0099.xml
VALIDATION ERROR on full parse by library in p0099.xml
DeclarationError: Encountered an instance without proper declaration: Comment
! Unable to convert p0099.xml

Looking forward to your response! Thanks!

Martin

proycon commented 3 years ago

I see what goes wrong. tei2folia isn't very informative here because the conversion currently fails at a very early stage. The intermediate output is:

WARNING: Unknown tag: teiTrim (in )
<?xml version="1.0"?>
<comment xmlns="http://ilk.uvt.nl/folia" xmlns:folia="http://ilk.uvt.nl/folia">[tei2folia WARNING] Unhandled tag: teiTrim (in )</comment>

I've never seen teiTrim before, so I'll have to build in some support to recognize it as a root tag. You'll probably also need to specify --forcenamespace, because the TEI you reference doesn't use the proper XML namespaces (tei2folia can force the namespace anyway).
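
To see up front which files are affected by the two problems described above (an unexpected root tag and a missing TEI namespace), a small check along these lines may help. This is only a sketch using Python's standard library: the helper names `describe_root` and `find_problems` are made up for illustration, the sample document is a hypothetical stand-in rather than the actual content of p0099.xml, and the check is not how tei2folia itself decides anything.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"
# Root elements defined by TEI P5 (assumption: nothing else is acceptable).
EXPECTED_ROOTS = {"TEI", "teiCorpus"}


def describe_root(xml_text):
    """Return (local_name, namespace) of the root element; namespace may be None."""
    root = ET.fromstring(xml_text)
    tag = root.tag
    if tag.startswith("{"):
        # ElementTree writes namespaced tags as "{namespace}localname".
        namespace, local_name = tag[1:].split("}", 1)
        return local_name, namespace
    return tag, None


def find_problems(xml_text):
    """List human-readable reasons why a file may not convert cleanly."""
    local_name, namespace = describe_root(xml_text)
    problems = []
    if local_name not in EXPECTED_ROOTS:
        problems.append("unexpected root tag: " + local_name)
    if namespace != TEI_NS:
        problems.append("root element is not in the TEI namespace")
    return problems


# Hypothetical minimal document with the same root tag as in the warning above.
sample = "<teiTrim><text><body><p>example</p></body></text></teiTrim>"
```

For the hypothetical `sample`, `find_problems(sample)` reports both issues; a document whose root is `<TEI xmlns="http://www.tei-c.org/ns/1.0">` yields an empty list.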

proycon commented 3 years ago

Ok, this document converts now, even though the documents aren't really TEI P5 compliant. Could you test it on a few more and see if the conversion output is sane enough?
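
For testing on more of the 589 files, a batch run could look roughly like the sketch below. It relies only on the `tei2folia --forcenamespace <file>` invocation mentioned in this thread; the helper names `build_command` and `convert_all` are hypothetical, and it assumes (without having verified it) that tei2folia exits with a non-zero status when a conversion fails.

```python
import subprocess
from pathlib import Path


def build_command(tei_file):
    """Command line for one file, using only the flag mentioned in this thread."""
    return ["tei2folia", "--forcenamespace", str(tei_file)]


def convert_all(directory, run=subprocess.run):
    """Run tei2folia on every .xml file below `directory`.

    Returns a list of (path, stderr) pairs for the files that failed.
    Assumption: a non-zero exit status signals a failed conversion.
    """
    failures = []
    for tei_file in sorted(Path(directory).rglob("*.xml")):
        result = run(build_command(tei_file), capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((tei_file, result.stderr.strip()))
    return failures
```

Calling `convert_all("xml")` from a checkout of the clariah-gm repository would then return the files that still fail, together with their error output, which makes it easier to report any remaining problems here.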