proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

[tei2folia] Ensure the ID is suitable for use in FoLiA (ValueError: Invalid XML NCName identifier) #22

Closed: proycon closed this issue 3 years ago

proycon commented 3 years ago

Sometimes a document ID is extracted that is not a valid XML NCName, for example when converting http://worldviews.gei.de/rest/content/tei/CM_1989_FomenkyEtAl_HistoireDuCameroun_52/fre/ , as reported by @dietervu. More checks need to be implemented.

proycon commented 3 years ago

Unfortunately, a truly comprehensive check is not really feasible in XSLT 1.0. The above commit is rather patchy, but it should add at least a little more flexibility.

Users may be confronted with an error like:

VALIDATION ERROR on full parse by library in input.xml
ValueError: Invalid XML NCName identifier: <something>

In this case, the ID that the converter extracted from the TEI (heuristically, because TEI is ambiguous on this point) is not valid for use in FoLiA. Users can work around this by adding an explicit ID for FoLiA to their TEI header:

<publicationStmt>
 <idno type="ID">your_id</idno>
</publicationStmt>

Alternatively, contact us so we can adapt the converter if your collection uses a different idno type.
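To pick a value for the explicit idno, it can help to check a candidate ID before running the converter. The following is a minimal sketch in Python, not part of tei2folia: the function names `is_ncname_ascii` and `sanitize_id` and the `doc_` prefix are hypothetical, and the pattern is a deliberately conservative ASCII-only subset of the XML NCName rules (real NCNames also permit many non-ASCII letters).

```python
import re

# Conservative, ASCII-only approximation of an XML NCName (an assumption,
# not the full XML definition): it must start with a letter or underscore,
# followed by letters, digits, underscores, hyphens, or periods. No colons.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ncname_ascii(candidate: str) -> bool:
    """Return True if candidate matches the conservative ASCII pattern."""
    return _NCNAME_ASCII.fullmatch(candidate) is not None


def sanitize_id(candidate: str, prefix: str = "doc_") -> str:
    """Turn an arbitrary string into an ID that passes is_ncname_ascii.

    Hypothetical helper: disallowed characters become underscores, and
    a prefix is added when the result is empty or does not start with a
    letter or underscore.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_.\-]", "_", candidate.strip())
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = prefix + cleaned
    return cleaned


if __name__ == "__main__":
    for raw in ["CM_1989_FomenkyEtAl_HistoireDuCameroun_52", "1989 doc", "a/b c"]:
        print(raw, "->", sanitize_id(raw), is_ncname_ascii(sanitize_id(raw)))
```

The sanitized value could then be placed in the `<idno type="ID">` element shown above. Because the pattern is stricter than the real NCName grammar, an ID it rejects may still be valid XML; the check only errs on the side of caution.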