proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

[tei2folia] Text body not getting converted from TEI5 doc #46

Closed (pirolen closed this issue 2 years ago)

pirolen commented 2 years ago

I have a few toy TEI5 XML documents that include \<w> and \<c> elements, with annotations as \<rs> elements. tei2folia generates output from them, but the document body is empty. What could be the reason? I am attaching the input and output docs.
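A quick way to confirm this symptom is to compare the number of word tokens in the TEI input with the content of the FoLiA body. The following is only a minimal sketch using the Python standard library. The helper names (`count_elements`, `body_is_empty`) and the two inline sample documents are hypothetical, made up for illustration, and it assumes the FoLiA body sits in a top-level `<text>` element.

```python
import xml.etree.ElementTree as ET

# Namespaces used by TEI P5 and FoLiA documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"
FOLIA_NS = "http://ilk.uvt.nl/folia"


def count_elements(root, namespace, local_name):
    """Count all elements with the given namespace and local name under root."""
    return sum(1 for _ in root.iter(f"{{{namespace}}}{local_name}"))


def body_is_empty(folia_root):
    """Return True if the top-level FoLiA <text> element is missing or has no content.

    Assumption: the document body is a direct <text> child of the root element.
    """
    text = folia_root.find(f"{{{FOLIA_NS}}}text")
    if text is None:
        return True
    return len(list(text)) == 0 and not (text.text or "").strip()


# Hypothetical toy documents, standing in for the attached input/output files.
tei_sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    "<w>Hello</w><c>,</c><w>world</w>"
    "</p></body></text></TEI>"
)
folia_sample = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.5.0">'
    '<metadata/><text xml:id="doc.text"/></FoLiA>'
)

tei_root = ET.fromstring(tei_sample)
folia_root = ET.fromstring(folia_sample)

tei_words = count_elements(tei_root, TEI_NS, "w")
if tei_words > 0 and body_is_empty(folia_root):
    print(f"Input has {tei_words} <w> elements, but the FoLiA body is empty")
```

For real files, `ET.parse(path).getroot()` can replace the `ET.fromstring(...)` calls.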

The TEI was generated by INCEpTION. It uses the DKPro Core TEI reader / writer which supports a subset of TEI. The elements are listed here: https://dkpro.github.io/dkpro-core/releases/2.2.0/docs/format-reference.html#format-Tei

N.B. I picked a TEI validation method more or less at random (https://trafilatura.readthedocs.io/en/latest/tutorial2.html), and the file did not validate against it.

I understand from the developers that the TEI reader / writer were developed using various TEI files from different sources as test material. If one has particular problems with data not validating, one can report this as an issue in the INCEpTION or DKPro Core GitHub issue trackers.

FA-MBK-4-3_035245008_0019_abpproc_entries.inctei.folia.xml.txt FA-MBK-4-3_035245008_0019_abpproc_entries.inctei.xml.txt

proycon commented 2 years ago

Hmm, good question. The TEI looks simple enough but something must be different from the TEI files I have used until now. The problem with TEI is that it comes in so many flavours.

Actually... I see it now fails on one of my own test files too. There must be some regression in tei2folia itself; I'll investigate.

proycon commented 2 years ago

This is fixed and released now (v2.5.4).

I also implemented support for <c> and <rs>, which were not supported before. Do note, however, that the implementation for <rs> does not map nicely to FoLiA <entity> elements, so this probably doesn't do what you want yet. In TEI, <rs> is used more as markup, so I mapped it to markup in FoLiA as well.

pirolen commented 2 years ago

Awesome, thank you very much! Will test this conversion scenario in more detail.