proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

tei2folia: autodeclare should be enabled? #52

Open pirolen opened 1 year ago

pirolen commented 1 year ago

I installed foliatools with pip. When I ran tei2folia on the attached small test file, I got the error below.

In foliapy's main.py I see the comment "#autodeclare is enabled (default for FoLiA v2)".

$ tei2folia  --traceback  /home/pirol/quanti/devel/diagn/collate1.tei.xml -o /home/pirol/quanti/devel/diagn/
Instantiating XML parser
Converting /home/pirol/quanti/devel/diagn/collate1.tei.xml
VALIDATION ERROR on full parse by library in /home/pirol/quanti/devel/diagn/collate1.tei.xml
DeclarationError: Encountered an instance without proper declaration: Comment <comment>!
-- Full traceback follows -->
Traceback (most recent call last):
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/foliatools/tei2folia.py", line 86, in convert
    doc = folia.Document(tree=transformed, debug=kwargs.get('debug',0))
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 7427, in __init__
    self.parsexml(kwargs['tree'])
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 8646, in parsexml
    return Class.parsexml(node,self)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3575, in parsexml
    return super(Comment,Class).parsexml(node, doc, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3416, in parsexml
    instance = Class(doc, *args, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3546, in __init__
    super(Comment,self).__init__(doc, *args, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 659, in __init__
    kwargs = self.parsecommonarguments(doc, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 787, in parsecommonarguments
    self.checkdeclaration()
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 1190, in checkdeclaration
    raise DeclarationError("Encountered an instance without proper declaration: " + self.__class__.__name__ + " <" + self.__class__.XMLTAG + ">!")
folia.main.DeclarationError: Encountered an instance without proper declaration: Comment <comment>!
Unable to convert  /home/pirol/quanti/devel/diagn/collate1.tei.xml

collate1.tei.xml.txt

proycon commented 1 year ago

The autodeclare applies only if you use foliapy to create a new FoLiA resource. You'd probably expect tei2folia would use that, but tei2folia creates the initial FoLiA via XSLT and then uses foliapy for various postprocessing, so in this case the autodeclare was never used.

This looks like a bug in tei2folia: it should indeed have declared the comment. I'll look into it.

proycon commented 1 year ago

OK, it's not a missing declaration after all; it breaks because the input is unexpected. I already couldn't imagine that the comment declaration was missing (it's always there by default). Something goes completely wrong when parsing this TEI input. The initial output from the XSLT processor is:

$ xsltproc ~W/foliatools/foliatools/tei2folia.xsl collate1.tei.xml
WARNING: Unknown tag: cx:apparatus (in )
<?xml version="1.0"?>
<comment xmlns="http://ilk.uvt.nl/folia" xmlns:folia="http://ilk.uvt.nl/folia">[tei2folia WARNING] Unhandled tag: cx:apparatus (in )</comment>

cx:apparatus is your root tag there and tei2folia has no idea what that is. It expects a <TEI> node at the root. In fact, cx is an entirely different namespace (http://interedition.eu/collatex/ns/1.0), probably some extension to TEI? There doesn't seem to be a similarly named element in the TEI P5 guidelines.
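
To catch this kind of input before running the converter, the root element can be inspected first. The following is a minimal sketch using only the Python standard library; it is not part of foliatools. The helper names (split_tag, root_info, looks_like_tei) are hypothetical, the inline sample is a made-up stand-in for the attached file, and the check assumes that a root element with the local name TEI in the usual TEI namespace (http://www.tei-c.org/ns/1.0) is what the converter wants.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}local' into (namespace, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def root_info(xml_text):
    """Return (namespace, local name) of the document's root element."""
    root = ET.fromstring(xml_text)
    return split_tag(root.tag)


def looks_like_tei(xml_text):
    """True if the root element is <TEI> in the TEI namespace (hypothetical pre-check)."""
    return root_info(xml_text) == (TEI_NS, "TEI")


# Made-up stand-in for a CollateX export whose root is cx:apparatus.
sample = (
    '<cx:apparatus xmlns:cx="http://interedition.eu/collatex/ns/1.0" '
    'xmlns="http://www.tei-c.org/ns/1.0">some text</cx:apparatus>'
)

namespace, local = root_info(sample)
print(f"root element: {local} (namespace: {namespace})")
print("looks like TEI:", looks_like_tei(sample))
```

For an actual file, the same check can be applied to ET.parse(path).getroot().tag instead of a string.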

I also see <rdg> and <app> elements in your document, which the converter doesn't know yet either (but those do seem to be valid TEI). If you want support for such documents, I'll have to investigate how to best map these elements to FoLiA. I see this documentation covers it nicely.

pirolen commented 1 year ago

Ah, my bad for not spotting that. I simply copied the invalid TEI file from a demo GUI without looking closer. In fact, I was trying to see how FoLiA could render (span?) annotations for variations among different versions of an edited text. Of course this is very specific and I cannot expect it to be covered by the converter. I wonder if FLAT could visualize such spans well, since, as in my usual use case, the goal is to let end users correct errors, in this case false alignments and HTR errors.

E.g. alignments such as

[Screenshot 2023-02-08 at 17 13 02]