korpling / pepper

A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used stand-alone as a command line interface, or be integrated as an API into other software products.
http://corpus-tools.org/pepper

Converting with genericXMLImporter to exmaralda and to ANNIS #117

Open CarolinOdebrecht opened 6 years ago

CarolinOdebrecht commented 6 years ago

This issue covers several aspects concerning the genericXMLImporter, the tokenizer manipulator, the exmaralda exporter and the ANNISexporter. It is not entirely clear which module causes which behaviour. The MWE is attached, together with the Pepper workflow file and the export results. The conversion was run with Pepper_2018.01.26-SNAPSHOT.

  1. Aspect: Converting the minimal xml-file to ANNIS and to exmaralda produces output that contains a duplicated token with no further annotation. In this case, it is the word form ‘weiter’. Why do both exporters insert duplicate tokens? In ANNIS, this token is not part of the document itself: it does not show up in the document browser, only in the matches.
  2. Aspect: The genericXMLImporter property ‘genericXml.importer.prefixSAnnotationName’ does not work, neither for the ANNISexporter nor for the exmaraldaExporter. It is ignored completely.
  3. Aspect: The exmaralda output contains a mixed-up token sequence: the tokens are not in the order in which they appear in the xml-file. Furthermore, the span annotations are not represented correctly. The span type=‘bla’ should cover the text ‘er geht noch’. In the exmaralda file, the span covers ‘weiter Das ist ein er geht noch’.
  4. Aspect: The ANNISexporter does not treat all spans as spans (when using the property genericXml.importer.asSSpan). The spans ‘d’ and ‘falsch’ are represented as token annotations. The span ‘xxx’ is not represented at all in the ANNIS output.
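
For anyone trying to reproduce aspects 1 and 3, the symptoms can be checked mechanically once the token sequences have been extracted from the input and from an export. The following is only a minimal sketch: the function names are hypothetical, the token lists are made-up illustrations that reuse words from this report (they are not the actual content of the MWE), and reading the tokens out of the xml, exmaralda or ANNIS files is assumed to have happened elsewhere.

```python
from collections import Counter


def duplicated_tokens(source_tokens, exported_tokens):
    """Return tokens that occur more often in the export than in the source."""
    extra = Counter(exported_tokens) - Counter(source_tokens)
    return dict(extra)


def same_order(source_tokens, exported_tokens):
    """True only if the export keeps exactly the source token sequence."""
    return list(source_tokens) == list(exported_tokens)


def span_text(tokens, start, end):
    """Text covered by a span given as a half-open token index range."""
    return " ".join(tokens[start:end])


# Illustrative data only (assumption): not taken from the attached MWE.
source = ["Das", "ist", "ein", "er", "geht", "noch", "weiter"]
exported = ["weiter", "Das", "ist", "ein", "er", "geht", "noch", "weiter"]

print(duplicated_tokens(source, exported))  # extra tokens in the export
print(same_order(source, exported))         # token order preserved?
print(span_text(source, 3, 6))              # what the span should cover
print(span_text(exported, 0, 7))            # what a shifted span covers
```

Running such a check separately against the exmaralda and the ANNIS output might help to narrow down whether the importer, the tokenizer or one of the exporters introduces the extra token.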
CarolinOdebrecht commented 6 years ago

MWE4.zip