korpling / pepperModules-TreetaggerModules

This project provides an importer and an exporter to support the TreeTagger format in the linguistic converter framework Pepper (see http://corpus-tools.org/pepper/). The TreeTagger is a natural language processing tool that annotates text with part-of-speech and lemma annotations. A detailed description of the importer can be found in section TreeTaggerImporter, and a description of the exporter can be found in section TreeTaggerExporter.

Metadata containing XML escapes is not unescaped #20

Closed amir-zeldes closed 6 years ago

amir-zeldes commented 6 years ago

It's not possible to have metadata values like:

<meta URL="<a href='X'>bla</a>">

This fails even though, technically, angle brackets don't need to be escaped inside attribute values. Writing the value with escapes, as below, imports fine, but the value then stays escaped in the relANNIS output:

<meta URL="&lt;a href='X'&gt;bla&lt;/a&gt;">

I think it is fine to encode values like this (with escapes) in TT files, but the correct behavior is for the Salt model to then contain the unescaped values (with real '<' etc.). This would result in correct ANNIS output, and other modules would be responsible for escaping properly in their own metadata writers.
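To make the expected behavior concrete, here is a minimal sketch of the kind of unescaping step the importer could apply to an attribute value before storing it in the Salt model. It is not the module's actual code: the class name XmlUnescapeSketch and the method unescape are hypothetical, and it assumes only the five predefined XML entities plus numeric character references need decoding.

```java
import java.util.Map;

// Hypothetical helper, not part of pepperModules-TreetaggerModules.
final class XmlUnescapeSketch {

    // The five entities predefined by XML 1.0.
    private static final Map<String, String> PREDEFINED = Map.of(
            "lt", "<", "gt", ">", "amp", "&", "quot", "\"", "apos", "'");

    // Decodes entity references in one left-to-right pass, so that
    // "&amp;lt;" becomes "&lt;" and is not decoded a second time.
    static String unescape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            int end = (c == '&') ? value.indexOf(';', i + 1) : -1;
            String decoded = (end > i + 1) ? decode(value.substring(i + 1, end)) : null;
            if (decoded == null) {
                out.append(c);          // ordinary character or unknown reference: keep as-is
                i++;
            } else {
                out.append(decoded);    // recognized reference: emit the real character
                i = end + 1;
            }
        }
        return out.toString();
    }

    // Returns the replacement text for a reference name such as "lt", "#60"
    // or "#x3C", or null if the name is not a recognized reference.
    private static String decode(String name) {
        String predefined = PREDEFINED.get(name);
        if (predefined != null) {
            return predefined;
        }
        if (name.length() < 2 || name.charAt(0) != '#') {
            return null;
        }
        try {
            boolean hex = name.charAt(1) == 'x';
            int codePoint = hex
                    ? Integer.parseInt(name.substring(2), 16)
                    : Integer.parseInt(name.substring(1), 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The escaped metadata value from this issue.
        System.out.println(unescape("&lt;a href='X'&gt;bla&lt;/a&gt;"));
    }
}
```

With something like this applied on import, the escaped example above would reach the Salt model as <a href='X'>bla</a>, and each exporter would remain responsible for re-escaping the value for its own output format.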

amir-zeldes commented 6 years ago

I think the problem is here:

https://github.com/korpling/pepperModules-TreetaggerModules/blob/7b871ec552e065ed25d6931a5cb8b11f16dd5162/src/main/java/org/corpus_tools/peppermodules/treetagger/model/serialization/deserializer/Deserializer.java#L225

Coming from here:

https://github.com/korpling/pepperModules-TreetaggerModules/blob/ed6be24b01a1aa36a93f7dd60644b89ffa678b31/src/main/java/org/corpus_tools/peppermodules/treetagger/model/serialization/deserializer/XMLUtils.java#L210

Values are handled as plain strings in XMLUtils.java, so entity references in them are never decoded.