korpling / pepperModules-PAULAModules

This project provides an importer and an exporter for the linguistic converter framework Pepper (see http://corpus-tools.org/pepper/) to support the PAULA format.

PAULAExporter should not include anno namespaces in token anno type attribute by default #15

Closed amir-zeldes closed 7 years ago

amir-zeldes commented 8 years ago

In PAULA, namespaces are represented as file name prefixes. This mechanism is primarily used to namespace non-terminal nodes, whose annotations then also get that namespace - this mechanism works correctly.

However, for token nodes we generally do not use namespaces, and the primary use case for PAULA, merging, also implies that tokens from different sources can be made 'the same'. This is also the behavior of the merging module with PAULA output: if the tokens match, only one tok.xml is output, and input namespaces are stripped.

For token annotations, this behavior doesn't work correctly. If we are merging token annotations from different sources with different token-level namespaces, the PAULAExporter correctly removes prefixes from all file names, including token feats, but keeps the namespace in the featList's type attribute. This leads to strange annotation names with a period, which get retained in conversions to other formats. For example, if we set the PTBImporter's POS tag name to penn_pos, we get this PAULA output:

<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="ptb.penn_pos" xml:base="GUM_interview_ants.tok.xml">
        <feat xlink:href="#sTok1" value="NN"/>

The file itself does not have a 'ptb' prefix (correct, since this is a token annotation), but the anno name itself is strange. Retaining the namespace in this way may be desirable if we have multiple distinct token annotations with the same name (e.g. conflicting pos tags called 'pos'), but otherwise it is not. I suggest a property that makes this behavior optional and is false by default.
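To make the proposal concrete, here is a minimal sketch of how the type attribute could be built with such a switch. The class name, the method name and the includeNamespace flag are hypothetical and are not taken from Salt2PAULAMapper; the flag stands in for whatever exporter property ends up controlling this.

```java
import java.util.Objects;

// Hypothetical sketch, not the actual Salt2PAULAMapper code.
final class TokenAnnoTypeSketch {

    /**
     * Builds the value of the featList "type" attribute for a token annotation.
     *
     * @param namespace        annotation namespace, may be null or empty
     * @param name             annotation name, e.g. "penn_pos"
     * @param includeNamespace assumed exporter switch; false is the proposed default
     */
    static String featListType(String namespace, String name, boolean includeNamespace) {
        Objects.requireNonNull(name, "name");
        if (includeNamespace && namespace != null && !namespace.isEmpty()) {
            // Old behavior: dot notation, e.g. "ptb.penn_pos"
            return namespace + "." + name;
        }
        // Proposed default: plain annotation name
        return name;
    }

    public static void main(String[] args) {
        System.out.println(featListType("ptb", "penn_pos", false)); // prints penn_pos
        System.out.println(featListType("ptb", "penn_pos", true));  // prints ptb.penn_pos
    }
}
```

With the flag off, the example above would be exported as type="penn_pos"; turning it on keeps "ptb.penn_pos" for users who need to tell apart conflicting annotations with the same name.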

amir-zeldes commented 8 years ago

I think the problem is here:

https://github.com/korpling/pepperModules-PAULAModules/blob/220ae3741896988728be35821203eddb9eac4100/src/main/java/org/corpus_tools/peppermodules/paula/Salt2PAULAMapper.java#L662

thomaskrause commented 8 years ago

Shouldn't converting from PTB to PAULA directly without merging also trigger this behaviour? The code that generates the type string should always generate the dot notation when there is an annotation namespace. Also, the PAULA-to-Salt mapper has code to extract the namespace from the dot notation (https://github.com/korpling/pepperModules-PAULAModules/blob/220ae3741896988728be35821203eddb9eac4100/src/main/java/org/corpus_tools/peppermodules/paula/PAULA2SaltMapper.java#L635) and thus there should be no "bad" names later on.

What I don't understand is why this mechanism is in the code at all. As far as I understand the PAULA XML 1.1 specification, if I want to express the namespace for an annotation I have to encode it into the file-name. There is nothing about this special dot notation in the specification.

If I wanted namespaces for the token annotations themselves (not the nodes, so not a layer or something like that), how would I encode this according to the PAULA specification?

amir-zeldes commented 8 years ago

Yes, converting without merging also triggers this behavior, from any format in which the importer assigns a layer to the token annotations. I think this is also the cause of the "salt." prefixes in Salt Semantics annotations (since those get a layer). The point about merging was only to illustrate why I think that token annotations should not get a namespace by default (unless the user actively tries to do so).

I can't say where this mechanism comes from in general though: as you note it is not in the PAULA specification. I vaguely recall seeing this type of dot notation in some of the older PCC versions produced by the old merging routines, but at least from ANNIS2 onwards, these were never seen as what we consider to be namespaces in ANNIS. I think Julia Ritz might have written a script that removed them from PCC.

The correct PAULA way of making feat files that have namespaces is the same as in the rest of the format: using file prefixes. So if you actually call the file ptb.mycorpus.mydoc.tok_pos.xml then even if the tok file does not have this prefix and is called mycorpus.mydoc.tok.xml, the annotation itself should have a namespace.
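As a rough illustration of the file-prefix convention, the sketch below derives the namespace from a file name, assuming the document name ("mycorpus.mydoc" here) is already known to the caller. The class and method names are hypothetical, and this is not how the PAULA modules actually parse file names.

```java
import java.util.Optional;

// Hypothetical sketch of the file-prefix convention, not code from the PAULA modules.
final class PaulaFilePrefixSketch {

    /**
     * Returns the namespace prefix of a PAULA file name, if there is one.
     * Assumes file names follow "[namespace.]documentName.rest.xml".
     */
    static Optional<String> namespaceOf(String fileName, String documentName) {
        if (fileName.startsWith(documentName + ".")) {
            // No prefix before the document name, so no namespace
            return Optional.empty();
        }
        int idx = fileName.indexOf("." + documentName + ".");
        if (idx > 0) {
            // Everything before ".documentName." is treated as the namespace
            return Optional.of(fileName.substring(0, idx));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String doc = "mycorpus.mydoc";
        System.out.println(namespaceOf("ptb.mycorpus.mydoc.tok_pos.xml", doc)); // prints Optional[ptb]
        System.out.println(namespaceOf("mycorpus.mydoc.tok.xml", doc));         // prints Optional.empty
    }
}
```

So the pos annotation gets the namespace "ptb" from its own file name, while the token file it points to stays without one.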

I don't actually like the dependence on filenames in PAULA so much, but it's the most consistent way given the history of the format. I think if we ever codify a PAULA 1.2 we might want to consider allowing namespaces within the type attribute, but then I'd definitely propose to use colons (@type="ns:anno_name"). Dots are valid in filenames, unix words etc., whereas colons are reserved, and are already associated with NS semantics in XML.

For now I think we should stick to interpreting dots in types literally by default, though I left the option to produce the old behavior for backwards compatibility (not sure who wanted this or why).
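A minimal sketch of the two reading modes, for illustration only: the class name, the method name and the dotsAsNamespace flag are hypothetical, and splitting at the first dot is an assumption rather than a description of what PAULA2SaltMapper does.

```java
// Hypothetical sketch, not the actual PAULA2SaltMapper code.
final class FeatTypeParserSketch {

    /**
     * Splits a featList type value into {namespace, name}.
     * With dotsAsNamespace == false (the default described above) the whole
     * string is the annotation name and the namespace is null.
     * Assumption: in the old mode the split happens at the first dot.
     */
    static String[] parseType(String type, boolean dotsAsNamespace) {
        int dot = type.indexOf('.');
        if (dotsAsNamespace && dot > 0 && dot < type.length() - 1) {
            return new String[] { type.substring(0, dot), type.substring(dot + 1) };
        }
        // Literal interpretation: dots are just part of the name
        return new String[] { null, type };
    }

    public static void main(String[] args) {
        String[] literal = parseType("ptb.penn_pos", false);
        String[] legacy = parseType("ptb.penn_pos", true);
        System.out.println(literal[0] + " / " + literal[1]); // prints null / ptb.penn_pos
        System.out.println(legacy[0] + " / " + legacy[1]);   // prints ptb / penn_pos
    }
}
```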