korpling / pepperModules-MergingModule

This project provides a Pepper module for the merging of data on several possible levels.

Correct way to merge XML escapes #9

Open amir-zeldes opened 8 years ago

amir-zeldes commented 8 years ago

@thomaskrause @FlorianZipser :

What is the correct way to merge XML escapes with plain-text versions of themselves? For example, suppose I have a conll file with a literal ampersand token & and a TreeTagger file with the escaped token &amp;:

tt/corp1/doc1.tt:

&amp;   NN  &amp;
bla VV  bla

dep/corp1/doc1.conll10:

1   &   _   NN  _   _   0   root    _   _
2   bla _   VV  _   _   1   dep _   _

Since I can't literally map an ampersand to something in pepperparams (it's an XML violation and crashes), I tried the following merging customization:

<property key="escapeMapping">"&amp;":"&amp;amp;"</property>

This assumes the merger reads the first string as a plain ampersand, and the mapped replacement as a plain ampersand followed by 'amp;', which is what we have in the TT file. I've tried some other variants, but nothing seems to work: I get no error, but the annotations from whichever module comes second are missing. If I set both documents to literally contain &amp;, then I get:

Cannot start the traversing for merging document-structure, since no tokens exist for document

So what is the correct way to make sure a plain & is paired with escaped representations?
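
The assumption about how the property value is decoded can be checked outside Pepper. The sketch below only shows what a standard XML parser hands to an application for that property line; whether Pepper and the merger treat the value the same way is not confirmed here, and the helper name `decoded_property_value` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The property line from the question, as it would appear in the workflow file.
PROPERTY_XML = '<property key="escapeMapping">"&amp;":"&amp;amp;"</property>'


def decoded_property_value(xml_fragment: str) -> str:
    """Return the element text after the XML parser has resolved entities."""
    return ET.fromstring(xml_fragment).text


value = decoded_property_value(PROPERTY_XML)
print(value)  # "&":"&amp;"
```

So after one round of XML decoding the mapping reads: plain ampersand on the left, the five-character string &amp; on the right.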

amir-zeldes commented 8 years ago

OK, it's looking like my problem has two parts:

  1. The TreeTagger format is not interpreted as XML, at least not in the token parts. I can merge RST + conll successfully, where RST has &amp; and conll has literal &, but not TT. I may solve this using a TT importer property, either targeting XML escapes or generally allowing token string substitutions.
  2. In what I think is a bug, the function getRoots() does not return tokens that are not covered by any non-terminal annotation. This is what causes the importer to ignore all of my second source's annotations without an error message. When I added sentence nodes to conll using <property key="conll.SENTENCE">TRUE</property> this problem disappeared. I'm treating this as a bug and can fix it myself.
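
Until an importer property exists for point 1, one possible workaround is to unescape the TreeTagger file before it reaches Pepper. This is a minimal sketch, not part of any Pepper module: the function names are hypothetical, it assumes tab-separated columns, and it assumes that lines without a tab that start with < are SGML markup to be left untouched.

```python
from xml.sax.saxutils import unescape


def unescape_tt_line(line: str) -> str:
    """Unescape &amp;, &lt; and &gt; in the columns of one TreeTagger line."""
    stripped = line.rstrip("\n")
    # Assumption: markup lines such as <s> contain no tab and start with "<".
    if "\t" not in stripped and stripped.startswith("<"):
        return stripped
    return "\t".join(unescape(col) for col in stripped.split("\t"))


def unescape_tt(text: str) -> str:
    """Apply unescape_tt_line to every line of a TreeTagger document."""
    return "\n".join(unescape_tt_line(line) for line in text.splitlines())


sample = "<s>\n&amp;\tNN\t&amp;\nbla\tVV\tbla\n</s>"
print(unescape_tt(sample))
```

With the tokens already in plain form, both sources carry a literal & and no escape mapping is needed for this case.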

As an aside, I had a look at the merging code in createBaseTextNormOriginalMapping() in the merger mapper - I can see why it doesn't allow multichar > single char replacements, but that's really too bad. That would have solved the problem in a more general way, but I don't have enough time to deal with it right now. It would have to be totally reworked to allow literal &amp; coming from, say, TT, to be replaced with &.
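
To make the difficulty concrete: once a replacement can shrink several characters into one, every character of the normalized text still has to point back to a position in the original text. The sketch below is one way to keep that bookkeeping; it is not how createBaseTextNormOriginalMapping() is written, and `normalize_with_offsets` is a hypothetical name.

```python
def normalize_with_offsets(text, replacements):
    """Apply string replacements left to right.

    Returns (normalized, offsets), where offsets[i] is the index in `text`
    at which the source of normalized character i begins.
    """
    # Try longer keys first so a multi-character key wins over a shorter one.
    keys = sorted((k for k in replacements if k), key=len, reverse=True)
    out, offsets = [], []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                # Every output character of the replacement maps to the key's start.
                for ch in replacements[key]:
                    out.append(ch)
                    offsets.append(i)
                i += len(key)
                break
        else:
            # No key matched here: copy the character through unchanged.
            out.append(text[i])
            offsets.append(i)
            i += 1
    return "".join(out), offsets


normalized, offsets = normalize_with_offsets("&amp;bla", {"&amp;": "&"})
print(normalized, offsets)  # &bla [0, 5, 6, 7]
```

Two base texts normalized this way can be compared character by character, and the offsets lists recover the original spans afterwards.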