Open amir-zeldes opened 8 years ago
OK, it's looking like my problem has two parts:
`&` and conll has literal `&`, but not TT. I may solve this using a TT importer property, either targeting XML escapes or generally allowing token string substitutions.
<property key="conll.SENTENCE">TRUE</property>
this problem disappeared. I'm treating this as a bug and can fix it myself.

As an aside, I had a look at the merging code in createBaseTextNormOriginalMapping() in the merger mapper. I can see why it doesn't allow multichar > single char replacements, but that's really too bad. Allowing them would have solved the problem in a more general way, but I don't have enough time to deal with it right now. The mapping would have to be totally reworked to allow a literal `&amp;` coming from, say, TT, to be replaced with `&`.
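To illustrate what a multichar > single char replacement needs from the normalized-to-original mapping, here is a minimal Python sketch. It is not Pepper's code and does not reproduce createBaseTextNormOriginalMapping(); the function name `normalize_with_mapping` and its signature are made up for this example. The idea is only that each character of the normalized text remembers the offset in the original text it came from, so a five-character `&amp;` can collapse to one `&` without losing alignment.

```python
def normalize_with_mapping(text, replacements):
    """Hypothetical helper (not part of Pepper).

    Returns (normalized, mapping), where mapping[i] is the offset in
    `text` of the span that produced normalized[i].
    """
    # Longest keys first, so "&amp;" is matched before any shorter key.
    # Empty keys are dropped because they would never advance the cursor.
    keys = sorted((k for k in replacements if k), key=len, reverse=True)
    out = []
    mapping = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                # Every output character points back at the start of the
                # replaced span; an empty replacement emits nothing.
                for ch in replacements[key]:
                    out.append(ch)
                    mapping.append(i)
                i += len(key)
                break
        else:
            # No replacement applies here: copy the character unchanged.
            out.append(text[i])
            mapping.append(i)
            i += 1
    return "".join(out), mapping


normalized, mapping = normalize_with_mapping("R&amp;D", {"&amp;": "&"})
print(normalized, mapping)
```

With this kind of mapping, a token found in the normalized text can still be traced back to its character offset in the original base text, whatever the length of the replaced string.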
@thomaskrause @FlorianZipser :
What is the correct way to merge XML escapes with plain text versions of themselves? For example, suppose I have a conll file with a literal ampersand token `&` and a TreeTagger file with a token `&amp;`:
tt/corp1/doc1.tt:
dep/corp1/doc1.conll10
Since I can't literally map an ampersand to something in pepperparams (it's an XML violation and crashes), I tried the following merging customization:
This basically assumes the merger reads the first character as 'normal ampersand', and the mapped replacement as 'normal ampersand' followed by 'amp;', which is what we have in the TT file. I've played around with some other options, but nothing seems to work. I get no error, but no annotations from whichever module is second. If I set both documents to literally have `&`, then I get:
Cannot start the traversing for merging document-structure, since no tokens exist for document
So what is the correct way to make sure a plain `&` is paired with escaped representations?