ghpaetzold / questplusplus

Pipelined quality estimation.

treetagger problem #15

Closed mfomicheva closed 8 years ago

mfomicheva commented 8 years ago

I am extracting sentence-level POS LM features for English. I've tried with different data, and TreeTagger breaks on the last 7-10 sentences of some of the files and outputs empty lines.

carolscarton commented 8 years ago

Hi, could you please send us one of the files that presented problems? The treetagger can break for different reasons and it is difficult to say what is going on without an example. Kind regards, Carol

mfomicheva commented 8 years ago

Thanks, Carol! Here is one of the files. newstest2014.onlineA.0.cs-en.txt

carolscarton commented 8 years ago

Hi, can you confirm the numbers of the features you are trying to extract? Are they 1086 and 1087? I have just tried it and it did not show any problems. Some quick questions (any of these could be the cause of the problem): are you using the latest version of QuEst++? Are you using the UTF-8 TreeTagger scripts? There are some comments about this in the sentence-level config file:

! please use utf8 version of the tree-tagger scripts AND
! utf8-tokenize.perl version available in tree tagger scripts under "cmd" directory.
! TOKENIZER=${CMD}/tokenize.pl should be changed with TOKENIZER=${CMD}/utf8-tokenize.perl in cmd/tree-tagger-xxx script

Are you using this configuration? Kind Regards, Carol
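For readers who want to check the tokenizer setting quoted from the config comments, here is a minimal Python sketch. It assumes the tree-tagger wrapper script contains a shell-style `TOKENIZER=...` assignment, as the comments describe. The helper names `tokenizer_setting` and `uses_utf8_tokenizer` are hypothetical and not part of QuEst++ or TreeTagger.

```python
import re

# Matches a shell-style assignment such as: TOKENIZER=${CMD}/tokenize.pl
# (assumption: the wrapper script sets TOKENIZER on a line of its own).
TOKENIZER_LINE = re.compile(r"^\s*TOKENIZER=(\S+)", re.MULTILINE)


def tokenizer_setting(script_text):
    """Return the value assigned to TOKENIZER, or None if no assignment is found."""
    match = TOKENIZER_LINE.search(script_text)
    return match.group(1) if match else None


def uses_utf8_tokenizer(script_text):
    """True if the script points TOKENIZER at utf8-tokenize.perl."""
    setting = tokenizer_setting(script_text)
    return setting is not None and setting.endswith("utf8-tokenize.perl")
```

To use it, read the text of your own wrapper script under the "cmd" directory (the script name depends on the language and is left as-is above) and pass it to `uses_utf8_tokenizer`. A `False` result would suggest the script still points at `tokenize.pl`.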

mfomicheva commented 8 years ago

Hi, Carol. Yes, the feature numbers are 1086 and 1087. I've checked everything you asked about and it all seems to be correct. I have tried using the tagger on its own and it works correctly. Actually, the missing lines are in the pos.XPOS file; the .pos file has no errors, so maybe the problem is with the wrapper. Also, when extracting the features I get several warnings like this: Failed to synchronize with tree-tagger's output on input line...
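A quick way to locate the missing lines described here is to compare the input file with the tagged output. The sketch below is only illustrative. It assumes the tagged file is expected to hold one line per input sentence, and the function names and file paths are hypothetical.

```python
from pathlib import Path


def find_empty_lines(lines):
    """Return 1-based indices of lines that are empty or whitespace-only."""
    return [i for i, line in enumerate(lines, start=1) if not line.strip()]


def compare_files(source_path, tagged_path):
    """Compare a source file with its (assumed line-aligned) tagged output.

    Returns (source_line_count, tagged_line_count, empty_tagged_line_numbers).
    """
    source = Path(source_path).read_text(encoding="utf-8").splitlines()
    tagged = Path(tagged_path).read_text(encoding="utf-8").splitlines()
    return len(source), len(tagged), find_empty_lines(tagged)


if __name__ == "__main__":
    # Hypothetical paths; substitute your own input and tagger output files.
    n_src, n_tag, empty = compare_files("input.en", "input.en.pos.XPOS")
    print(f"source lines: {n_src}, tagged lines: {n_tag}, empty tagged lines: {empty}")
```

If the empty lines cluster at the end of the tagged file while the line counts match, that is consistent with the wrapper losing synchronization with the tagger partway through, rather than with a problem in the tagger itself.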

carolscarton commented 8 years ago

Hi, these warnings happen when the input data contains special characters that QuEst cannot process. I took a look at your data and I did not find many of them. I tried again and I did not get any errors. Are you using tokenisation? I only got errors in the output when I removed the tokenisation option. If you are using the command line, the flag for tokenisation is -tok. For truecasing it is -case true (or lower, or no). Sorry, but this is not in the documentation yet.

mfomicheva commented 8 years ago

Hi Carol. I thought tokenisation was enabled by default, so it didn't occur to me to check. With -tok it works perfectly. Thanks!