mfomicheva closed 9 years ago
Hi, could you please send us one of the files that presented problems? TreeTagger can break for different reasons, and it is difficult to say what is going on without an example. Kind regards, Carol
Thanks, Carol! Here is one of the files. newstest2014.onlineA.0.cs-en.txt
Hi, can you confirm the numbers of the features you are trying to extract? Are they 1086 and 1087? I have just tried and it did not show any problems. A few quick questions about possible causes of the problem: are you using the latest version of QuEst++? Are you using the UTF-8 TreeTagger scripts? There are some comments about this in the sentence-level config file:
! please use utf8 version of the tree-tagger scripts AND
! utf8-tokenize.perl version available in tree tagger scripts under "cmd" directory.
! TOKENIZER=${CMD}/tokenize.pl should be changed with TOKENIZER=${CMD}/utf8-tokenize.perl in cmd/tree-tagger-xxx script
Are you using this configuration? Kind Regards, Carol
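The tokenizer setting mentioned in those config comments can be checked with a short script. This is only a minimal sketch, not part of QuEst++ or TreeTagger: the function name `uses_utf8_tokenizer` and the sample strings are hypothetical, and it assumes the wrapper script sets the tokenizer on a line that starts with `TOKENIZER=`, as in the comment quoted above.

```python
def uses_utf8_tokenizer(script_text):
    """Return True if every TOKENIZER= assignment points at utf8-tokenize.perl.

    Returns False if any assignment points elsewhere, or if the text
    contains no TOKENIZER= assignment at all.
    Assumption: the assignment sits on its own line, with no trailing comment.
    """
    assignments = [
        line.strip()
        for line in script_text.splitlines()
        if line.strip().startswith("TOKENIZER=")
    ]
    return bool(assignments) and all(
        a.endswith("utf8-tokenize.perl") for a in assignments
    )


if __name__ == "__main__":
    # Hypothetical one-line excerpts of a cmd/tree-tagger-xxx script.
    print(uses_utf8_tokenizer("TOKENIZER=${CMD}/tokenize.pl\n"))         # False
    print(uses_utf8_tokenizer("TOKENIZER=${CMD}/utf8-tokenize.perl\n"))  # True
```

To check a real script, read its contents into a string and pass that string to the function.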
Hi, Carol. Yes, the numbers of features are 1086 and 1087. I've checked what you're asking and everything seems to be correct. I have tried using the tagger on its own and it works correctly. Actually, the missing lines are in the pos.XPOS file. The .pos file has no errors. So maybe the problem is with the wrapper. Also, when extracting the features I get several warnings like this: Failed to synchronize with tree-tagger's output on input line...
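One way to locate the missing lines in an output file such as pos.XPOS is to list the line numbers that are empty. The following is a minimal sketch under the assumption that the file is plain UTF-8 text and that a "missing" line shows up as an empty or whitespace-only line; the function names are hypothetical and not part of QuEst++.

```python
def empty_line_numbers(lines):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [n for n, line in enumerate(lines, start=1) if not line.strip()]


def empty_lines_in_file(path):
    """Apply empty_line_numbers to a UTF-8 text file at the given path."""
    with open(path, encoding="utf-8") as handle:
        return empty_line_numbers(handle)


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for tagger output.
    sample = ["DT NN VBZ\n", "\n", "PRP VBP\n", "   \n"]
    print(empty_line_numbers(sample))  # [2, 4]
```

If the reported numbers cluster at the end of the file, that matches the symptom of the tagger output breaking on the last sentences.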
Hi, these warnings happen when the input data contains special characters that QuEst cannot process. I took a look at your data and did not find many of them. I tried again and did not get any errors. Are you using tokenisation? I only got errors in the output when I removed the tokenisation option. If you are using the command line, the flag for tokenisation is -tok. For truecasing it is -case true (or lower, or no). Sorry, but this is not in the documentation yet.
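A quick scan of the input can show whether unusual characters are present at all. The thread does not say which characters QuEst cannot process, so this sketch makes its own assumption: it flags characters in the Unicode "Other" categories (control, format, unassigned, and similar), except tab. The function name is hypothetical.

```python
import unicodedata


def suspicious_characters(line):
    """Return (column, code point) pairs for characters in Unicode 'C*' categories.

    Assumption: "special" means control/format/unassigned characters; this is
    a guess, not QuEst's actual rule. Tabs and the trailing newline are
    ignored; a carriage return is reported.
    """
    found = []
    for col, ch in enumerate(line.rstrip("\n"), start=1):
        if ch != "\t" and unicodedata.category(ch).startswith("C"):
            found.append((col, f"U+{ord(ch):04X}"))
    return found


if __name__ == "__main__":
    print(suspicious_characters("plain text\n"))        # []
    print(suspicious_characters("zero\u200bwidth\n"))   # [(5, 'U+200B')]
```

Ordinary accented letters, such as those in Czech text, are not flagged by this check, so it only narrows down candidates rather than explaining every warning.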
Hi Carol. I thought tokenisation was by default, so it didn't occur to me to check. With -tok it works perfectly. Thanks!
I am extracting sentence-level POS LM features for English. I've tried with different data, and TreeTagger breaks on the last 7-10 sentences of some of the files and outputs empty lines.