Closed GoogleCodeExporter closed 9 years ago
The bad thing is that detokenizer.perl does not use the rules from the
nonbreaking_prefixes directory; it has some rules hard-coded inside itself.
The question is whether quality suffers when tokenizer.perl tokenizes
segments one way and the detokenizer does not follow the same scheme ...
If detokenizer.perl does not use the same standard rules as tokenizer.perl, I don't
see a reasonable way to fix this bug. I don't want to modify
detokenizer.perl, since that would bring a special development branch of Moses
into existence ...
Original comment by xhu...@gmail.com
on 4 Mar 2011 at 10:55
True, the detokenizer does not observe the same rules as the tokenizer. But
this is the case with regular Moses tokenization/detokenization and I can't
recall any mentions of that on the Moses mailing list.
For now, I think mod_detokenizer.pl should just check whether the language is
cs|en|fr|it and, if not, call detokenizer.perl -l en ...
Original comment by Achi...@gmail.com
on 4 Mar 2011 at 3:29
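Done in r. 54:
# @l is assumed to hold the supported language codes (cs, en, fr, it).
# grep replaces the smartmatch operator (~~), which is deprecated in recent Perl releases.
if (!grep { $_ eq $lang } @l) {
    print STDERR "WARNING: mod_detokenizer can't work with language '$lang', falling back to 'en'\n";
    $lang = "en";
}
# Run the stock Moses detokenizer quietly (-q) on the temporary output file.
system("perl detokenizer.perl -q -l $lang < $tmpout");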
Original comment by xhu...@gmail.com
on 11 Mar 2011 at 9:39
Original comment by xhu...@gmail.com
on 11 Mar 2011 at 9:41
Original comment by Achi...@gmail.com
on 14 Mar 2011 at 2:53
Original issue reported on code.google.com by
Achi...@gmail.com
on 4 Mar 2011 at 3:07