lvapeab / m4loc

Automatically exported from code.google.com/p/m4loc
GNU Lesser General Public License v3.0

mod_detokenizer: Moses detokenizer only accepts cs|en|fr|it as input languages #12

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. Produce (or edit) a tokenized target-language file, e.g. 
languagetool.xlf.ins.fr-fr
2. Run "perl mod_detokenizer.pl -l fr-fr < languagetool.xlf.ins.fr-fr"

Result:
Error: "No built-in rules for language fr-fr, claim en for default behaviour. 
at ./detokenizer.perl line 37."

Expected:
Automatic fallback to en. The modified detokenizer (mod_detokenizer.pl) should 
check the language and, if it is not supported, fall back to English (en). 
Ideally the detokenizer would do that already, but we are picking up the script 
unmodified from Moses.

Original issue reported on code.google.com by Achi...@gmail.com on 4 Mar 2011 at 3:07

GoogleCodeExporter commented 9 years ago
The bad thing is that detokenizer.perl does not use the rules from the 
nonbreaking_prefixes directory; it has some rules hard-coded inside itself.
The question is whether quality suffers when tokenizer.perl tokenizes 
segments one way and the detokenizer does not follow the same scheme ...

If detokenizer.perl does not use the same standard rules as tokenizer.perl, I 
don't see a reasonable way to fix this bug. I don't want to modify 
detokenizer.perl, since that would create a special development branch of 
Moses ...

Original comment by xhu...@gmail.com on 4 Mar 2011 at 10:55

GoogleCodeExporter commented 9 years ago
True, the detokenizer does not observe the same rules as the tokenizer. But 
this is the case with regular Moses tokenization/detokenization and I can't 
recall any mentions of that on the Moses mailing list.

For now I think mod_detokenizer.pl should just check whether the language is 
cs|en|fr|it and, if not, call detokenizer.perl -l en ...

Original comment by Achi...@gmail.com on 4 Mar 2011 at 3:29

GoogleCodeExporter commented 9 years ago
Done in r54:

if (!($lang ~~ @l)) {
    print STDERR "WARNING: mod_detokenizer can't work with language: '$lang',falling back to 'en'\n";
    $lang = "en";
}
system("perl detokenizer.perl -q -l $lang < $tmpout");
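
The snippet above relies on the smartmatch operator (~~), which has been marked experimental since Perl 5.18 and warns on newer interpreters. A minimal sketch of the same check using a hash lookup follows. The names %SUPPORTED and detok_lang are hypothetical and not part of r54; the language list is taken from the issue title.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Languages with built-in rules in the Moses detokenizer.perl,
# as listed in the issue title (assumption: the list is complete).
my %SUPPORTED = map { $_ => 1 } qw(cs en fr it);

# Hypothetical helper: returns the code to pass to "detokenizer.perl -l".
# Unsupported or undefined codes produce a warning and fall back to 'en'.
sub detok_lang {
    my ($lang) = @_;
    return $lang if defined $lang && $SUPPORTED{$lang};
    my $shown = defined $lang ? $lang : '';
    print STDERR "WARNING: mod_detokenizer can't work with language: "
               . "'$shown', falling back to 'en'\n";
    return 'en';
}

print detok_lang('fr'), "\n";     # supported code is passed through
print detok_lang('fr-fr'), "\n";  # unsupported code falls back (warning on STDERR)
```

The result could then replace $lang in the existing system() call. Note that 'fr-fr' falls back to 'en' here, matching the behaviour requested in this issue.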

Original comment by xhu...@gmail.com on 11 Mar 2011 at 9:39

GoogleCodeExporter commented 9 years ago

Original comment by xhu...@gmail.com on 11 Mar 2011 at 9:41

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 14 Mar 2011 at 2:53