moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

Patching truecaser #206

Closed: alvations closed this 5 years ago

alvations commented 5 years ago

The truecaser perl scripts are improved by

alvations commented 5 years ago

Although the truecasing models are now identical across training runs, the order of the printed output is randomized by Perl's hash key ordering.

Should we standardize it and call a sort before printing?
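(A minimal sketch of the deterministic dump being proposed; the %counts hash and the tab-separated format below are illustrative, not the script's actual variables or output format:)

    use strict;
    use warnings;

    # Illustrative model hash: Perl randomizes hash iteration order per
    # process, so printing keys directly yields a different order each run.
    my %counts = ( "the" => 10, "The" => 3, "THE" => 1 );

    # Sorting the keys first makes the dumped model byte-identical across runs.
    for my $word ( sort keys %counts ) {
        print "$word\t$counts{$word}\n";
    }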

hieuhoang commented 5 years ago

cheers. it looks ok. I'm not set up to test it but you seem to know what you're doing

Will pull in a few days if no-one objects

hieuhoang commented 5 years ago

> Although the truecasing models are now identical across training runs, the order of the printed output is randomized by Perl's hash key ordering.
>
> Should we standardize it and call a sort before printing?

Up to you. I personally wouldn't, as what's needed in debugging output depends on what you're debugging. Also, you're responsible for your code, so less is better :)

alvations commented 5 years ago

One more thing I didn't try to patch is:

    my $line = $_[0];
    chomp($line);
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;

It's just stripping the leading and trailing whitespace, but I can't seem to find a cleaner way to do that without an extra import or without chaining the regexes into something even less readable, so I left it as it is.

If anyone has a better solution, please do suggest here =)
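(One common Perl idiom, offered as an untested sketch against the fragment above: merge the two substitutions into a single alternation, with no extra module needed:)

    my $line = $_[0];
    chomp($line);
    # Strip leading and trailing whitespace in one pass; behaves the same
    # as the two separate substitutions in the fragment above.
    $line =~ s/^\s+|\s+$//g;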

hieuhoang commented 5 years ago

cheers. I've pulled the changes but then decided to train a de-en system. Your code seems to produce a lower BLEU than the original truecaser.

Liling TC:
  dev2006: 26.90 (0.999) BLEU-c ; 28.20 (0.999) BLEU
  devtest2006: 27.00 (0.998) BLEU-c ; 28.40 (0.998) BLEU
  nc-dev2007: 21.30 (1.067) BLEU-c ; 23.30 (1.067) BLEU
  nc-devtest2007: 19.60 (1.106) BLEU-c ; 21.20 (1.106) BLEU
  test2006: 26.90 (0.998) BLEU-c ; 28.40 (0.998) BLEU
  avg: 50.24 BLEU

old TC:
  dev2006: 27.60 (1.000) BLEU-c ; 28.30 (1.000) BLEU
  devtest2006: 27.60 (0.999) BLEU-c ; 28.30 (0.999) BLEU
  nc-dev2007: 21.60 (1.073) BLEU-c ; 22.80 (1.073) BLEU
  nc-devtest2007: 20.00 (1.105) BLEU-c ; 20.90 (1.105) BLEU
  test2006: 27.60 (0.998) BLEU-c ; 28.50 (0.998) BLEU
  avg: 50.64 BLEU

Looking at the differences in the training data:

  1. old: FALCONE , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .
     Liling: Falcone , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .

  2. old: Vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .
     Liling: vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .

Do you have an explanation for the difference? Did you do similar training comparisons?

hieuhoang commented 5 years ago

@alvations Gonna revert soon unless you can figure out the reason for the lower BLEU

alvations commented 5 years ago

Hmmm, I haven't done training comparisons, but I did compare the outputs of models from both the original and the edited truecaser and they were the same. Maybe my dataset didn't have that many XML tags.

But I suspect that's why: train_truecaser.perl has a different split_xml() function that doesn't check for possible XML, while truecaser.perl does check when applying the model.

That might have caused the difference, but BLEU is the judge here, so it's better to keep them different.
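(To illustrate the asymmetry described above, a simplified sketch of what an XML-aware split does; this is not the actual split_xml() implementation from the scripts:)

    # Simplified sketch: separate XML-like tags from words so that only
    # the words are considered for casing; tags pass through unchanged.
    sub split_xml_sketch {
        my ($line) = @_;
        my ( @words, @markup );
        # split ' ' is Perl's whitespace split that also drops leading blanks
        for my $token ( split ' ', $line ) {
            if ( $token =~ /^<[^>]*>$/ ) {
                push @markup, $token;    # XML tag: leave untouched
            }
            else {
                push @words, $token;     # plain word: candidate for casing
            }
        }
        return ( \@words, \@markup );
    }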

alvations commented 5 years ago

Reverted the split_xml() change; the cased BLEU should go back up.