moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

Patching truecaser #206

Closed: alvations closed this 5 years ago

alvations commented 5 years ago

The truecaser perl scripts are improved by

alvations commented 5 years ago

Although the truecasing models are now identical across training runs, the order of the printed output is randomized by Perl's hash key ordering.

Should we standardize it and call a sort before printing?
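(A minimal sketch of the deterministic dump being proposed; the %counts hash and the tab-separated format below are illustrative, not the script's actual variables or output format:)

    use strict;
    use warnings;

    # Illustrative model hash: Perl randomizes hash iteration order per
    # process, so printing keys directly yields a different order each run.
    my %counts = ( "the" => 10, "The" => 3, "THE" => 1 );

    # Sorting the keys first makes the dumped model byte-identical across runs.
    for my $word ( sort keys %counts ) {
        print "$word\t$counts{$word}\n";
    }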

hieuhoang commented 5 years ago

cheers. it looks ok. I'm not set up to test it but you seem to know what you're doing

Will pull in a few days if no-one objects

hieuhoang commented 5 years ago

> Although the truecasing models are now identical across training runs, the order of the printed output is randomized by Perl's hash key ordering.
>
> Should we standardize it and call a sort before printing?

Up to you. I personally wouldn't, as what's needed in debugging output depends on what you're debugging. Also, you're responsible for your code, so less is better :)

alvations commented 5 years ago

One more thing I didn't try to patch is:

    my $line = $_[0];
    chomp($line);
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;

It's just stripping the leading and trailing whitespace, but I can't seem to find a cleaner way to do that without an extra import or without chaining the regexes into something even less readable, so I left it as it is.

If anyone has a better solution, please do suggest here =)
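(One common Perl idiom, offered as an untested sketch against the fragment above: merge the two substitutions into a single alternation, with no extra module needed:)

    my $line = $_[0];
    chomp($line);
    # Strip leading and trailing whitespace in one pass; behaves the same
    # as the two separate substitutions in the fragment above.
    $line =~ s/^\s+|\s+$//g;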

hieuhoang commented 5 years ago

cheers. I've pulled the changes but then decided to train a de-en system. Your code seems to produce a lower BLEU than the original truecaser.

Liling TC:
  dev2006: 26.90 (0.999) BLEU-c ; 28.20 (0.999) BLEU
  devtest2006: 27.00 (0.998) BLEU-c ; 28.40 (0.998) BLEU
  nc-dev2007: 21.30 (1.067) BLEU-c ; 23.30 (1.067) BLEU
  nc-devtest2007: 19.60 (1.106) BLEU-c ; 21.20 (1.106) BLEU
  test2006: 26.90 (0.998) BLEU-c ; 28.40 (0.998) BLEU
  avg: 50.24 BLEU

old TC:
  dev2006: 27.60 (1.000) BLEU-c ; 28.30 (1.000) BLEU
  devtest2006: 27.60 (0.999) BLEU-c ; 28.30 (0.999) BLEU
  nc-dev2007: 21.60 (1.073) BLEU-c ; 22.80 (1.073) BLEU
  nc-devtest2007: 20.00 (1.105) BLEU-c ; 20.90 (1.105) BLEU
  test2006: 27.60 (0.998) BLEU-c ; 28.50 (0.998) BLEU
  avg: 50.64 BLEU

Looking at the differences in the training data:

  1. old: FALCONE , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .
     Liling: Falcone , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .

  2. old: Vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .
     Liling: vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .

Do you have an explanation for the difference? Did you do similar training comparisons?

hieuhoang commented 5 years ago

@alvations Gonna revert soon unless you can figure out the reason for the lower BLEU

alvations commented 5 years ago

Hmmm, I haven't done training comparisons, but I did compare the outputs of models from both the original and the edited truecaser and they were the same. Maybe my dataset didn't have that many XML tags.

But I suspect that's why: train_truecaser.perl has a different split_xml() function that doesn't check for possible XML, while truecaser.perl does check when applying the model.

That might have caused the difference, but BLEU is the judge here, so it's better to keep them different.
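(To illustrate the asymmetry described above, a simplified sketch of what an XML-aware split does; this is not the actual split_xml() implementation from the scripts:)

    # Simplified sketch: separate XML-like tags from words so that only
    # the words are considered for casing; tags pass through unchanged.
    sub split_xml_sketch {
        my ($line) = @_;
        my ( @words, @markup );
        # split ' ' is Perl's whitespace split that also drops leading blanks
        for my $token ( split ' ', $line ) {
            if ( $token =~ /^<[^>]*>$/ ) {
                push @markup, $token;    # XML tag: leave untouched
            }
            else {
                push @words, $token;     # plain word: candidate for casing
            }
        }
        return ( \@words, \@markup );
    }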

alvations commented 5 years ago

Reverted the split_xml() change; the cased BLEU should go back up.