alvations closed this 5 years ago
Although the truecasing models are now the same across training runs, the order in which the output is printed is still randomized by Perl's hash ordering.
Should we standardize it and call a sort before printing?
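For reference, a minimal sketch of what sorting before printing could look like. The `%casing` data, the `model_lines` name, and the output format are made up for illustration and are not taken from the actual scripts:

```perl
use strict;
use warnings;

# Hypothetical casing counts; in the real script these would come from the corpus.
my %casing = (
    'the'    => { 'the' => 120, 'The' => 30 },
    'kosovo' => { 'Kosovo' => 12 },
    'und'    => { 'und' => 95, 'Und' => 4 },
);

# Build one output line per lowercased word, most frequent casing first.
# Sorting the keys makes the order independent of Perl's hash ordering.
sub model_lines {
    my ($counts) = @_;
    my @lines;
    foreach my $word (sort keys %$counts) {
        my $forms = $counts->{$word};
        # Descending count, ties broken alphabetically so the order is fully determined.
        my @ordered = sort { $forms->{$b} <=> $forms->{$a} || $a cmp $b } keys %$forms;
        push @lines, join(' ', map { "$_ ($forms->{$_})" } @ordered);
    }
    return @lines;
}

print "$_\n" for model_lines(\%casing);
```

The tie-break on `$a cmp $b` matters: sorting by count alone would still leave equal-count forms in an arbitrary order.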
cheers. it looks ok. I'm not set up to test it but you seem to know what you're doing
Will pull in a few days if no-one objects
> Although the truecasing models are now the same across training runs, the order in which the output is printed is still randomized by Perl's hash ordering.
> Should we standardize it and call a sort before printing?
Up to you. I personally wouldn't as what's needed in debugging output depends on what you're debugging. Also, you're responsible for your code so less is better :)
One more thing I didn't try to patch is:
```perl
my $line = $_[0];
chomp($line);
$line =~ s/^\s+//;
$line =~ s/\s+$//;
```
It's just stripping the leading and trailing whitespace, but I can't seem to find a cleaner way to do that without an extra import or chaining the regexes into something even more unreadable, so I left it as it is.
If anyone has a better solution, please do suggest here =)
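For comparison, here is a sketch of the chained single-regex form mentioned above, wrapped in a hypothetical `trim` helper (the name is not from the scripts). Whether it is actually cleaner is debatable; the Perl FAQ covers the same trade-off and the two-substitution version is a perfectly reasonable choice:

```perl
use strict;
use warnings;

# Hypothetical helper; strips leading and trailing whitespace from a copy of the input.
sub trim {
    my ($line) = @_;
    # Alternation handles both ends; /g is required so the second match is applied too.
    # \s also matches the trailing newline, so a separate chomp is not needed here.
    $line =~ s/^\s+|\s+$//g;
    return $line;
}

print '[', trim("  das serbische Volk \n"), "]\n";
```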
Cheers. I've pulled the changes, but then decided to train a de-en system. Your code seems to produce lower BLEU than the original truecaser.
| Test set | Liling TC BLEU-c | Liling TC BLEU | old TC BLEU-c | old TC BLEU |
|---|---|---|---|---|
| dev2006 | 26.90 (0.999) | 28.20 (0.999) | 27.60 (1.000) | 28.30 (1.000) |
| devtest2006 | 27.00 (0.998) | 28.40 (0.998) | 27.60 (0.999) | 28.30 (0.999) |
| nc-dev2007 | 21.30 (1.067) | 23.30 (1.067) | 21.60 (1.073) | 22.80 (1.073) |
| nc-devtest2007 | 19.60 (1.106) | 21.20 (1.106) | 20.00 (1.105) | 20.90 (1.105) |
| test2006 | 26.90 (0.998) | 28.40 (0.998) | 27.60 (0.998) | 28.50 (0.998) |
| avg | | 50.24 | | 50.64 |
Looking at the differences in the training data:
old: FALCONE , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .
Liling: Falcone , Attali , J. C. Mitterrand , Sulitzer und andere , gegen die in der gleichen Sache ermittelt wurde , wurden vorläufig festgenommen , unter Maßnahmen der richterlichen Aufsicht gestellt oder gegen Kaution freigelassen .

old: Vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .
Liling: vergeßt nicht das serbische Volk , das bis heute im Kosovo geblieben ist .
Do you have an explanation for the difference? Did you do similar training comparisons?
@alvations Gonna revert soon unless you can figure out the reason for the lower BLEU
Hmmm, I haven't done training comparisons, but I did compare the results of models from both the original and the edited truecaser, and they are the same. Maybe my dataset didn't have that much XML.
But I suspect that's why `train-truecaser.perl` has a different `split_xml()` function that doesn't check for possible XML, whereas `truecaser.perl`, which applies the model, does.
That might have caused the difference, but the BLEU scores settle it: better to keep the two functions different.
Reverted the `split_xml()` change; the cased BLEU should go back to where it was.
The truecaser Perl scripts are improved by:

- standardizing the `split_xml()` function for both training and applying (i.e. `train-truecaser.perl` and `truecaser.perl`);
- using the built-in `ucfirst`, which can capitalize the first character of a string without awkward regexes.

Also, the comments are corrected to say that the function capitalizes the first character of a string; previously they described it as uppercasing without specifying what was being uppercased.
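To illustrate the `ucfirst` point, a small sketch comparing the built-in with a regex-based version. The regex shown is only an example of the style being replaced, not a quote from the original scripts, and both helper names are hypothetical:

```perl
use strict;
use warnings;

# Illustrative regex-based version: uppercase only the first character.
sub capitalize_with_regex {
    my ($word) = @_;
    $word =~ s/^(.)/\U$1/;
    return $word;
}

# Built-in equivalent; the rest of the string is left unchanged.
sub capitalize_with_builtin {
    my ($word) = @_;
    return ucfirst($word);
}

print capitalize_with_regex('falcone'), ' ', capitalize_with_builtin('falcone'), "\n";
```

Note that `ucfirst` does not lowercase the remainder, so an all-caps token such as `FALCONE` comes back unchanged.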