grammatical / baselines-emnlp2016

Baseline models, training scripts, and instructions for reproducing the results of our state-of-the-art grammar correction system from M. Junczys-Dowmunt, R. Grundkiewicz: Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction, EMNLP 2016.

Suspicious casing while reproducing the conll14 results #6


shiman commented 5 years ago

Hi,

I want to reproduce the same (or at least very similar) m2 scores on the official conll14 test set. Following the README file, I successfully set up the environment and got results with the following command:

python2 models/run_gecsmt.py \
    -f models/moses.dense-cclm.mert.avg.ini \
    -w reproduce/ \
    -i conll14st-test/noalt/official-2014.combined.m2 \
    --m2 \
    -o reproduce/conll.out \
    --moses $PWD/build/mosesdecoder \
    --lazy $PWD/build/lazy \
    --scripts $PWD/train/scripts

The output file was supposed to be almost (if not exactly) the same as your submission, and so should the m2 scores be. However, I only got the following m2 scores:

Precision : 0.5977 Recall : 0.2794 F_0.5 : 0.4868

while the reported F0.5 is 0.4893, which is what I was expecting.
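
As a sanity check, the F_0.5 value is at least internally consistent with the precision and recall above, assuming the standard F_beta formula that the M2 scorer uses:

    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    P, R, beta = 0.5977, 0.2794, 0.5
    print(round((1 + beta**2) * P * R / (beta**2 * P + R), 4))  # 0.4868

So the gap to 0.4893 must come from the outputs themselves, not from the scoring step.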

I vimdiffed my output against yours and found that my output contained a few casing mistakes while yours didn't. For example, in the middle part of sentence 333, my output was:

... doctors to disclose information To Patients Relatives.It challenges The Confidentiality and privacy principles.Currently , under the Health Insurance Portability and ...

The tokens "To Patients Relatives" and "The Confidentiality" look suspicious: their first letters are capitalized even though they are lowercase in the original input. Your output, by contrast, looks fine.

I dug a little into the script models/run_gecsmt.py and realized that something may be going wrong in the recasing phase. More specifically, at line 78:

https://github.com/grammatical/baselines-emnlp2016/blob/fbdb0e761c2eb110d736912c7327ec59021383bc/models/run_gecsmt.py#L77-L81

It looks like we are recasing the (tokenized) output using the (untokenized) raw input and the alignment file. I suspect this is incorrect because the alignment is based on the tokenized files, so we should do something like this instead:

{scripts}/impose_case.perl {pfx}.in.tok {pfx}.out.tok.aln
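
To illustrate what I think should happen, here is a minimal Python sketch of alignment-based case transfer (transfer_case is a hypothetical helper for illustration, not the actual impose_case.perl logic):

    # Hypothetical sketch of alignment-based case transfer; the real
    # impose_case.perl may differ in details.
    def transfer_case(cased_src, out_tokens, alignment):
        # alignment: (src_idx, out_idx) pairs produced by the decoder
        result = list(out_tokens)
        for src_idx, out_idx in alignment:
            src, out = cased_src[src_idx], result[out_idx]
            if src.lower() == out.lower():
                result[out_idx] = src  # token unchanged: reuse source casing
            elif src[:1].isupper():
                result[out_idx] = out[:1].upper() + out[1:]
        return result

    src = "Keeping the Secret of Genetic Testing".split()
    out = "keeping the secret of genetic testing".split()
    print(" ".join(transfer_case(src, out, [(i, i) for i in range(len(src))])))
    # Keeping the Secret of Genetic Testing

The point is that the indices in the alignment file only make sense against the tokenized text, so the cased reference passed to the script must use the same tokenization.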

I did try passing the tokenized input. While it fixed the casing for the example above, now all sentence-initial letters are lowercased too.

This has me totally confused. What is actually going wrong, and how can I get the expected outputs and scores? Could you shed some light?

For your reference, I also attached my output and logs here.

run.log conll.out.txt

emjotde commented 5 years ago

Hm, I seem to remember that uppercasing the first letter of each sentence was part of the pipeline. @snukky is currently travelling, but will probably be able to take a look soon.

In the meantime try to apply this script to your output: https://github.com/marian-nmt/moses-scripts/blob/master/scripts/recaser/detruecase.perl
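
Roughly speaking, detruecasing just uppercases the first alphabetic token of every sentence. A simplified Python sketch of the idea (the real detruecase.perl handles more edge cases, such as leading punctuation and sentence-internal delimiters):

    # Simplified detruecasing: uppercase the first alphabetic token of
    # each line; detruecase.perl itself covers more edge cases.
    import sys

    for line in sys.stdin:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok[:1].isalpha():
                tokens[i] = tok[:1].upper() + tok[1:]
                break
        print(" ".join(tokens))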

shiman commented 5 years ago

Thanks for the prompt response.

I tried restoring casing by aligning the output with the tokenized input and adding detruecasing to the pipeline:

    # restore casing and tokenization
    run_cmd("cat {pfx}.out.tok"
            # transfer case from the tokenized input via the word alignment
            " | {scripts}/impose_case.perl {pfx}.in.tok {pfx}.out.tok.aln"
            " | {moses}/scripts/tokenizer/deescape-special-chars.perl"
            # re-apply the tokenization of the original (untokenized) input
            " | {scripts}/impose_tok.perl {pfx}.in"
            # uppercase sentence-initial words
            " | {moses}/scripts/recaser/detruecase.perl"
            " > {pfx}.out"
        .format(pfx=prefix, scripts=args.scripts, moses=args.moses))

but the score is even worse:

Precision : 0.5876 Recall : 0.2800 F_0.5 : 0.4818

Comparing the results against yours, the differences are still mostly about casing. While the original run_gecsmt.py script looks suspicious (because of the incorrect uppercasing), the idea of restoring casing by aligning with the tokenized input (as proposed in the previous post) doesn't seem right either, because the tokenized input ({pfx}.in.tok) is itself over-lowercased. For example, the first line of the test set:

Keeping the Secret of Genetic Testing

was completely lowercased into:

keeping the secret of genetic testing

in the {pfx}.in.tok file. So it is impossible to recover the original case by aligning with it.

Thanks for your help anyway. Looking forward to getting some hints from @snukky.

snukky commented 5 years ago

I seem to remember that uppercasing the first letter of each sentence was part of the pipeline.

It was done using that custom Perl script, not the Moses scripts, as we used lazy with an LM for truecasing.
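
For context, truecasing normalizes each token to its most likely "true" casing (so sentence-initial capitalization doesn't fragment the vocabulary), and detruecasing later restores the sentence-initial uppercase. A simple frequency-based Python sketch of the general idea; our pipeline used lazy with an LM rather than this:

    # Illustration only: frequency-based truecasing (map each token to
    # its most frequent surface form in a corpus). Our actual pipeline
    # used lazy with an LM instead.
    from collections import Counter, defaultdict

    def train_truecaser(corpus_lines):
        counts = defaultdict(Counter)
        for line in corpus_lines:
            for tok in line.split():
                counts[tok.lower()][tok] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def truecase(line, model):
        return " ".join(model.get(t.lower(), t) for t in line.split())

    model = train_truecaser(["the Health Insurance Portability act",
                             "the principles are clear ."])
    print(truecase("The Health insurance principles", model))
    # the Health Insurance principles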

@shiman I'm not sure where the differences come from, but the script run_gecsmt.py was added later and was not used to generate the outputs, so there may be some inconsistencies. Nonetheless, the detruecasing there seems to be the same as in the original training pipeline (https://github.com/grammatical/baselines-emnlp2016/blob/master/train/run_cross.perl#L675), which the provided outputs come from.

I'll check again when I get back home.