redpony / cdec

Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
http://cdec-decoder.org/
Apache License 2.0

Missing translations from k-best output #21

Closed kho closed 11 years ago

kho commented 11 years ago

Steps to reproduce the problem:

  1. Run cdec with the configuration and weights from the "australia" test in system tests, and use the LM at http://www.umiacs.umd.edu/~wuke/australia/lm.klm;
  2. Feed http://www.umiacs.umd.edu/~wuke/australia/input.txt as input;
  3. Ask for 10-best unique output:

cdec -c cdec.ini -w weights -F 'KLanguageModel lm.klm' -i input.txt -k 10 -r

Expected output:

0 ||| australia is one of the few countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735 PhraseModel_1=8.59148 PhraseModel_2=7.08265 LanguageModel=-19.3358 Glue=2 ||| -34.8916
0 ||| australia is one of the few countries have diplomatic relations with pyongyang . ||| PhraseModel_0=5.20941 PhraseModel_1=9.11981 PhraseModel_2=7.79929 LanguageModel=-19.0969 Glue=2 ||| -36.1151
0 ||| australia is one of few countries have diplomatic relations with north korea . ||| PhraseModel_0=5.96908 PhraseModel_1=9.10146 PhraseModel_2=6.75465 LanguageModel=-19.1314 Glue=2 ||| -36.3301
0 ||| australia is one of the few countries relations with north korea . ||| PhraseModel_0=4.60735 PhraseModel_1=10.631 PhraseModel_2=5.5937 LanguageModel=-20.5781 Glue=2 ||| -36.79
0 ||| australia is one of the few countries have diplomatic relations with north . ||| PhraseModel_0=5.77816 PhraseModel_1=8.59148 PhraseModel_2=6.42003 LanguageModel=-20.9716 Glue=2 ||| -37.3857
0 ||| australia is one of few countries have diplomatic relations with pyongyang . ||| PhraseModel_0=6.57114 PhraseModel_1=9.62979 PhraseModel_2=7.47129 LanguageModel=-18.8925 Glue=2 ||| -37.5537
0 ||| australia relations with north korea is one of the few countries . ||| PhraseModel_0=4.77471 PhraseModel_1=10.631 PhraseModel_2=5.5937 LanguageModel=-21.1795 Glue=3 ||| -37.5699
0 ||| australia is one of few countries with diplomatic relations with north korea . ||| PhraseModel_0=6.98959 PhraseModel_1=9.66157 PhraseModel_2=7.17157 LanguageModel=-18.7282 Glue=2 ||| -37.683
0 ||| australia 's relations with north korea is one of the few countries . ||| PhraseModel_0=6.4107 PhraseModel_1=10.631 PhraseModel_2=7.4594 LanguageModel=-18.5078 Glue=3 ||| -37.744
0 ||| australia is one of the few countries with diplomatic relations with north korea . ||| PhraseModel_0=7.13571 PhraseModel_1=9.15159 PhraseModel_2=7.49957 LanguageModel=-18.9326 Glue=2 ||| -37.8531
1 ||| 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| LanguageModel=-18.9326 Glue=2 PhraseModel_0=7.13571 PhraseModel_1=9.15159 PhraseModel_2=7.49957 ||| -37.8531
2 ||| 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| LanguageModel=-18.7282 Glue=2 PhraseModel_0=6.98959 PhraseModel_1=9.66157 PhraseModel_2=7.17157 ||| -37.683

Actual output:

0 ||| australia is one of the few countries have diplomatic relations with north korea . ||| LanguageModel=-19.3358 Glue=2 PhraseModel_0=4.60735 PhraseModel_1=8.59148 PhraseModel_2=7.08265 ||| -34.8916
0 ||| australia is one of the few countries have diplomatic relations with pyongyang . ||| LanguageModel=-19.0969 Glue=2 PhraseModel_0=5.20941 PhraseModel_1=9.11981 PhraseModel_2=7.79929 ||| -36.1151
0 ||| australia is one of few countries have diplomatic relations with north korea . ||| LanguageModel=-19.1314 Glue=2 PhraseModel_0=5.96908 PhraseModel_1=9.10146 PhraseModel_2=6.75465 ||| -36.3301
0 ||| australia is one of the few countries relations with north korea . ||| LanguageModel=-20.5781 Glue=2 PhraseModel_0=4.60735 PhraseModel_1=10.631 PhraseModel_2=5.5937 ||| -36.79
0 ||| australia is one of the few countries have diplomatic relations with north . ||| LanguageModel=-20.9716 Glue=2 PhraseModel_0=5.77816 PhraseModel_1=8.59148 PhraseModel_2=6.42003 ||| -37.3857
0 ||| australia is one of few countries have diplomatic relations with pyongyang . ||| LanguageModel=-18.8925 Glue=2 PhraseModel_0=6.57114 PhraseModel_1=9.62979 PhraseModel_2=7.47129 ||| -37.5537
0 ||| australia relations with north korea is one of the few countries . ||| LanguageModel=-21.1795 Glue=3 PhraseModel_0=4.77471 PhraseModel_1=10.631 PhraseModel_2=5.5937 ||| -37.5699
0 ||| australia 's relations with north korea is one of the few countries . ||| LanguageModel=-18.5078 Glue=3 PhraseModel_0=6.4107 PhraseModel_1=10.631 PhraseModel_2=7.4594 ||| -37.744
0 ||| australia is a state of relations with north korea . ||| LanguageModel=-16.1582 Glue=2 PhraseModel_0=7.87126 PhraseModel_1=13.8172 PhraseModel_2=4.95146 ||| -37.8703
0 ||| australia was one of the few countries have diplomatic relations with north korea . ||| LanguageModel=-20.1128 Glue=3 PhraseModel_0=6.17237 PhraseModel_1=8.82984 PhraseModel_2=7.76171 ||| -37.9181
1 ||| 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| LanguageModel=-18.9326 Glue=2 PhraseModel_0=7.13571 PhraseModel_1=9.15159 PhraseModel_2=7.49957 ||| -37.8531
2 ||| 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| LanguageModel=-18.7282 Glue=2 PhraseModel_0=6.98959 PhraseModel_1=9.66157 PhraseModel_2=7.17157 ||| -37.683

What's different

The 10th-best translation from cdec has a score of -37.9181, but two translations with higher scores do not show up, namely:

0 ||| australia is one of few countries with diplomatic relations with north korea . ||| PhraseModel_0=6.98959 PhraseModel_1=9.66157 PhraseModel_2=7.17157 LanguageModel=-18.7282 Glue=2 ||| -37.683
0 ||| australia is one of the few countries with diplomatic relations with north korea . ||| PhraseModel_0=7.13571 PhraseModel_1=9.15159 PhraseModel_2=7.49957 LanguageModel=-18.9326 Glue=2 ||| -37.8531

As the output of forced decoding (segments 1 and 2) shows, both are reachable in the LM-pruned forest.
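For anyone who wants to repeat this comparison on their own lists, here is a rough sketch in Python. It assumes the `id ||| text ||| features ||| score` line format shown above; the function names `parse_kbest_line` and `missing_hypotheses` are made up for this example and are not part of cdec. It reports reference hypotheses that outscore the worst entry of a k-best list but are absent from it.

```python
def parse_kbest_line(line):
    """Split 'id ||| text ||| features ||| score' into its four fields."""
    seg_id, text, feats, score = [field.strip() for field in line.split("|||")]
    feat_map = {}
    for pair in feats.split():
        name, value = pair.rsplit("=", 1)
        feat_map[name] = float(value)
    return int(seg_id), text, feat_map, float(score)


def missing_hypotheses(reference_lines, kbest_lines):
    """Return (text, score) for reference entries that beat the worst
    k-best score yet do not appear in the k-best list."""
    kbest = [parse_kbest_line(line) for line in kbest_lines]
    seen = {text for _, text, _, _ in kbest}
    worst = min(score for _, _, _, score in kbest)
    missing = []
    for line in reference_lines:
        _, text, _, score = parse_kbest_line(line)
        if text not in seen and score > worst:
            missing.append((text, score))
    return missing


if __name__ == "__main__":
    # Two lines copied from this report: the 10th entry of the actual output
    # and one of the translations that should have appeared.
    actual_tail = ["0 ||| australia was one of the few countries have diplomatic relations with north korea . ||| LanguageModel=-20.1128 Glue=3 PhraseModel_0=6.17237 PhraseModel_1=8.82984 PhraseModel_2=7.76171 ||| -37.9181"]
    expected = ["0 ||| australia is one of few countries with diplomatic relations with north korea . ||| PhraseModel_0=6.98959 PhraseModel_1=9.66157 PhraseModel_2=7.17157 LanguageModel=-18.7282 Glue=2 ||| -37.683"]
    print(missing_hypotheses(expected, actual_tail))
```

In practice the reference side would come from forced decoding or from a much larger k-best list, since a hypothesis can only be flagged if something else has already produced it.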

redpony commented 11 years ago

Thanks for the heads up about this. I'm looking into it.

redpony commented 11 years ago

Found the problem. The code was not adding "derivation successors" to the vertex's priority queue for derivations that rederived a string that had already been produced. Such derivations should not be returned, but their successors must still be added to the queue; otherwise part of the search space can be missed, as was happening here.
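To illustrate the failure mode, here is a toy sketch in Python. It is not cdec's code: it assumes a single binary hyperedge whose two tails each carry a best-first candidate list, and the names `unique_kbest` and `expand_duplicates` are invented for this example. Setting `expand_duplicates=False` mimics the bug by dropping the successors of a derivation whose string was already returned.

```python
import heapq


def unique_kbest(left, right, k, expand_duplicates):
    """Lazily enumerate up to k unique strings from one binary hyperedge.

    left, right: candidate lists [(string, score)], sorted best first.
    expand_duplicates: if False, successors of duplicate-string derivations
    are never queued (the buggy behaviour).
    """
    def score(i, j):
        return left[i][1] + right[j][1]

    queue = [(-score(0, 0), 0, 0)]  # max-heap via negated scores
    queued = {(0, 0)}               # index pairs already pushed
    seen_strings = set()
    results = []
    while queue and len(results) < k:
        neg_score, i, j = heapq.heappop(queue)
        text = left[i][0] + " " + right[j][0]
        is_new = text not in seen_strings
        if is_new:
            seen_strings.add(text)
            results.append((text, -neg_score))
        # A duplicate is never returned, but its successors may still lead
        # to strings that no other derivation reaches.
        if is_new or expand_duplicates:
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < len(left) and nj < len(right) and (ni, nj) not in queued:
                    queued.add((ni, nj))
                    heapq.heappush(queue, (-score(ni, nj), ni, nj))
    return results


if __name__ == "__main__":
    left = [("a", 0.0), ("a b", -0.1)]
    right = [("c", 0.0), ("b c", -0.5), ("d", -1.0)]
    print([t for t, _ in unique_kbest(left, right, 4, expand_duplicates=False)])
    print([t for t, _ in unique_kbest(left, right, 4, expand_duplicates=True)])
```

In this toy input, "a" + "b c" repeats the string already produced by "a b" + "c". The only path to "a d" runs through that duplicate, so the buggy variant never queues it and returns the lower-scoring "a b d" in its place, much like the missing translations reported above.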

Extracting all k-best lists with bug | wc -l: 1948
Extracting all k-best lists and sort -u | wc -l: 5242
Extracting all k-best lists with fix | wc -l: 5242