Can not reach groundhog performance

critias commented 8 years ago

Hi, we are having a hard time to reproduce the results we got with GroundHog using Blocks. Given the exact same training data, vocabulary, test set and settings we are 3 Bleu points behind GroundHog on a German to English translation task. We tried many different setups and number of iterations, but we can't reach it.
The GroundHog translation costs also seem to have a higher correlation between good and bad sentences then blocks. e.g.:

"vielen Dank ." translated to "thank you ." a perfect translation and a common phrase which should have a low cost. GroundHog cost: 0.000250929 Blocks cost: 0.357417

"fliegende Katze ." is translated to "fly away , cat ." not wrong but kind of a strange/unusual sentence. GroundHog: 0.280177 Blocks cost: 0.267061

Blocks gives "thank you ." a higher cost to then "fly away , cat ." which seems strange to me. I take this as a hint that the problem is mainly related to the model and not to the search. The last comment here: https://github.com/kyunghyuncho/NMT/issues/21 seems to have the same issue. Has there been any progress on this? Any tips where the Blocks computation graph differs from the GroundHog graph (it's to large to just look at it an see a difference)? Or other hints what the problem could be?

Thanks,

orhanf commented 8 years ago

We were getting comparable scores for cs-en when the initial pr was made, around august so the issues in NMT repo might be outdated. iirc the there were fixes at beam-search which uses the generate computational graph (same one we generate samples).

Have you checked whether the cost computational graphs are generating the same cost or not (using the same batch and initial parameters)?

critias commented 8 years ago

Thanks for your fast response, we didn't try that yet. It's next on the list of things to try. Right now we are looking into something else, I let you know if we find something.

orhanf commented 8 years ago

Thanks, keep us posted

rizar commented 8 years ago

Henry Choi told me that he was able to reproduce English to French results with this implementation.

On 8 January 2016 at 15:32, Orhan Firat notifications@github.com wrote:

Thanks, keep us posted

— Reply to this email directly or view it on GitHub https://github.com/mila-udem/blocks-examples/issues/71#issuecomment-170115829 .

YilunLiu commented 8 years ago

@critias Hi, I am wondering did you reach the Groudhog performance. If you did, how did you reach that? I am trying the example as well and I cannot reach the performance.

critias commented 8 years ago

Hi, yes and no. We got roughly equal results on the validation set during training, but not after reloading the saved model. Since we changed the code base a little to reload and translate the model I guess the error is on our side. It's still kinda unclear and we have to look into this in more detail, but were busy with other things last week. Beside that we also try using orhanfs fork to see if his code to translate works better for us.

critias commented 8 years ago

It turned out the problem was on our side. We changed some minor parts of the code that caused a mismatch between the encoding used to create the vocabulary (just bytes) and the encoding used during training/translation (unicode). We are now able to reproduce the GroundHog results and even slightly surpassed it (0.4% Bleu). I'll close the issue. Thanks for your help and keep up the good work.

mila-iqia / blocks-examples

Can not reach groundhog performance #71