ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Replicate NMT scores of Bahdanau et al. (2015) #68

Closed: jindrahelcl closed this issue 7 years ago

jindrahelcl commented 8 years ago

Use Neural Monkey to replicate the scores of Bahdanau et al. (2015) on the English-to-French WMT-14 translation task.

This would mean doing the following:

Anything else?

jindrahelcl commented 8 years ago

For replication, the domain adaptation is not required; the filtered WMT data are available at http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/

jlibovicky commented 8 years ago

I would expect there to be some scripts for the pseudo-in-domain data selection (probably in Perl) in the Moses code base. @ales-t was using it, he probably knows more.

It would also be good to be able to claim in the paper that we can replicate a state-of-the-art result, and to provide a script that proves it.

jindrahelcl commented 8 years ago

> It would also be good to be able to claim in the paper that we can replicate a state-of-the-art result, and to provide a script that proves it.

Yes, that was my motivation for opening this issue.

jindrahelcl commented 8 years ago

It can prove challenging, though. I fear the model they used will not fit into any of our GPUs' memory (I have a 6 GB Titan, the same size as the cards at UFAL).

ales-t commented 8 years ago

About the in-domain selection: you need two language models compatible with SRILM, one trained on "guaranteed" in-domain data, the other trained on general/out-of-domain data. They should both have the same n-gram order and similar training data sizes. People also typically filter out singletons (replacing them with `<unk>`) when doing this.

Once you have the LMs, you can run this simple tool to score your sentences:

https://redmine.ms.mff.cuni.cz/projects/ufal-smt-playground/repository/revisions/master/show/playground/tools/lmppl
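A minimal sketch of this cross-entropy-difference (Moore-Lewis) selection, assuming the kenlm Python bindings in place of SRILM and the lmppl tool above; the file names and the 20 % cut-off are illustrative only:

```python
# Sketch: rank general-domain sentences by how in-domain they look.
# Assumes the `kenlm` Python bindings; LM file names are hypothetical.
import kenlm

in_domain_lm = kenlm.Model("in_domain.arpa")   # LM trained on in-domain data
general_lm = kenlm.Model("general.arpa")       # LM trained on general data

def cross_entropy_diff(sentence: str) -> float:
    """Per-word cross-entropy difference; lower means more in-domain-like."""
    n_words = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    h_in = -in_domain_lm.score(sentence, bos=True, eos=True) / n_words
    h_gen = -general_lm.score(sentence, bos=True, eos=True) / n_words
    return h_in - h_gen

with open("train.en") as f:
    sentences = [line.strip() for line in f]

# Keep the sentences that look most in-domain, e.g. the best 20 %.
ranked = sorted(sentences, key=cross_entropy_diff)
selected = ranked[: len(ranked) // 5]
```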


jindrahelcl commented 8 years ago

Thanks. Do you think it is applicable when the only real in-domain data I have is a very small dev set (~2k sentences)? (The in-domain part of the training dataset I have is very noisy and probably wouldn't do much good.)

Also, given the size of the in-domain corpus, is it then better to train the general language model on a random 2k-sentence subset of the whole data, so that the training data sizes match, rather than on a larger sample?

ales-t commented 8 years ago

Hm, you can try it but I don't think this is enough data. Your OOV rate will be very high and generalization might be poor. It's hard to guess in advance though.

Ales


jindrahelcl commented 8 years ago

The attention decoder implementation now follows the equations in the paper (PR #77).
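Concretely, the alignment scores, attention weights, and context vector from Bahdanau et al. (2015) that the decoder now mirrors are

$$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad c_i = \sum_j \alpha_{ij} h_j,$$

where $s_{i-1}$ is the previous decoder state and $h_j$ are the encoder annotations.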

jindrahelcl commented 8 years ago

As for the domain adaptation and the tokenizer, I would just put a note (or possibly a script) in the experiment directory with instructions on how to obtain the filtered data for the English-French WMT task (these are available) and how to tokenize the data.
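As a rough illustration of the tokenization step, here is a sketch using the sacremoses port of the Moses tokenizer; the actual instructions could just as well point to Moses' tokenizer.perl, and the file names are placeholders:

```python
# Sketch: tokenize the English and French sides before training.
# Assumes the `sacremoses` package; input/output paths are illustrative.
from sacremoses import MosesTokenizer

tok_en = MosesTokenizer(lang="en")
tok_fr = MosesTokenizer(lang="fr")

with open("train.en") as src, open("train.tok.en", "w") as out:
    for line in src:
        out.write(tok_en.tokenize(line.strip(), return_str=True) + "\n")

with open("train.fr") as src, open("train.tok.fr", "w") as out:
    for line in src:
        out.write(tok_fr.tokenize(line.strip(), return_str=True) + "\n")
```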

The product of this issue will be a directory inside examples/ with an INI file and scripts/instructions on how to prepare the data.

jindrahelcl commented 7 years ago

There is a pile of issues about beam search here; apart from that, this is already done.