For replication, the domain adaptation is not required; the filtered WMT data are available at http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/
I would expect the pseudo-in-domain data selection to have some scripts (probably in Perl) in the Moses code base. @ales-t was using it, so he probably knows more.
It would also be good to claim in the paper that we can replicate a state-of-the-art result, and to provide a script that proves it.
Yes, that was my motivation for opening this issue.
It can prove challenging, though. I fear the model they used will not fit into any of our GPUs' memory (I have a 6 GB Titan, the same size as the cards at UFAL).
About the in-domain selection: you need two language models compatible with SRILM, one trained on "guaranteed" in-domain data, the other trained on general/out-of-domain data. They should both have identical n-gram length and similar training data sizes. People also typically filter out singletons (replacing them with <unk>).
Once you have the LMs, you can run this simple tool to score your sentences:
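Purely as an illustration (this is not necessarily the tool meant above, and the file names are placeholders), the same idea, scoring sentences by the cross-entropy difference between the two LMs, can be sketched with the kenlm Python module:

```python
# Illustrative Moore-Lewis style scoring, assuming the `kenlm` Python module
# and two ARPA models trained as described above. File names are placeholders.
import kenlm

in_domain_lm = kenlm.Model("in_domain.arpa")  # LM trained on in-domain data
general_lm = kenlm.Model("general.arpa")      # LM trained on general data

def score(sentence):
    """Per-word cross-entropy under the in-domain LM minus the general LM;
    lower values mean the sentence looks more in-domain."""
    n_tokens = len(sentence.split()) + 1  # +1 for the </s> symbol
    h_in = -in_domain_lm.score(sentence, bos=True, eos=True) / n_tokens
    h_gen = -general_lm.score(sentence, bos=True, eos=True) / n_tokens
    return h_in - h_gen

with open("general_corpus.en") as corpus:
    ranked = sorted(corpus, key=lambda line: score(line.strip()))
# keep some top fraction of `ranked` as pseudo-in-domain training data
```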
Thanks. Do you think it is applicable when the only real in-domain data I have is a very small dev set (~2k sentences)? (The in-domain part of the training dataset I have is very noisy and probably wouldn't do much good.)
Also, given the size of the in-domain corpus, is it then better to train the general language model on a random 2k-sentence subset of the whole data rather than on a larger subset, so that the training data sizes match?
Hm, you can try it but I don't think this is enough data. Your OOV rate will be very high and generalization might be poor. It's hard to guess in advance though.
Ales
The attention decoder implementation now follows the equations in the paper (PR #77).
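For reference, the equations in question are the attention mechanism of Bahdanau et al. (2015): for the previous decoder state s_{i-1} and encoder states h_j,

```latex
e_{ij}      = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}
c_i         = \sum_j \alpha_{ij} h_j
```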
As for the domain adaptation and the tokenizer, I would just put a note (or possibly a script) in the experiment directory with instructions on how to obtain the filtered en-fr WMT task data (these are available) and how to tokenize them (see the tokenization sketch below).
The product of this issue will be a directory inside examples/ with an INI file and scripts/instructions on how to prepare the data.
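As a sketch of the tokenization step, assuming the Moses tokenizer via its sacremoses Python port (the thread does not prescribe a particular tokenizer, and the file names are made up):

```python
# Illustrative tokenization sketch using the sacremoses port of the Moses
# tokenizer; the actual example scripts may well call the original Perl one.
from sacremoses import MosesTokenizer

tokenizers = {"en": MosesTokenizer(lang="en"), "fr": MosesTokenizer(lang="fr")}

for lang in ("en", "fr"):
    with open(f"train.{lang}") as src, open(f"train.tok.{lang}", "w") as out:
        for line in src:
            out.write(tokenizers[lang].tokenize(line.strip(), return_str=True) + "\n")
```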
There is a pile of beam search issues here; otherwise this is already done.
Use neuralmonkey to replicate the scores of Bahdanau et al. (2015) on the English-to-French WMT-14 translation task.
This would mean doing the following things:
Anything else?