nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Updated MT branch #979

Open jeswan opened 4 years ago

jeswan commented 4 years ago

Issue by Kelina Thursday Dec 19, 2019 at 01:50 GMT Originally opened as https://github.com/nyu-mll/jiant/pull/979


Introducing a recurrent sequence-to-sequence model for MT.
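For readers unfamiliar with the architecture, a recurrent attention-based sequence-to-sequence MT model of this kind generally looks like the sketch below. This is only an illustrative PyTorch outline with made-up class names and dimensions (and no input-feeding), not the code added in this PR:

```python
# Illustrative sketch of a recurrent seq2seq MT model with Luong-style
# dot-product attention -- NOT the code added in this PR, just the general shape.
import torch
import torch.nn as nn


class Seq2SeqMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(d_emb, 2 * d_hid, batch_first=True)
        self.out = nn.Linear(4 * d_hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence.
        enc_states, _ = self.encoder(self.src_emb(src_ids))          # (B, S, 2H)
        # Decode with teacher forcing on the gold target prefix.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids))          # (B, T, 2H)
        # Dot-product attention from decoder states over encoder states.
        scores = torch.bmm(dec_states, enc_states.transpose(1, 2))   # (B, T, S)
        context = torch.bmm(scores.softmax(dim=-1), enc_states)      # (B, T, 2H)
        # Predict the next target token from [decoder state; context].
        return self.out(torch.cat([dec_states, context], dim=-1))    # (B, T, V)


# Tiny smoke test with random token ids.
model = Seq2SeqMT(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 120, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 120])
```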


Kelina included the following code: https://github.com/nyu-mll/jiant/pull/979/commits

jeswan commented 4 years ago

Comment by pep8speaks Thursday Dec 19, 2019 at 01:50 GMT


Hello @Kelina! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1099:23: E702 multiple statements on one line (semicolon)

Line 170:43: E231 missing whitespace after ','

Line 2725:99: W291 trailing whitespace

Line 854:101: E501 line too long (108 > 100 characters)

You can repair most issues by installing black and running: `black -l 100 ./*`. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.
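For readers who don't have the PEP 8 codes memorized, the flagged issue types look roughly like this (illustrative Python, not the actual lines from the diff):

```python
# Illustrative examples of the flagged codes (not the actual diff lines).
x = 1; y = 2     # E702: multiple statements on one line (semicolon)
pair = (1,2)     # E231: missing whitespace after ','

# After running `black -l 100`, the same lines become:
x = 1
y = 2
pair = (1, 2)
```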

Comment last updated at 2020-02-16 01:33:22 UTC
jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Dec 19, 2019 at 02:33 GMT


Thanks! What results/sanity checks do you have so far?

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Dec 19, 2019 at 02:35 GMT


And more generally, what's the state of MT/seq2seq support right now? What's verified to work well, what should work, and are there any possible configurations to avoid?

jeswan commented 4 years ago

Comment by Kelina Thursday Dec 19, 2019 at 15:36 GMT


> Thanks! What results/sanity checks do you have so far?

I trained one model with similar hyperparameters to https://arxiv.org/pdf/1508.04025.pdf. The new model obtained 21.12 BLEU on the validation set, so it's roughly comparable to the results reported there.
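For reference, this kind of validation-set BLEU number can be reproduced outside of jiant with sacrebleu. This is only a hedged sketch: the scorer and tokenization used inside jiant may differ, so numbers computed this way are only roughly comparable.

```python
# Hedged sketch: scoring decoded validation output with sacrebleu.
import sacrebleu

hypotheses = ["the cat sat on the mat .", "a dog barked ."]    # model outputs
references = ["the cat sat on the mat .", "the dog barked ."]  # gold targets

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```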

jeswan commented 4 years ago

Comment by Kelina Thursday Dec 19, 2019 at 15:39 GMT


> And more generally, what's the state of MT/seq2seq support right now? What's verified to work well, what should work, and are there any possible configurations to avoid?

For now, morphological segmentation works as set up in its config file, and MT works as set up in its config file as well; no other configurations have been tried.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Dec 20, 2019 at 15:08 GMT


Could you say more about the exact comparison you're making for WMT? Is it test vs. test? What's the target number you're trying to match?

jeswan commented 4 years ago

Comment by sleepinyourhat Tuesday Jan 07, 2020 at 14:34 GMT


Still work in progress, right?

jeswan commented 4 years ago

Comment by Kelina Sunday Jan 12, 2020 at 22:10 GMT


Still pending from before:

  • Could you say more about the exact comparison you're making for WMT? Test set vs. test set? Dev set vs. dev set? What's the exact target number you're trying to match, how close are you, and how does our setup differ?
  • If transformers (like RoBERTa) are supported, give an example config (or at least some instructions); if not, add an assert/error and open an issue saying what we'd need to do to add support. We can't start our experiments without this.

I compared the development set results to the reported (test set) results. We are slightly better, but our hyperparameters also differ slightly. From above: "I trained one model with similar hyperparameters to https://arxiv.org/pdf/1508.04025.pdf. The new model obtained 21.12 BLEU on the validation set, so it's roughly comparable to the results reported there."

The code runs with RoBERTa, but the results in my one-run experiment were worse (12.4 BLEU on dev); I am unsure if that's because of the hyperparameters. I added the config file I used for that.

jeswan commented 4 years ago

Comment by Kelina Sunday Jan 12, 2020 at 22:12 GMT


> Still work in progress, right?

The remaining open points can probably be resolved better and faster by either @pruksmhc or @HaokunLiu. I tagged them above in the respective comments.

jeswan commented 4 years ago

Comment by pruksmhc Sunday Feb 02, 2020 at 14:57 GMT


@Kelina where did you get the data? @sleepinyourhat perhaps you know as well? It seems like there are train/test/val splits out there, but the preprocessing code for the seq2seq tasks here doesn't seem to handle those splits.

jeswan commented 4 years ago

Comment by pruksmhc Tuesday Feb 04, 2020 at 20:21 GMT


A number we should aim to at least reach for RoBERTa is 27.14 BLEU (from https://openreview.net/pdf?id=Hyl7ygStwB), which is the number reported for using BERT to initialize the encoder of an NMT model. According to that same paper, though, using BERT as an encoder actually drops performance (even compared to the original Transformer NMT model). I don't think anyone has published results for RoBERTa for NMT yet (?). @sleepinyourhat I'm running experiments on RoBERTa now, but it seems like it's going to take a while. EDIT: Actually, I don't think anyone has tried BERT -> LSTM decoder directly. The paper I mentioned above used a Transformer architecture for the decoder.
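To make the setup being discussed concrete, a pretrained-encoder-plus-LSTM-decoder model (the "BERT -> LSTM decoder" combination mentioned in the EDIT) could be wired up roughly as follows. This is a hedged sketch using Hugging Face transformers with arbitrary dimensions and made-up class names, not jiant's actual seq2seq implementation:

```python
# Hedged sketch of the configuration discussed above: a pretrained RoBERTa
# encoder feeding an LSTM decoder with dot-product attention.
# Not jiant's actual implementation.
import torch
import torch.nn as nn
from transformers import RobertaModel


class RobertaEncoderLSTMDecoder(nn.Module):
    def __init__(self, tgt_vocab, d_hid=768):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")  # hidden size 768
        self.tgt_emb = nn.Embedding(tgt_vocab, d_hid)
        self.decoder = nn.LSTM(d_hid, d_hid, batch_first=True)
        self.out = nn.Linear(2 * d_hid, tgt_vocab)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Contextual source representations from the pretrained encoder.
        enc = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        # Recurrent decoding with teacher forcing on the gold target prefix.
        dec, _ = self.decoder(self.tgt_emb(tgt_ids))
        # Dot-product attention from decoder states over encoder states.
        attn = torch.bmm(dec, enc.transpose(1, 2)).softmax(dim=-1)
        context = torch.bmm(attn, enc)
        return self.out(torch.cat([dec, context], dim=-1))
```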

jeswan commented 4 years ago

Comment by jeswan Thursday Sep 17, 2020 at 16:10 GMT


Are these changes still necessary? Planning to close all PRs to move jiant2 to this repo in the near future.