gregtatum opened this issue 1 month ago
I suggest using a different language pair for this experiment. en-ru was trained from a super convoluted branch, "release_no_priors", where I had to change the graph by adding an extra step for alignments in order to apply some bug fixes without retraining everything from scratch. It's far behind main and doesn't have the latest W&B fixes, so I don't want to run any more experiments from the "release"-based branches. If we switch to main, the graph will not be compatible, so we'd have to at least rerun the alignments step and reuse some of the other tasks via "existing_tasks". With all that, it's a lot easier to run some other language pair we struggle with from main, where we can reuse the tasks from release, for example en-lt.
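For reference, a rough sketch of what the reuse could look like in the training config; the label names and task IDs below are just placeholders, and the exact shape of `existing_tasks` may differ from this:

```yaml
# Hypothetical sketch: map task labels from the finished release-based run to
# their task IDs so the new graph (from main) can pick them up instead of
# recomputing them. Labels and IDs are placeholders, not from a real run.
# The alignments step itself would still be rerun, since the graph changed there.
existing_tasks:
  dataset-opus-en-lt: "Abc123ExampleTaskId"
  corpus-clean-parallel-en-lt: "Def456ExampleTaskId"
```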
On another note, this looks like a hyperparameter search that we can do manually, but there are tools to automate it that we might explore in the future.
Ok, en-lt sounds like a great choice. I read a bit more on it and it's got a lot of qualitative feedback in #756.
Lithuanian has a similar use of declensions: https://en.wikipedia.org/wiki/Lithuanian_declension
I've got the first one started, and will wait until it gets to the student step before kicking off the rest:
https://firefox-ci-tc.services.mozilla.com/tasks/groups/Wxvkl1ruQkCj6URG6oIuuQ
The configs are each done on a commit-by-commit level: https://github.com/mozilla/translations/commits/dev-en-lt-decoder-size/
They are all in student training now on the dashboard; they're the ones named decoder-*.
So I misread the paper a bit when it was talking about decoders, and the ffn and embedding size affect both decoder and encoder equally. The decoder depth is the only parameter changed that affects the decoder. I'm updating my experiment notes accordingly.
I think `transformer-dim-ffn` applies to both. But since the decoder is an `ssru`, which is a recurrent network, the params that apply to the decoder are the `s2s` ones. So I guess to change the feed-forward size of the decoder, `dim-rnn` has to be used?
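For context, here's a sketch of where these knobs sit as student overrides in the training config. The keys are the usual Marian options, but treat the exact layout and values as assumptions; whether `dim-rnn` is actually the right knob for the SSRU feed-forward size is exactly the open question above.

```yaml
# Sketch of the student architecture knobs discussed above, expressed as
# marian-args overrides in the training config (values are the usual "tiny"
# student settings; exact keys/placement are assumptions).
marian-args:
  training-student:
    enc-depth: 6
    dec-depth: 2                      # the only knob here that is decoder-only
    dim-emb: 256                      # embedding size, shared by encoder and decoder
    transformer-dim-ffn: 1536         # transformer feed-forward size; the SSRU decoder may not read it
    transformer-decoder-autoreg: rnn  # decoder runs as an autoregressive recurrent cell...
    dec-cell: ssru                    # ...using an SSRU
    # dim-rnn: 1536                   # possibly the SSRU feed-forward size? (the open question)
```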
In *Ludicrously Fast Neural Machine Translation*, they test a variety of decoder configurations for faster models.
In #174 @eu9ene showed that a larger model helps improve the COMET score for `en-ru` by +2.9, which is pretty significant. (Edit: I changed from en-ru to en-lt.)

I'd like to test the parameters a bit more, as these changes are impactful in terms of quality but also affect the performance of the model. The paper tested parameters on `en-de`, but our training of `en-ru` has struggled to gain the same amount of COMET with the same architecture. Rather than testing `en-ru`, I'll do a clean run on `en-lt`, as it had a pretty low COMET score and also features much more varied morphology due to its declension system. The idea is that the results will scale to other Balto-Slavic languages.

I'm shortening the labels in the table a bit:
| COMET | Δ | dec-depth | dim-emb | dim-ffn |
|-------|-------|-----------|---------|---------|
| 86.67 |       | 2 | 256 | 1536 |
| 88.78 | +2.11 | 2 | 512 | 2048 |
|       |       | 3 | 256 | 1536 |
|       |       | 6 | 256 | 1536 |
|       |       | 2 | 256 | 2048 |
| 88.47 | +1.80 | 2 | 512 | 1536 |
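As a concrete example, the second row of the table (dec-depth 2, dim-emb 512, dim-ffn 2048) would roughly correspond to student overrides like this sketch, with the same assumed keys as in the earlier snippet:

```yaml
# Hypothetical overrides for the "dec-depth 2, dim-emb 512, dim-ffn 2048" run.
marian-args:
  training-student:
    dec-depth: 2
    dim-emb: 512
    transformer-dim-ffn: 2048
```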
Links