shuoyangd / tape4nmt

a ducttape workflow for neural machine translation

Sacrebleu #6

Closed mjpost closed 5 years ago

mjpost commented 5 years ago

This adds support for the sacrebleu goal, addressing #4.

This PR also:

Up next: sacrebleu can download all test sets, so we could get rid of the SGML stuff entirely. I am also going to add support for sockeye and sentencepiece.
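For reference, here is roughly the kind of invocation the sacrebleu goal wraps, and how it replaces the SGML files (exact flags depend on the installed sacrebleu version):

```bash
# Echo the plain-text source side of a test set (downloads it on first use),
# instead of keeping SGML files around
sacrebleu -t wmt17 -l en-de --echo src > newstest2017.en

# Score detokenized MT output against the same test set's reference
cat output.detok.de | sacrebleu -t wmt17 -l en-de
```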

mjpost commented 5 years ago

I merged in train because I am used to a setup where main.tconf is a template that you edit to set (a) the datasets and (b) the training parameters. It's nice having everything in one file; having to edit lots of different files can be a bit of a hassle.

I think we should move to having a common set of parameters like the above, which each supported MT system would then figure out how to implement.

So if you're okay with folding confs/train.tconf into main.tconf (like I've done here), maybe test it with fairseq and then merge it. Once this is on master, I'll add Sockeye.
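As a rough sketch of what I mean (the variable names here are hypothetical, not the actual ones in this PR):

```
global {
  # (a) the datasets
  train_data=/path/to/train
  dev_data=/path/to/dev

  # (b) training parameters, shared across toolkits
  train_dropout=0.2
  train_batch_size=4000
}
```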

shuoyangd commented 5 years ago

Just figuring out how each MT system would implement these common parameters is pretty difficult already. For example, Nematus has separate dropout parameters for the embedding, hidden, source, and target components, while fairseq and OpenNMT-py each have only one. As another example, fairseq also has more parameters for controlling linear projections than OpenNMT-py, e.g., decoder-embed-dim and decoder-out-embed-dim.
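To make the mismatch concrete, here is roughly how those knobs look on each toolkit's command line (flag names may vary by version):

```bash
# Nematus: one dropout knob per component
--dropout_embedding 0.2 --dropout_hidden 0.2 \
--dropout_source 0.1 --dropout_target 0.1

# fairseq and OpenNMT-py: a single dropout knob
--dropout 0.2

# fairseq additionally exposes projection sizes that OpenNMT-py does not
--decoder-embed-dim 512 --decoder-out-embed-dim 512
```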

I think trying to come up with a list that caters to all systems would either end up as a really long list, or people would find themselves constantly adding parameters to do what they want. Either way, it adds an extra, unnecessary layer of obscurity on top of the already complicated toolkits, which in my opinion should be avoided.

So in short I'm not quite convinced that folding in confs/train.tconf is a good idea.

mjpost commented 5 years ago

So do you want perfect separation of training parameters, even if they are redundant?

Also, how to name parameters is a separate question from where they should live. The reason I moved them to main.tconf is that currently, setting up an experiment requires editing three or four files. Even if we don't share training parameters across toolkits, it would be nice to have them in the main file for easy editing and documentation. If we keep them separate, we might also want to adopt a naming convention like train_TOOLKIT_*.
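With such a convention, the main file could still carry everything without pretending the parameters are shared; a hypothetical sketch:

```
global {
  # fairseq-specific
  train_fairseq_dropout=0.2

  # Nematus-specific
  train_nematus_dropout_hidden=0.2
  train_nematus_dropout_embedding=0.2
}
```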

shuoyangd commented 5 years ago

Yes, I would prioritize clarity (i.e., separation) even at the cost of redundancy.

If we were to keep separate files, the train_TOOLKIT_* naming convention would be nice, so the current train.tape should be renamed to train_fairseq.

I’m not sure whether having a single, ultra-long tconf is a good idea. If we do go that route, can we keep different groups of training parameters in separate blocks, so that we can easily pull in just the group we need?
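Something like the following is what I have in mind, assuming ducttape tolerates several global blocks (or at least comment-delimited groups) in one tconf; all names here are hypothetical:

```
global {
  # shared: data and preprocessing
  src=en
  trg=de
}

global {
  # fairseq-only training parameters
  train_fairseq_max_tokens=4000
}

global {
  # sockeye-only training parameters
  train_sockeye_embed_dropout=0.1
}
```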

mjpost commented 5 years ago

I think separation and redundancy are separate issues. We can keep the params in a separate file or in the main tconf, and we can try to give params the same names across toolkits or not.

I have pulled this out into separate files to get this merged, but will probably argue for merging them later. I have also added basic Sockeye support.

Right now, to choose your toolkit, you have to:

The test case passes with Sockeye; I don't have fairseq installed to test that side.
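Down the road, one way to turn the toolkit choice into a single switch could be a ducttape branch point; a hypothetical sketch:

```
global {
  # select the trainer by picking a branch (e.g., in a plan)
  trainer=(Toolkit: fairseq sockeye)
}
```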

shuoyangd commented 5 years ago

Alright. I'm open to revisiting this issue later. I'll test fairseq and have this merged.