SIGMORPHON 2020 Shared Task: Grapheme-to-Phoneme, Unsupervised Induction of Morphology, and Typologically Diverse Morphological Inflection

Results for baseline systems of Task 1 available somewhere? #6

Open · simon-clematide opened this issue 4 years ago

simon-clematide commented 4 years ago

Hi there, since the sweep scripts for the neural baselines explore a really large number of hyperparameter settings, wouldn't it make sense to save some energy and make the baseline results public?

kylebgorman commented 4 years ago

I agree, and we will shortly!

As we speak, we are improving them along a few dimensions: tightening the grid for some hyperparameters and expanding it for others.

I think these should be ready in the next week or so, about the same time the surprise languages are ready.

simon-clematide commented 4 years ago

That would be great, but probably a bit late for most of us once the surprise languages come out.

kylebgorman commented 4 years ago

We'll try to get them out as soon as possible. I have FST results just sitting here (I just need to post them), and I should be able to get at least the encoder-decoder ("LSTM") results up in the next few days. We're a bit behind on the tuning experiments (still refining them a bit) because I don't have access to my lab due to social distancing.

simon-clematide commented 4 years ago

I can share some of the things we computed. In our experience, the transformer is probably pretty strong. For one specific setup that we tested (see the original baseline scripts to interpret the hyperparameters encoded in the checkpoint names below), we got the following results:

checkpoints/arm-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 17.78 LER: 3.62
checkpoints/bul-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 30.67 LER: 7.01
checkpoints/fre-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 8.44 LER: 2.08
checkpoints/geo-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 28.44 LER: 6.04
checkpoints/gre-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 18.89 LER: 3.36
checkpoints/hin-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 8.44 LER: 2.32
checkpoints/hun-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 3.78 LER: 0.66
checkpoints/ice-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 11.56 LER: 2.86
checkpoints/kor-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 44.22 LER: 18.49
checkpoints/lit-256-1024-4-4-256-1024-4-4-0.3/checkpoint_best.pt WER: 22.67 LER: 4.63

Maybe this gives a hint about the relative difficulty of the different data sets.
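In case it helps to interpret these numbers: WER is the percentage of words whose predicted pronunciation differs from the reference in any way, and LER is an edit-distance-based error rate over the phoneme labels. Below is a minimal sketch of how such scores can be computed, assuming per-word phoneme sequences; the official evaluation script may normalize differently.

```python
# Rough sketch of WER/LER computation for G2P output; the official
# evaluation may normalize differently (this is an assumption).

def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer_ler(gold, hypo):
    """gold, hypo: parallel lists of phoneme sequences (one per word)."""
    word_errors = sum(g != h for g, h in zip(gold, hypo))
    label_errors = sum(edit_distance(g, h) for g, h in zip(gold, hypo))
    total_labels = sum(len(g) for g in gold)
    return 100 * word_errors / len(gold), 100 * label_errors / total_labels

# Example:
#   gold = [["k", "a", "t"], ["d", "o", "g"]]
#   hypo = [["k", "a", "t"], ["d", "o", "k"]]
#   wer_ler(gold, hypo)  ->  (50.0, ~16.7)
```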

kylebgorman commented 4 years ago

Thanks for sharing. We're getting slightly better numbers by tuning "smarter" (though not more), and those should be finalized in a few days.

besou commented 4 years ago

Thank you for the baseline results so far. The results are pretty strong, especially for the Enc-Dec baseline. Would you mind publishing the hyperparameter combinations of the most successful baseline models as well?

kylebgorman commented 4 years ago

Sure, I'll add that to the spreadsheet. I save the name of the checkpoint, from which you can derive the hyperparameters.
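In the meantime, the checkpoint directory names shared above seem to encode the swept hyperparameters, so a small parser can recover them. The field order assumed below (encoder embedding size, encoder hidden size, encoder layers, encoder attention heads, then the same four fields for the decoder, then dropout) is a guess; the baseline sweep script is the authoritative reference.

```python
# Hypothetical parser for checkpoint directory names like
# "arm-256-1024-4-4-256-1024-4-4-0.3". The field order is a guess based on
# the naming scheme; check the baseline sweep script for the actual order.

from typing import NamedTuple

class Config(NamedTuple):
    lang: str
    enc_embed: int
    enc_hidden: int
    enc_layers: int
    enc_heads: int
    dec_embed: int
    dec_hidden: int
    dec_layers: int
    dec_heads: int
    dropout: float

def parse_checkpoint_dir(name: str) -> Config:
    lang, *fields = name.split("-")
    *ints, dropout = fields
    return Config(lang, *map(int, ints), float(dropout))

# parse_checkpoint_dir("hun-256-1024-4-4-256-1024-4-4-0.3")
# -> Config(lang='hun', enc_embed=256, ..., dropout=0.3)
```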

kylebgorman commented 4 years ago

Hi @besou, I may have spoken too soon. I'm still running the final sweep (including results on test) and it won't finish for a few days, so I won't have these in time for you to act on them. (I am locked out of my lab with all the GPUs due to the pandemic and associated social distancing.)

There is quite a bit of variation in what works for a given language: some prefer small batches, some large; whether you want a "small" or a "large" encoder and/or decoder also varies; and nearly all prefer a moderate degree of dropout.
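For anyone setting up their own sweep along these dimensions, the grid might look something like the sketch below; the values are purely illustrative and not the actual grid used for the baselines.

```python
# Illustrative sketch of a hyperparameter grid along the dimensions
# mentioned above (batch size, encoder/decoder size, dropout).
# These values are assumptions, not the actual baseline grid.

import itertools

GRID = {
    "batch_size": [128, 256, 512, 1024],
    "encoder_size": ["small", "large"],  # e.g. embedding/hidden dims
    "decoder_size": ["small", "large"],
    "dropout": [0.1, 0.2, 0.3],
}

def configurations(grid):
    """Yield every combination in the grid as a dict."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# len(list(configurations(GRID))) == 4 * 2 * 2 * 3 == 48 runs per language
```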