Command to reproduce results on MuST-C failing

Chaitanya-git commented 4 years ago

Running the command provided in the readme to reproduce the results on MuST-C of the paper "Adapting Transformer to End-to-End Spoken Language Translation" results in the following error:

| distributed init (rank 1): tcp://localhost:18735
| distributed init (rank 0): tcp://localhost:18735
| distributed init (rank 2): tcp://localhost:18735
| distributed init (rank 3): tcp://localhost:18735
Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, arch='speechconvtransformer_big', attention_dropout=0.1, attn_2d=True, audio_input=True, bucket_cap_mb=150, clip_norm=20.0, criterion='label_smoothed_cross_entropy', data=['bin/'], ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, decoder_out_embed_dim=512, decoder_output_dim=512, device_id=0, distance_penalty='gauss', distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:18735', distributed_port=18736, distributed_rank=0, distributed_world_size=4, dropout=0.1, encoder_attention_heads=8, encoder_convolutions='[(64, 3, 3)] * 2', encoder_embed_dim=512, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_window=None, freeze_encoder=False, init_variance=1.0, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=100, max_sentences=8, max_sentences_valid=8, max_source_positions=1400, max_target_positions=300, max_tokens=12000, max_update=0, min_loss_scale=0.0001, min_lr=1e-08, momentum=0.99, no_attn_2d=False, no_cache_source=False, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, normalization_constant=1.0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.1, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='models', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=True, skip_invalid_size_inputs_valid_test=True, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=0.0003, warmup_updates=4000, weight_decay=0.0)
| [h5] dictionary: 4 types
| [de] dictionary: 192 types
| bin/ train 229703 examples
| bin/ valid 1423 examples
Exception ignored in: <function IndexedDataset.__del__ at 0x7f0de0de5790>
Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/indexed_dataset.py", line 85, in __del__
Traceback (most recent call last):
  File "../../train.py", line 365, in <module>
Exception ignored in: <function IndexedDataset.__del__ at 0x7f9f0b8f3790>
Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/indexed_dataset.py", line 85, in __del__
    def __del__(self):
KeyboardInterrupt: 
    multiprocessing_main(args)
    def __del__(self):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 42, in main
KeyboardInterrupt: 
    p.join()
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 84, in signal_handler
    raise Exception(msg)
Exception: 

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 48, in run
    single_process_main(args)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/train.py", line 53, in main
    dummy_batch = task.dataset('train').get_dummy_batch(args.max_tokens, max_positions)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/language_pair_dataset.py", line 221, in get_dummy_batch
    return self.collater([
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/language_pair_dataset.py", line 224, in <listcomp>
    'source': self.src_dict.dummy_sentence(src_len) if self.src_dict is not None else None,
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/dictionary.py", line 302, in dummy_sentence
    t = torch.Tensor(length).new_empty((length, self.audio_features)).uniform_(self.nspecial + 1, len(self))
RuntimeError: Expected a_in <= b_in to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

I've tried running the command with Python 3.5 and Python 3.8 and I get the same error both times I believe the error is caused because the parameters being passed to torch::nn::init::uniform_ are incorrect.

I tried fixing the error myself by changing self.nspecial + 1 to self.nspecial in the following line https://github.com/mattiadg/FBK-Fairseq-ST/blob/2d152404df1ffce944d6bc11f3fb8361fb4810f7/fairseq/data/dictionary.py#L302

Is this a valid fix?

Thanks in advance, Chaitanya

mattiadg commented 4 years ago

What version of pytorch are you using? Is the training script that causes the failure?

Chaitanya-git commented 4 years ago

I'm on pytorch 1.4.0. It is the training script that fails

bhaddow commented 4 years ago

I have the same problem. This is happening in AudioDictionary.dummysentence() where (I think) the code is creating a dummy audio segment, and trying to initialise it with `uniform(self.nspecial + 1, len(self)) . This gives a uniform distribution between 5 and 4, and so hits the assertion - note that the AudioDictionary only has the 4 special symbols. Since (presumably) the symbols don't make sense for the AudioDictionary, I replaced with_uniform(0,1)`, and it continues without encountering the error above -- albeit with many other unrelated warnings.

mattiadg commented 4 years ago

I am using it with pytorch 1.1.0 and I don't get that error. Would you mind trying it with this version? As of now, I don't know if there are other problems with pytorch 1.4, but thank you @bhaddow for this workaround. Can you let me know if the training is succesful?

bhaddow commented 4 years ago

Training is progressing and it is creating checkpoint models. The log just seems to be full of warnings though. Should I see validation scores?

I am using pytorch 1.4 too.

bhaddow commented 4 years ago

I found the validation scores. Doe this look normal?

| epoch 001 | valid on 'valid' subset | valid_loss 278.29 | valid_nll_loss 1.80495 | valid_ppl 3.49 | num_updates 1884
| epoch 002 | valid on 'valid' subset | valid_loss 281.825 | valid_nll_loss 1.83722 | valid_ppl 3.57 | num_updates 3768 | best 278.29
| epoch 003 | valid on 'valid' subset | valid_loss 274.323 | valid_nll_loss 1.75094 | valid_ppl 3.37 | num_updates 5652 | best 274.323

mattiadg commented 4 years ago

The validation scores don't look encouraging. Are you using a pretrained encoder?

bhaddow commented 4 years ago

I'm using the training command suggested here https://towardsdatascience.com/getting-started-with-end-to-end-speech-translation-3634c35a6561

mattiadg commented 4 years ago

Try to train it with ENglish target first, and then use this ASR model to pretrain the encoder, as explained in that blog.

Chaitanya-git commented 4 years ago

I have the same problem. This is happening in AudioDictionary.dummysentence() where (I think) the code is creating a dummy audio segment, and trying to initialise it with `uniform(self.nspecial + 1, len(self)) . This gives a uniform distribution between 5 and 4, and so hits the assertion - note that the AudioDictionary only has the 4 special symbols. Since (presumably) the symbols don't make sense for the AudioDictionary, I replaced with_uniform(0,1)`, and it continues without encountering the error above -- albeit with many other unrelated warnings.

This is what I thought too. The only difference is instead of trying to initialize the tesnor using uniform(0,1) I tried `uniform(self.nspecial, len(self))`

With that change, at the end of 100 epochs I get the following output:

| epoch 100 | valid on 'valid' subset | valid_loss 0.321441 | valid_nll_loss 0.00317835 | valid_ppl 1.00 | num_updates 53791 | best 0.3028

I did not use a pretrained encoder either

bhaddow commented 4 years ago

Thanks @mattiadg , I will look into it. Although I suspect I did something wrong in preprocessing, since it works OK for @Chaitanya-git without pre-training. Note that I am training en-es, but that shouldn't matter, right?

mattiadg commented 4 years ago

Actually, I'm not sure that the training of @Chaitanya-git was good. I've never seen a perplexity of 1 in a translation task. @Chaitanya-git can you please tell us how the tranlations look?

guillemcortes commented 4 years ago

Hi @mattiadg, I am trying with python 3.5.6 and pytorch 1.1.0 and I don't get the uniform_(self.nspecial + 1, len(self)) error but, I get some errors with python multiprocessing sempahores. Do you mind sharing which python and packages versions are you using? Thanks in advance. EDIT: I know that these semaphores errors can be related to memory problems, but I want to make sure I match your software requirements.

Chaitanya-git commented 4 years ago

Actually, I'm not sure that the training of @Chaitanya-git was good. I've never seen a perplexity of 1 in a translation task. @Chaitanya-git can you please tell us how the tranlations look?

You're right @mattiadg. The translations do look pretty bad. Most of the translations seem to be one word translations and I get a BLEU score of 0 using your guide. Although I initially thought I was using the script wrong, it does seem the translations are bad. I tried en-de translation BTW. And since I don't know German, I can't say for sure 😅

mattiadg commented 4 years ago

@guillemcortes I'm using python 3.6.4 with pytorch 1.1.0. Other packages: numpy 1.15.0, h5py 2.7.0, but I don't think that they really affect. I don't get errors or warnings during training. Only the warning that some segments were left out of tranining because too long.

@Chaitanya-git can you try with pytorch 1.1.0?

Chaitanya-git commented 4 years ago

@mattiadg , I ran the training script with pytorch 1.1.0 for a while and I was able to train for 7 epochs total. Here's what those results look like:

| epoch 007 | valid on 'valid' subset | valid_loss 257.965 | valid_nll_loss 1.62146 | valid_ppl 3.08 | num_updates 11578 | best 257.965

Does that look ok? Also, looks like the previous results I posted may be incorrect as I had made certain modifications to the criterion being used. With those modifications removed, the results look like this (with pytorch 1.4.0):

| epoch 077 | valid on 'valid' subset | valid_loss 247.787 | valid_nll_loss 1.48971 | valid_ppl 2.81 | num_updates 59281 | best 246.304

mattiadg commented 4 years ago

It now looks more normal to me

Il mer 15 apr 2020, 07:05 Chaitanya notifications@github.com ha scritto:

@mattiadg https://github.com/mattiadg , I ran the training script with pytorch 1.1.0 for a while and I was able to train for 7 epochs total. Here's what those results look like:

| epoch 007 | valid on 'valid' subset | valid_loss 257.965 | valid_nll_loss 1.62146 | valid_ppl 3.08 | num_updates 11578 | best 257.965

Does that look ok? Also, looks like the previous results I posted may be incorrect as I had made certain modifications to the criterion being used. With those modifications removed, the results look like this (with pytorch 1.4.0):

| epoch 077 | valid on 'valid' subset | valid_loss 247.787 | valid_nll_loss 1.48971 | valid_ppl 2.81 | num_updates 59281 | best 246.304

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mattiadg/FBK-Fairseq-ST/issues/4#issuecomment-613818308, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIX4CBJCFIDXI3QBUVLRMU6A5ANCNFSM4ME5H4JQ .

Chaitanya-git commented 4 years ago

So pytorch 1.4 also works with the original workaround? Edit: The translations seem just as bad with the pytorch 1.4 model as before. I guess I'll have to wait for the pytorch 1.1 model to finish training

mattiadg commented 4 years ago

Did you try to translate?

Il gio 16 apr 2020, 06:58 Chaitanya notifications@github.com ha scritto:

So pytorch 1.4 also works with the original workaround?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mattiadg/FBK-Fairseq-ST/issues/4#issuecomment-614416163, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIQLKF4FETZ7SLQCTL3RM2GANANCNFSM4ME5H4JQ .

Chaitanya-git commented 4 years ago

Yes, the same issue persists as before. The translations all very bad, consisting of the same output for all inputs

bhaddow commented 4 years ago

I am seeing the same issue with the translations. The output of the en-es system is nearly always one word (gracias). We will try the asr pretraining, but welcome any other suggestion.

Chaitanya-git commented 4 years ago

The issue persists with pytorch 1.1 as well. Is this expected without ASR pretraining? I'll try ASR pretraining as well and see how it goes

mattiadg commented 4 years ago

I'll run another training to be sure. From what you are saying, it looks like the data are not parallel, but this is strange.

Il gio 16 apr 2020, 14:25 Chaitanya notifications@github.com ha scritto:

The issue persists with pytorch 1.1 as well

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mattiadg/FBK-Fairseq-ST/issues/4#issuecomment-614620082, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIX2M7YQ5PLZS5LTWTDRM32MNANCNFSM4ME5H4JQ .

mattiadg commented 4 years ago

@Chaitanya-git @bhaddow how many gpus are you using for training? I think hat I forgot to mention it in the blog and wrote it only in the README, but those hyperparameters are for training with 4 GPUs. Sorry, if it wasn't clear before.

bhaddow commented 4 years ago

I was using a single GPU. If I increase update-freq, will that give me the equivalent effect to using more GPUs? (if slower). I could also increase the batch size, since training was only using around 6G of GPU memory.

mattiadg commented 4 years ago

I'm running a new training with 4 gpus on English target to check again that it works. I think that reaching a batch size of about 512 should be fine in any case, although using more gpus may be better.

Chaitanya-git commented 4 years ago

@mattiadg I have tried training with two and three GPUs. Could you tell me what parameters need to be changed as the number of GPUs change and how they vary with number of GPUs used?

mattiadg commented 4 years ago

My ASR training with 4 gpus is going normal. After 10 checkpoints the WER of the test set is 35%. Not exciting but definitely not random. I can show you my log:

` | distributed init (rank 0): tcp://localhost:10540 | distributed init (rank 3): tcp://localhost:10540 | distributed init (rank 2): tcp://localhost:10540 | distributed init (rank 1): tcp://localhost:10540 Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, arch='speechconvtransformer_big', attention_dropout=0.1, attn_2d=True, audio_input=True, bucket_cap_mb=150, clip_norm=20.0, criterion='label_smoothed_cross_entropy', data=['en-data/'], ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, decoder_out_embed_dim=512, decoder_output_dim=512, device_id=0, distance_penalty='log', distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:10540', distributed_port=10541, distributed_rank=0, distributed_world_size=4, dropout=0.1, encoder_attention_heads=8, encoder_convolutions='[(64, 3, 3)] * 2', encoder_embed_dim=512, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_window=None, freeze_encoder=False, init_variance=1.0, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.005], lr_scheduler='inverse_sqrt', lr_shrink=1.0, max_epoch=40, max_sentences=8, max_sentences_valid=8, max_source_positions=2000, max_target_positions=1000, max_tokens=12000, max_update=0, min_loss_scale=0.0001, min_lr=1e-08, momentum=0.99, no_attn_2d=False, no_cache_source=False, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, normalization_constant=1.0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.1, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='test-en-github/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=True, skip_invalid_size_inputs_valid_test=True, source_lang='npz', target_lang='en', task='translation', train_subset='train', update_freq=[16], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=0.0003, warmup_updates=4000, weight_decay=0.0)

I thought that you may observe a lot of "Thank you" or "gracias" or whatever translation if you look only at the first translations. Fairseq translates after sorting the input by length, shorter sentences first. Can you confirm that you observe one-word translations all over the translation file?

@Chaitanya-git I think that it should be quite equivalent as long as you keep a batch size of 512.

bhaddow commented 4 years ago

I checked the output of my first run, and it's gracias all the way down.

I am running again (without pretraining) using 2 GPUs and doubling the update-freq to 32. With a max-tokens of 12000, this gives a batch size of around 500 sentences. fairseq is using dynamic batching, right? On this run, the valdi_ppl looks healthier - it's at 1.8 after 26 epochs - whereas with the small batch size it hardly moved at all. @guillemcortes is testing the asr training

Chaitanya-git commented 4 years ago

I just trained from scratch again with 4 GPUs for around 7 epochs and now the translations seem to be much better. Earlier the translations were indeed one word translations for the entire file. However, fail to understand exactly how to control the batch size to be around 512. I understand max-tokens, update-freq and max-sentences affect the batch size, but I fail to understand how the final value of 512 is obtained. Could @mattiadg or @bhaddow clarify this for me? Thanks!

bhaddow commented 4 years ago

I estimated the average batch size by dividing the number of sentences by the number of updates per epoch.

mattiadg commented 4 years ago

It's roughly max sentences times update freq times number of gpus.

Il ven 17 apr 2020, 13:24 Barry Haddow notifications@github.com ha scritto:

I estimated the average batch size by dividing the number of sentences by the number of updates per epoch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mattiadg/FBK-Fairseq-ST/issues/4#issuecomment-615192572, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIWUEVNPMUC2KKECDJTRNA36JANCNFSM4ME5H4JQ .

Chaitanya-git commented 4 years ago

Alright, thanks!

Chaitanya-git commented 4 years ago

I trained a speechconvtransformer_paper model from scratch on 4 GPUs without ASR pretraining and I'm getting a BLEU score of only 0.33 after 80 epochs of training. Is that to be expected?

mattiadg commented 4 years ago

No, that is really strange. I'm running a training again on En-Es with and without pre-training. They started last night. With pre-training is much better, but tonight I can tell you something about the results.

mattiadg commented 4 years ago

Ok, @Chaitanya-git you were right. Also in my training the one without pre-training diverged early, while the other is going well

Chaitanya-git commented 4 years ago

Ok, so ASR pretraining isn't an optional step then as it previously seemed, as it seems to be the only way to get the model to converge.

mattiadg commented 4 years ago

It can converge also without pre training, but it requires another learning rate that needs to be found.

Il mar 21 apr 2020, 04:48 Chaitanya notifications@github.com ha scritto:

Ok, so ASR pretraining isn't an optional step then as it previously seemed, as it seems to be the only way to get the model to converge.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mattiadg/FBK-Fairseq-ST/issues/4#issuecomment-616918821, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIURU727K555A7U3RFTRNUCOXANCNFSM4ME5H4JQ .

Giuseppe-Della-Corte commented 4 years ago

Should ASR pre-training be given "translation" in the "--task" parameter? Thanks in advance.

mattiadg commented 4 years ago

Yes, the task is translation. Language model is for using the decoder only. (I saw that you mentioned it in Medium).

Giuseppe-Della-Corte commented 4 years ago

Many thanks @mattiadg

mingboma commented 4 years ago

It can converge also without pre training, but it requires another learning rate that needs to be found. Il mar 21 apr 2020, 04:48 Chaitanya notifications@github.com ha scritto: … Ok, so ASR pretraining isn't an optional step then as it previously seemed, as it seems to be the only way to get the model to converge. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7LDIURU727K555A7U3RFTRNUCOXANCNFSM4ME5H4JQ .

I also had the similar confusion about the pretrain. It seems that pretrain is a necessary step for this task, am I right about this? BTW, could you share the numbers for the translation quality without pretrain? Thanks!

zxshamson commented 3 years ago

Hi @mattiadg , do you mind sharing the training log (mainly the trend of train loss and valid loss) when training with and without ASR pretraining? And how many epochs do we need for convergence? I also cannot achieve reasonable results with the command.

mattiadg / FBK-Fairseq-ST

Command to reproduce results on MuST-C failing #4