raymondhs / fairseq-laser

My implementation of LASER architecture in Fairseq
MIT License

I run into an error when loading a previously trained model. #2

Closed ever4244 closed 4 years ago

ever4244 commented 4 years ago

Is this code still compatible with recent fairseq? I can train the model with it (but only with one GPU card; two cards cause an error). I also run into an error when loading a previously trained model, so I cannot produce embeddings.

I also get a warning saying that FairseqMultiModel is deprecated; is that the cause of the error?

raymondhs commented 4 years ago

Hi, I haven't tried it on the recent fairseq (and PyTorch). Could you let me know the error messages you are getting?

ever4244 commented 4 years ago

Hi, I haven't tried it on the recent fairseq (and PyTorch). Could you let me know the error messages you are getting?

Thanks for your reply.

Here is the script I used:

export CUDA_VISIBLE_DEVICES="1"
cat ./data-bin/valid.bpe.de-en_liwtest.de | python embed.py data-bin/iwslt17.de_fr.en.bpe16k/ \
  --task translation_laser --source-lang ${SRC} --target-lang en \
  --lang-pairs de-en,fr-en \
  --path checkpoints/laser_lstm5/checkpoint_best.pt \
  --buffer-size 2000 --batch-size 128 \
  --output-file iwslt17.test.${SRC}-en.${SRC}.enc \
  --user-dir $PWD/laser/

Here is the error message:

Namespace(beam=5, bpe=None, buffer_size=2000, cpu=False, criterion='cross_entropy', data='data-bin/iwslt17.de_fr.en.bpe16k/', dataset_impl=None, decoder_langtok=False, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, encoder_langtok=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', input='-', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, lang_pairs='de-en,fr-en', lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', output_file='iwslt17.test.de-en.de.enc', path='checkpoints/laser_lstm5/checkpoint_best.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, results_path=None, retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', spm_model=None, target_lang='en', task='translation_laser', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/laser/', warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 13880 types
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loading model(s) from checkpoints/laser_lstm5/checkpoint_best.pt
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
Traceback (most recent call last):
  File "embed.py", line 142, in <module>
    cli_main()
  File "embed.py", line 138, in cli_main
    main(args)
  File "embed.py", line 75, in main
    task=task,
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/checkpoint_utils.py", line 167, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/checkpoint_utils.py", line 186, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=True, args=args)
TypeError: load_state_dict() got an unexpected keyword argument 'args'

I checked the model folder and there are indeed trained models there. I also cannot resume training, because of the same model-loading issue.

The training script I used:

export CUDA_VISIBLE_DEVICES="1"
fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
  --max-epoch 17 \
  --task translation_laser --lang-pairs de-en,fr-en \
  --arch laser_lstm_artetxe \
  --encoder-num-layers 5 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 0.001 --lr-scheduler fixed \
  --weight-decay 0.0 --criterion cross_entropy \
  --save-dir checkpoints/laser_lstm5 \
  --update-freq 8 \
  --no-progress-bar --log-interval 50 \
  --user-dir $PWD/laser/ \
  --max-tokens 7584 \
  --ddp-backend=no_c10d
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.de
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid de-en 1080 examples
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.fr
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid fr-en 1210 examples
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
LaserLSTMModel(
  (models): ModuleDict(
    (de-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
    (fr-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
  )
)
| model laser_lstm_artetxe, criterion CrossEntropyCriterion
| num. model params: 98202296 (num. trained: 98202296)
| training on 1 GPUs
| max tokens per GPU = 7584 and max sentences per GPU = None
Traceback (most recent call last):
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/trainer.py", line 190, in load_checkpoint
    state["model"], strict=True, args=self.args
TypeError: load_state_dict() got an unexpected keyword argument 'args'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wei/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 333, in cli_main
    main(args)
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 70, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/checkpoint_utils.py", line 115, in load_checkpoint
    reset_meters=args.reset_meters,
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/trainer.py", line 199, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint checkpoints/laser_lstm5/checkpoint_last.pt; please ensure that the architectures match.

The script I used to train it on 2 GPUS:

export CUDA_VISIBLE_DEVICES="0,1"
fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
  --max-epoch 17 \
  --task translation_laser --lang-pairs de-en,fr-en \
  --arch laser_lstm_artetxe \
  --encoder-num-layers 5 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 0.001 --lr-scheduler fixed \
  --weight-decay 0.0 --criterion cross_entropy \
  --save-dir checkpoints/laser_lstm5_2gpu \
  --update-freq 8 \
  --no-progress-bar --log-interval 50 \
  --user-dir $PWD/laser/ \
  --max-tokens 7584 \
  --ddp-backend=no_c10d

The error messages: (I tried to increase shared memory but the same error happens anyway.)

| distributed init (rank 1): tcp://localhost:11532
| distributed init (rank 0): tcp://localhost:11532
| initialized host wei-desktop as rank 1
| initialized host wei-desktop as rank 0
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='laser_lstm_artetxe', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data-bin/iwslt17.de_fr.en.bpe16k/', dataset_impl=None, ddp_backend='no_c10d', decoder_dropout=0.1, decoder_embed_dim=320, decoder_hidden_dim=2048, decoder_langtok=False, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:11532', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, empty_cache_freq=0, encoder_dropout=0.1, encoder_embed_dim=320, encoder_hidden_dim=512, encoder_langtok=None, encoder_num_layers=5, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lang_embed_dim=32, lang_embeddings=True, lang_pairs='de-en,fr-en', lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.001], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=17, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=7584, max_tokens_valid=7584, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/laser_lstm5_2gpu', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_laser', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', update_freq=[8], upsample_primary=1, use_bmuf=False, user_dir='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/laser/', valid_subset='valid', validate_interval=1, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 13880 types
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.de
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid de-en 1080 examples
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.fr
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid fr-en 1210 examples
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
LaserLSTMModel(
  (models): ModuleDict(
    (de-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
    (fr-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
  )
)
| model laser_lstm_artetxe, criterion CrossEntropyCriterion
| num. model params: 98202296 (num. trained: 98202296)
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
| training on 2 GPUs
| max tokens per GPU = 7584 and max sentences per GPU = None
| no existing checkpoint found checkpoints/laser_lstm5_2gpu/checkpoint_last.pt
| loading train data for epoch 0
| loaded 209522 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.de-en.de
| loaded 209522 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.de-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ train de-en 209522 examples
| loaded 236653 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.fr-en.fr
| loaded 236653 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.fr-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ train fr-en 236653 examples
Traceback (most recent call last):
  File "/home/wei/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 329, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 296, in distributed_main
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 86, in main
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 126, in train
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/progress_bar.py", line 181, in __iter__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 314, in __next__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 43, in __next__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 36, in __iter__
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object </torch_31370_1262781511> in read-write mode

So in conclusion, I can currently train it on a single card, but cannot load the model.

Thank you very much for helping me review this issue!

raymondhs commented 4 years ago

So in conclusion, I can currently train it on a single card, but cannot load the model.

I think this could be a similar issue to https://github.com/microsoft/MASS/tree/master/MASS-summarization#other-questions. Instead of using --user-dir, you can copy the files in laser into the corresponding directories in fairseq.

I also cannot resume training, because of the same model-loading issue.

I just tried training and testing a model using the recent fairseq (0.9.0) and pytorch (1.3.1) and it seems to work fine. Did you train the model using a different fairseq version? In the older version, the load_state_dict method did not have the args keyword.

Could you try training a new model for a few steps (e.g. 1 epoch) and run embed.py to see if it works?

ever4244 commented 4 years ago

So in conclusion, I can currently train it on a single card, but cannot load the model.

I think this could be a similar issue to https://github.com/microsoft/MASS/tree/master/MASS-summarization#other-questions. Instead of using --user-dir, you can copy the files in laser into the corresponding directories in fairseq.

I also cannot resume training, because of the same model-loading issue.

I just tried training and testing a model using the recent fairseq (1.9.0) and pytorch (1.3.1) and it seems to work fine. Did you train the model using a different fairseq version? In the older version, the load_state_dict method did not have the args keyword.

Could you try training a new model for a few steps (e.g. 1 epoch) and run embed.py to see if it works?

Thanks!

' you can copy the files in laser into the corresponding directories in fairseq'

Do you mean I should copy the laser folder into the fairseq root folder, or copy the files to the root/fairseq/tasks/ folder? Or should I copy the model file into the models folder and the task file into the tasks folder?

Oh, I see what you mean.

But the multi-GPU issue is not the most urgent one. Do you think the copy-into-folder solution would solve the loading problem?

My pytorch version is 1.3.1 and my fairseq version is 0.9.0 (your fairseq is 1.9.0?)

Do you mean I should run 1 epoch with my current setup, or after I 'copy the files into the fairseq folders'?

raymondhs commented 4 years ago

Do you mean I should copy the laser folder into the fairseq root folder, or copy the files to the root/fairseq/tasks/ folder? Or should I copy the model file into the models folder and the task file into the tasks folder?

Copy each .py file in the laser folder to the corresponding data, models, and tasks directories in the original fairseq. And yes, you also need to edit the __init__.py to import the classes. Additionally, after that you may need to change the import statements at the top of each .py file to absolute imports (e.g. instead of from .translation_laser use from fairseq.tasks.translation_laser, etc.).
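
For example, roughly like this (a minimal sketch; the file and class names follow this thread, and the exact lines in the repo may differ):

# In fairseq/tasks/translation_laser.py, switch the relative import used with
# --user-dir to an absolute one:
# before: from .laser_dataset import LaserDataset
from fairseq.data.laser_dataset import LaserDataset

# In fairseq/models/laser_lstm.py:
# before: from .translation_laser import TranslationLaserTask
from fairseq.tasks.translation_laser import TranslationLaserTask

# And import the copied modules from the package __init__.py files so the task
# and model get registered, e.g. in fairseq/tasks/__init__.py:
from fairseq.tasks import translation_laser  # noqa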

My pytorch version is 1.3.1 and my fairseq version is 0.9.0 (your fairseq is 1.9.0?)

Sorry, that was a typo and it should be 0.9.0. Btw, I just pushed an update to the method signature of load_state_dict. Please pull and try again to see if it works now. :)
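
The idea behind the change is roughly the following (a minimal, self-contained sketch of the signature issue, not the exact commit; TinyModel is just a hypothetical stand-in for the model class):

import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in for the model class, illustrative only."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    # fairseq 0.9.0's checkpoint loading calls
    #   model.load_state_dict(state["model"], strict=True, args=args)
    # so an override that only accepts (state_dict, strict) raises the
    # "unexpected keyword argument 'args'" TypeError shown above.
    # Accepting (and here simply ignoring) args avoids it.
    def load_state_dict(self, state_dict, strict=True, args=None):
        return super().load_state_dict(state_dict, strict=strict)

model = TinyModel()
model.load_state_dict(model.state_dict(), strict=True, args=None)  # no TypeError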

Do you mean I should run 1 epoch with my current setup, or after I 'copy the files into the fairseq folders'?

I suggest trying it first on 1 GPU with the current settings (with --user-dir). Train for a short time until one checkpoint is saved, then run embed.py on it (to save time instead of waiting for many epochs). If it works, then you can try to fix the training on multi-GPU. :)

ever4244 commented 4 years ago

Do you mean I should copy the laser folder into the fairseq root folder, or copy the files to the root/fairseq/tasks/ folder? Or should I copy the model file into the models folder and the task file into the tasks folder?

Copy each .py file in the laser folder to the corresponding data, models, and tasks directories in the original fairseq. And yes, you also need to edit the __init__.py to import the classes. Additionally, after that you may need to change the import statements at the top of each .py file to absolute imports (e.g. instead of from .translation_laser use from fairseq.tasks.translation_laser, etc.).

My pytorch version is 1.3.1 and my fairseq version is 0.9.0 (your fairseq is 1.9.0?)

Sorry, that was a typo and it should be 0.9.0. Btw, I just pushed an update to the method signature of load_state_dict. Please pull and try again to see if it works now. :)

Do you mean I should run 1 epoch with my current setup, or after I 'copy the files into the fairseq folders'?

I suggest trying it first on 1 GPU with the current settings (with --user-dir). Train for a short time until one checkpoint is saved, then run embed.py on it (to save time instead of waiting for many epochs). If it works, then you can try to fix the training on multi-GPU. :)

Thank you very much! I can load the model and run the embedding now. About the multi-GPU issue: it seems that my error report is different from those in https://github.com/microsoft/MASS/tree/master/MASS-summarization#other-questions

I copied the task file to tasks, the data file to data, and the model file to models (but I didn't modify __init__.py, since fairseq already tries to load models under the models folder, so I don't need to change __init__.py once those three files are in their respective folders).

But I still get the error report:

| distributed init (rank 1): tcp://localhost:19735
| distributed init (rank 0): tcp://localhost:19735
| initialized host wei-desktop as rank 1
| initialized host wei-desktop as rank 0
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='laser_lstm_artetxe', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data-bin/iwslt17.de_fr.en.bpe16k/', dataset_impl=None, ddp_backend='no_c10d', decoder_dropout=0.1, decoder_embed_dim=320, decoder_hidden_dim=2048, decoder_langtok=False, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19735', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, empty_cache_freq=0, encoder_dropout=0.1, encoder_embed_dim=320, encoder_hidden_dim=512, encoder_langtok=None, encoder_num_layers=5, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lang_embed_dim=32, lang_embeddings=True, lang_pairs='de-en,fr-en', lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=50, lr=[0.001], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=17, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=7584, max_tokens_valid=7584, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/laser_lstm5_newcodetest', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_laser', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', update_freq=[8], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 13880 types
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.de
| loaded 1080 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.de-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid de-en 1080 examples
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.fr
| loaded 1210 examples from: data-bin/iwslt17.de_fr.en.bpe16k/valid.fr-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ valid fr-en 1210 examples
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
LaserLSTMModel(
  (models): ModuleDict(
    (de-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
    (fr-en): FairseqModel(
      (encoder): LaserLSTMEncoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (lstm): LSTM(320, 512, num_layers=5, dropout=0.1, bidirectional=True)
      )
      (decoder): LaserLSTMDecoder(
        (embed_tokens): Embedding(13880, 320, padding_idx=1)
        (embed_lang): Embedding(4, 32, padding_idx=0)
        (lstm): LSTM(1376, 2048)
        (sentemb_hidden_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (sentemb_cell_proj): Linear(in_features=1024, out_features=2048, bias=True)
        (fc_out): Linear(in_features=2048, out_features=13880, bias=True)
      )
    )
  )
)
| model laser_lstm_artetxe, criterion CrossEntropyCriterion
| num. model params: 98202296 (num. trained: 98202296)
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
| training on 2 GPUs
| max tokens per GPU = 7584 and max sentences per GPU = None
| loaded checkpoint checkpoints/laser_lstm5_newcodetest/checkpoint_last.pt (epoch 2 @ 198 updates)
| loading train data for epoch 2
| loaded 209522 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.de-en.de
| loaded 209522 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.de-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ train de-en 209522 examples
| loaded 236653 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.fr-en.fr
| loaded 236653 examples from: data-bin/iwslt17.de_fr.en.bpe16k/train.fr-en.en
| data-bin/iwslt17.de_fr.en.bpe16k/ train fr-en 236653 examples
Traceback (most recent call last):
  File "/home/wei/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 329, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 296, in distributed_main
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 86, in main
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq_cli/train.py", line 126, in train
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/progress_bar.py", line 181, in __iter__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 314, in __next__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 43, in __next__
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/iterators.py", line 36, in __iter__
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/wei/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
  File "/home/wei/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object </torch_3358_4229974347> in read-write mode

I can put all the files into the module folders and edit the __init__.py files; do you think that would work? What is your current folder path setting to run on multi-GPU?

raymondhs commented 4 years ago

Hmm, it sounds like a different issue. Were you able to train a fairseq model (from the original repo) on multi-GPU? e.g. https://github.com/pytorch/fairseq/tree/master/examples/translation#multilingual-translation

What is your current folder path setting to run on multi-GPU?

fairseq/fairseq/data/laser_dataset.py
fairseq/fairseq/models/laser_lstm.py
fairseq/fairseq/tasks/translation_laser.py

This works for me, I only modified the import statements.

ever4244 commented 4 years ago

Hmm, it sounds like a different issue. Were you able to train a fairseq model (from the original repo) on multi-GPU? e.g. https://github.com/pytorch/fairseq/tree/master/examples/translation#multilingual-translation

What is your current folder path setting to run on multi-GPU?

fairseq/fairseq/data/laser_dataset.py
fairseq/fairseq/models/laser_lstm.py
fairseq/fairseq/tasks/translation_laser.py

This works for me, I only modified the import statements.

I can, with this script using the multilingual_transformer_iwslt_de_en model architecture in fairseq:

export CUDA_VISIBLE_DEVICES="0,1"
fairseq-train $data_root \
    --max-epoch 50 \
    --ddp-backend=no_c10d \
    --task multilingual_translation --lang-pairs en-es,es-en,es-en \
    --arch multilingual_transformer_iwslt_de_en \
    --share-decoders --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt --min-lr '1e-09' \
    --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir checkpoints/multilingual_transformer_UNtest2GPU \
    --max-tokens 4000 \
    --update-freq 8
| model multilingual_transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 44762112 (num. trained: 44762112)
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
| training on 2 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/multilingual_transformer_UNtest2GPU/checkpoint_last.pt
| loading train data for epoch 0
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.en
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.es
| /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample train en-es 993 examples
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.es
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.en
| /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample train es-en 993 examples
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.es
| loaded 993 examples from: /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample/train.en-es.en
| /home/wei/LIWEI_workspace/fairseq_liweimod/LASERtrain/bucc/data/UNv1.0.bpe40k-bin_sample train es-en 993 examples
| epoch 001 | loss 18.560 | nll_loss 18.541 | ppl 381297 | wps 21202 | ups 1 | wpb 50291.500 | bsz 993.000 | num_updates 2 | lr 3.4995e-07 | gnorm 3.092 | clip 0.000 | oom 0.000 | wall 6 | train_wall 5 | en-es:loss 9.3252 | en-es:nll_loss 9.32067 | en-es:ntokens 26727.5 | en-es:nsentences 496.5 | en-es:sample_size 26727.5 | es-en:loss 9.23473 | es-en:nll_loss 9.21988 | es-en:ntokens 23564 | es-en:nsentences 496.5 | es-en:sample_size 23564

ever4244 commented 4 years ago

Hmm, it sounds like a different issue. Were you able to train a fairseq model (from the original repo) on multi-GPU? e.g. https://github.com/pytorch/fairseq/tree/master/examples/translation#multilingual-translation

What is your current folder path setting to run on multi-GPU?

fairseq/fairseq/data/laser_dataset.py
fairseq/fairseq/models/laser_lstm.py
fairseq/fairseq/tasks/translation_laser.py

This works for me, I only modified the import statements.

That is also my path setting. My import statements are modified as follows: from fairseq.data.laser_dataset import LaserDataset in translation_laser.py,

and from fairseq.tasks.translation_laser import TranslationLaserTask in laser_lstm.py.

I didn't change anything else. Since we both use fairseq 0.9.0 and pytorch 1.3.1, I can't figure out why this happens.

raymondhs commented 4 years ago

Yeah, the changes are the same as mine. Could you try a smaller batch size, like --max-tokens 3584? What is your GPU memory size?

ever4244 commented 4 years ago

Yeah, the changes are the same as mine. Could you try a smaller batch size, like --max-tokens 3584? What is your GPU memory size?

I tried 200 and it still does not work. My GPUs are two 1080 Ti cards with 11 GB each.

ever4244 commented 4 years ago

Have you set --num-workers 0?

The issue is gone when I set --num-workers 0, but I thought that would slow down training? For 1 card, 2 epochs take around 17 min 30 seconds. For 2 cards with --num-workers 0, 2 epochs take around 11 min.

Is that normal for you?

raymondhs commented 4 years ago

No, I didn't set --num-workers 0. The speed improvement is almost linear when I use multiple GPUs for training. This could be related to an issue in PyTorch or Fairseq.
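
For reference, here is a minimal illustration of what --num-workers controls in a standard PyTorch DataLoader (not code from this repo; the shared-memory handoff is what the reduce_storage call in the traceback above points at):

import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(64, 8))

# num_workers > 0 spawns loader worker processes, and tensors crossing the
# process boundary go through shared memory; if shared memory is unavailable
# or too small, that handoff can fail with errors like the one above.
# num_workers=0 keeps batching in the main process and avoids it, usually at
# the cost of data-loading throughput.
loader = DataLoader(data, batch_size=16, num_workers=0)
for (batch,) in loader:
    _ = batch.sum()  # batches are built in the main process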