linzehui / mRASP


Arch type #1

Closed · Bachstelze closed this issue 3 years ago

Bachstelze commented 4 years ago

I can't load the pretrained 32-lang-pairs-RAS-ckp model with the tagged fairseq version 0.9.0:

| model transformer_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 243313664 (num. trained: 243313664)
| training on 1 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = None
Traceback (most recent call last):
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/trainer.py", line 194, in load_checkpoint
    self.get_model().load_state_dict(
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/models/fairseq_model.py", line 71, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
    size mismatch for encoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).
    size mismatch for decoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/kalle/Sprachdaten/mRASP/train_environment/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/train.py", line 333, in cli_main
    main(args)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/train.py", line 70, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 115, in load_checkpoint
    extra_state = trainer.load_checkpoint(
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/trainer.py", line 202, in load_checkpoint
    raise Exception(
Exception: Cannot load model parameters from checkpoint /media/kalle/Sprachdaten/mRASP/checkpoint_best.pt; please ensure that the architectures match.

The checkpoint itself reports its architecture as transformer_vaswani_wmt_en_de_big. Have there been changes to the architecture? Shouldn't the architectures be compatible, according to https://github.com/pytorch/fairseq/issues/2664?
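
One way to check what the checkpoint itself expects is to read the training arguments stored inside it. This is only a sketch: it assumes the checkpoint keeps its configuration under the `args` key, as fairseq 0.9.0-era checkpoints usually do, and uses the path from the log above:

```sh
# Print the architecture and position limits saved in the checkpoint.
# 'args' is the argparse Namespace fairseq stores alongside the weights.
python -c "
import torch
ckpt = torch.load('/media/kalle/Sprachdaten/mRASP/checkpoint_best.pt', map_location='cpu')
print(ckpt['args'].arch)
print(ckpt['args'].max_source_positions, ckpt['args'].max_target_positions)
"
```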

Thanks for your promising work!

PANXiao1994 commented 4 years ago

Hi,

Note these lines in the log above:

size mismatch for encoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).
size mismatch for decoder.embed_positions.weight: copying a param with shape torch.Size([302, 1024]) from checkpoint, the shape in current model is torch.Size([258, 1024]).

This means you should set `--max-source-positions 300 --max-target-positions 300` during training (the checkpoint's 302 position slots correspond to a maximum of 300 plus the two slots fairseq reserves for padding).
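
For example, a minimal sketch of such a training command; the data directory and remaining hyperparameters are placeholders, only the two position flags matter here:

```sh
# Placeholder data dir and hyperparameters; the key point is that
# --max-*-positions must cover the checkpoint's stored positions.
fairseq-train data-bin/your-corpus \
    --arch transformer_wmt_en_de_big \
    --criterion label_smoothed_cross_entropy \
    --restore-file /media/kalle/Sprachdaten/mRASP/checkpoint_best.pt \
    --max-source-positions 300 --max-target-positions 300 \
    --max-tokens 2048
```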

Bachstelze commented 4 years ago

Do I also have to set them during generation? I get this error after fine-tuning:


Traceback (most recent call last):                                                                                                        
  File "/media/kalle/Sprachdaten/mRASP/train_environment/bin/fairseq-generate", line 8, in <module>
    sys.exit(cli_main())
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/generate.py", line 199, in cli_main
    main(args)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq_cli/generate.py", line 104, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 265, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/sequence_generator.py", line 113, in generate
    return self._generate(model, sample, **kwargs)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/sequence_generator.py", line 376, in _generate
    cand_scores, cand_indices, cand_beams = self.search.step(
  File "/media/kalle/Sprachdaten/mRASP/train_environment/lib/python3.8/site-packages/fairseq/search.py", line 81, in step
    torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
linzehui commented 4 years ago

It seems this is not a model-loading problem. From the log you posted, it might be due to a Python 3.8 issue. You may check whether there are empty lines in the source you generate from, or try Python < 3.8 to check whether the problem is caused by Python 3.8.
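
Two quick checks along these lines, as a sketch; the file names are placeholders for whatever you generate from:

```sh
# Count empty lines in the raw source/target files.
grep -c '^$' test.src test.trg

# Print the Python and PyTorch versions; fairseq 0.9.0 predates the
# integer-division change in newer PyTorch releases, so a version
# mismatch can also trigger the torch.div error above.
python -c "import sys, torch; print(sys.version); print(torch.__version__)"
```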