microsoft / AzureML-BERT

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
MIT License

Tensor size doesn't match. #58

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello!

We are pretraining BERT on a custom corpus. I followed the data preparation guide (the texts should be fine now); however, running the notebook gives the following error:

2 items cleaning up...
Cleanup took 0.0017843246459960938 seconds
06/28/2020 11:53:45 - INFO - __main__ -   Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "train.py", line 482, in <module>
    eval_loss = train(index)
  File "train.py", line 132, in train
    batch = next(dataloaders[dataset_type])
  File "train.py", line 47, in <genexpr>
    return (x for x in DataLoader(dataset, batch_size=train_batch_size // 2 if eval_set else train_batch_size,
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 144 and 128 in dimension 1 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1307

which I don't fully understand in this context.
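
If I read the traceback correctly, default_collate collected examples whose token-id tensors have different lengths (144 vs. 128 in dimension 1), and the final torch.stack call cannot stack tensors of different shapes into one batch. A minimal sketch of the same failure, with the two lengths taken from the error message:

import torch

# Two token-id tensors of different lengths (144 vs 128, the sizes reported
# in the error). torch.stack requires identical shapes outside the new batch
# dimension, so this raises the same
# "Sizes of tensors must match except in dimension 0" RuntimeError that
# default_collate hits when it batches such examples.
a = torch.zeros(144, dtype=torch.long)
b = torch.zeros(128, dtype=torch.long)
batch = torch.stack([a, b], 0)

So it looks like at least some of the prepared examples do not share a single max_seq_length.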

We also tried running with the English Wikipedia corpus data and got the same error. We have tried both the large-cased and multilingual-cased vocabs.
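
In case it is relevant, this is roughly the sanity check I am running over the prepared examples to see whether they all share a single sequence length (here dataset is just a placeholder for whatever the data preparation step produces, and the per-item layout is an assumption on my part, so it may not match the repo's loader exactly):

from collections import Counter

import torch

def count_sequence_lengths(dataset):
    # Count the distinct last-dimension lengths across the dataset.
    # Assumes each item is, or starts with, a 1-D tensor of token ids.
    lengths = Counter()
    for item in dataset:
        tensor = item[0] if isinstance(item, (tuple, list)) else item
        if torch.is_tensor(tensor):
            lengths[tensor.shape[-1]] += 1
    return lengths

# More than one key here would mean the examples were not padded/truncated
# to a single max_seq_length, which would explain the collate error above.
# print(count_sequence_lengths(dataset))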