Hello!

We are working on pretraining BERT on a custom corpus. I followed the guide on data preparation (the texts should be fine now); however, running the notebook gives the following error:
```
2 items cleaning up...
Cleanup took 0.0017843246459960938 seconds
06/28/2020 11:53:45 - INFO - __main__ - Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "train.py", line 482, in <module>
    eval_loss = train(index)
  File "train.py", line 132, in train
    batch = next(dataloaders[dataset_type])
  File "train.py", line 47, in <genexpr>
    return (x for x in DataLoader(dataset, batch_size=train_batch_size // 2 if eval_set else train_batch_size,
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 144 and 128 in dimension 1 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1307
```
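If I read the trace correctly, `default_collate` ends up calling `torch.stack` on examples whose sequence lengths differ (144 vs. 128 tokens), which this standalone snippet reproduces. To be clear, this is just my attempt to isolate the failure, not code from the training script:

```python
import torch

# Two "examples" with different sequence lengths, matching the 144 vs. 128 in the trace.
a = torch.zeros(144, dtype=torch.long)
b = torch.zeros(128, dtype=torch.long)

try:
    # default_collate ultimately calls torch.stack, which cannot batch unequal lengths:
    torch.stack([a, b], 0)
except RuntimeError as e:
    print(e)  # same "Sizes of tensors must match except in dimension 0" failure
```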
Even so, I don't fully understand why examples of different lengths are being produced in the current context. We also ran it with the English Wikipedia corpus and got the same failure, and we have tried both the large-cased and multilingual-cased vocabularies.
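Could the fix be to pad each batch to a common length before stacking? Below is a rough sketch of the kind of `collate_fn` I would try; `pad_collate` and the `pad_token_id=0` default are my own names and assumptions, and our real dataset presumably returns a tuple of tensors per example, so the padding would need to be applied per field:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pad_collate(batch, pad_token_id=0):
    """Right-pad each 1-D example to the longest sequence in the batch,
    then stack, so batching no longer requires equal lengths."""
    max_len = max(example.size(0) for example in batch)
    padded = [F.pad(example, (0, max_len - example.size(0)), value=pad_token_id)
              for example in batch]
    return torch.stack(padded, 0)

# Hypothetical usage (our actual DataLoader call is in train.py, line 47):
# loader = DataLoader(dataset, batch_size=train_batch_size, collate_fn=pad_collate)
```

Or is the data-preparation step supposed to emit fixed-length examples already, so that variable lengths indicate a problem in my preprocessing?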