labmlai / annotated_deep_learning_paper_implementations

🧑‍🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), gans(cyclegan, stylegan2, ...), 🎮 reinforcement learning (ppo, dqn), capsnet, distillation, ... 🧠
https://nn.labml.ai
MIT License

RETRO: RuntimeError: stack expects each tensor to be equal size, but got [2, 32] at entry 0 and [1, 32] at entry 29 #135

Open mocarsha opened 2 years ago

mocarsha commented 2 years ago

Hi,

I'm running the exact code from this GitHub repository for DeepMind's retrieval transformer (RETRO) and getting the following error:

RuntimeError: stack expects each tensor to be equal size, but got [2, 32] at entry 0 and [1, 32] at entry 29

Could you please help me with this? I used the same dataset as in the code.
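
For context, torch.stack raises this error whenever the tensors it is given do not all share the same shape. A minimal standalone sketch (independent of the RETRO code) that reproduces the same message:

import torch

# One "chunk" has 2 retrieved neighbors, another has only 1,
# so the per-chunk tensors have shapes [2, 32] and [1, 32].
a = torch.zeros(2, 32)
b = torch.zeros(1, 32)
torch.stack([a, b])
# RuntimeError: stack expects each tensor to be equal size,
# but got [2, 32] at entry 0 and [1, 32] at entry 1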

vpj commented 2 years ago

Can you please provide the full error?

Zahin112 commented 2 months ago

I am having the same issue while running train.py. Here is the full error:

Load data...[DONE]	2.39ms
Tokenize...[DONE]	29.36ms
Build vocabulary...[DONE]	0.62ms
Load BERT tokenizer...[DONE]	340.26ms
Load BERT model...[DONE]	882.21ms
Load index...[DONE]	69.50ms
2024-06-25 11:59:59.603955: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-25 11:59:59.604002: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-25 11:59:59.605299: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-25 11:59:59.611551: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-25 12:00:00.750669: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
No labml server url specified. Please start a labml server and specify the URL. Docs: https://github.com/labmlai/labml/tree/master/app

retro_small: 706e157632ea11ef989a0242ac1c000c
[clean]: "cleanup notebooks"
116:  Train:   5%  88,760ms  loss.train: 3.71168  88,760ms  0:00m/ 0:47m
Traceback (most recent call last):
  File "/content/annotated_deep_learning_paper_implementations/labml_nn/transformers/retro/train.py", line 225, in <module>
    train()
  File "/content/annotated_deep_learning_paper_implementations/labml_nn/transformers/retro/train.py", line 213, in train
    trainer()
  File "/content/annotated_deep_learning_paper_implementations/labml_nn/transformers/retro/train.py", line 134, in __call__
    for i, (src, tgt, neighbors) in monit.enum('Train', self.dataloader):
  File "/usr/local/lib/python3.10/dist-packages/labml/internal/monitor/iterator.py", line 84, in __next__
    next_value = next(self._iterator)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/labml_nn/transformers/retro/dataset.py", line 131, in __getitem__
    neighbors = torch.stack([torch.stack([self.tds.text_to_i(n) for n in chunks]) for chunks in s[2]])
RuntimeError: stack expects each tensor to be equal size, but got [2, 32] at entry 0 and [1, 32] at entry 31
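
The failing line stacks one tensor per chunk, and here one chunk apparently came back with only a single retrieved neighbor instead of two, so the inner stacks produce tensors of different shapes. A hedged workaround sketch (not the repository's own fix), reusing the s[2] and self.tds.text_to_i names visible in the traceback, would be to truncate every chunk's neighbor list to a common count before stacking:

# Sketch only, inside Dataset.__getitem__ in labml_nn/transformers/retro/dataset.py
n_neighbors = min(len(chunks) for chunks in s[2])  # smallest neighbor count across chunks
neighbors = torch.stack([
    torch.stack([self.tds.text_to_i(n) for n in chunks[:n_neighbors]])
    for chunks in s[2]
])

Dropping the offending sample or padding the shorter neighbor lists would work as well; the key point is that every inner stack must yield tensors of the same shape.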

Also, it says "No labml server url specified. Please start a labml server and specify the URL." Do I need to set up that server? Is it required for training? Could you explain, please?