Open richardsun-voyager opened 2 months ago
Same issue here!
I haven't seen this error before. Seems like you are hitting an empty batch randomly during the first epoch. Can you do:
export HYDRA_FULL_ERROR=1
and re-run to get a fuller stack trace and post it here?
For me, the error occurred halfway during pre-training.
Validation DataLoader 1: 3%|▎ | 4/121 [00:00<00:20, 5.79it/s][A
Epoch 0: 11%|█▏ | 4401/38424 [19:32<2:31:05, 3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]
Validation DataLoader 1: 4%|▍ | 5/121 [00:00<00:19, 5.83it/s][A
Epoch 0: 11%|█▏ | 4402/38424 [19:32<2:31:04, 3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]
Validation DataLoader 1: 5%|▍ | 6/121 [00:01<00:19, 5.86it/s][A
Epoch 0: 11%|█▏ | 4403/38424 [19:32<2:31:03, 3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]
Validation DataLoader 1: 6%|▌ | 7/121 [00:01<00:19, 5.88it/s][A
Epoch 0: 11%|█▏ | 4404/38424 [19:33<2:31:02, 3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: test/loss ▁
wandb: test/num_tokens ▁
wandb: test/perplexity ▁
wandb: timer/step █▁▂▁▁▁▁▂▂▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂
wandb: timer/validation ▁█
wandb: trainer/epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: trainer/loss █▅▅▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▂▁▁▂▂▁▁▁▂▂▂▁▁▂
wandb: trainer/lr/pg1 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb: trainer/lr/pg2 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb: trainer/lr/pg3 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb: val/loss ▁
wandb: val/num_tokens ▁
wandb: val/perplexity ▁
wandb:
wandb: Run summary:
wandb: epoch 0
wandb: test/loss 1.1343
wandb: test/num_tokens 64487424
wandb: test/perplexity 3.10899
wandb: timer/step 0.26093
wandb: timer/validation 55.69303
wandb: trainer/epoch 0.0
wandb: trainer/global_step 3999
wandb: trainer/loss 1.14593
wandb: trainer/lr/pg1 0.00048
wandb: trainer/lr/pg2 0.00048
wandb: trainer/lr/pg3 0.00048
wandb: val/loss 1.14128
wandb: val/num_tokens 73400320
wandb: val/perplexity 3.13077
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4
wandb: Find logs at: ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4/logs
Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=256', 'dataset.mlm=false', 'dataset.mlm_probability=0.0', 'dataset.rc_aug=true', 'model=hyena', 'model.d_model=256', 'model.n_layer=4', 'optimizer.lr=6e-4', 'train.global_batch_size=1024', 'trainer.max_steps=10000', 'trainer.devices=4', '+trainer.val_check_interval=2000', 'wandb.group=pretrain_hg38', 'wandb.name=hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4', 'wandb.mode=offline']
Traceback (most recent call last):
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zhan8855/Caduceus/train.py", line 719, in <module>
main()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/zhan8855/Caduceus/train.py", line 715, in main
train(config)
File "/home/zhan8855/Caduceus/train.py", line 680, in train
trainer.fit(model)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
self._run_validation()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
self.val_loop.run()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
batch = next(data_fetcher)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
batch = next(iterator)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/_utils.py", line 722, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 7.
Original Traceback (most recent call last):
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in <listcomp>
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 121, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1
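For what it's worth, the failure itself is easy to reproduce in isolation: default_collate bottoms out in torch.stack, which requires every tensor in the batch to have the same shape, so a single zero-length sample breaks the whole batch. A minimal standalone sketch (not using the Caduceus dataset at all):

import torch

good = torch.zeros(1024, dtype=torch.long)  # a normal 1024-token sample
empty = torch.zeros(0, dtype=torch.long)    # an empty sample, like entry 0 above

# Raises: RuntimeError: stack expects each tensor to be equal size,
# but got [0] at entry 0 and [1024] at entry 1
torch.stack([empty, good], 0)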
Not sure why you are encountering an empty batch. I would recommend adding a check / some print statements / breakpoints in the HG38 dataset __getitem__ here to see if you are returning an empty sequence, and go from there.
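Something like this rough sketch, assuming the variable names used inside __getitem__ (the self.fasta call is quoted from the dataset code later in this thread; adjust names to whatever the actual code uses):

# Hypothetical debugging check inside HG38Dataset.__getitem__
seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                 i_shift=shift_idx, return_augs=self.return_augs)
if len(seq) == 0:
    # Print everything needed to reproduce the empty fetch, then inspect.
    print(f"Empty sequence: {chr_name}:{start}-{end}, shift={shift_idx}")
    breakpoint()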
Did you solve this problem?
Not yet... As a workaround, I have disabled validation during pre-training.
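In case it helps anyone else: since the run already forwards trainer flags through Hydra overrides (e.g. +trainer.val_check_interval=2000), one way to skip validation entirely might be an override like

+trainer.limit_val_batches=0

(drop the + if the key already exists in the trainer config). limit_val_batches=0 tells the pytorch_lightning Trainer to run no validation batches.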
I just want to report that I have also seen this problem in some runs but not all. I am using two different clusters to test, and it is only happening on the NCSA DELTA system. I have not yet figured out what is different in the runs where I see this problem.
@zhan8855 does it only happen in the validation step?
@leannmlindsey For me it is, and it seems that it most likely happens when the training loader and validation loader are running simultaneously.
Same issue here!
I am on aarch64.
I am facing the same issue unfortunately. It happens during eval. Investigating now...
I have solved this issue. The file is '02_caduceus/src/dataloaders/datasets/hg38_dataset.py'. In that file, within the __getitem__ function, the line

seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)

sometimes returns an empty value even though a result actually exists. A simple retry loop can be used to resolve this, for example:

i = 0
while len(seq) == 0 and i < 100:
    seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
    i += 1
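One caveat with that fix: the retry presumably only helps because self.fasta is nondeterministic for the same arguments (perhaps due to the augmentation / shift logic); if it deterministically returned empty for a given interval, the loop would exit after 100 tries with seq still empty and the collate error would resurface. It may be worth failing loudly in that case, e.g.:

i = 0
while len(seq) == 0 and i < 100:
    seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
    i += 1
if len(seq) == 0:
    # Surface the bad interval here instead of letting torch.stack fail in a worker.
    raise RuntimeError(f"Empty sequence after {i} retries: {chr_name}:{start}-{end}")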
I tried to reproduce the pre-training experiment using the command
But after the first epoch, I got an error saying:
Can you look into the issue? Thanks!