kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Error when reproducing the pre-training #16

Open richardsun-voyager opened 2 months ago

richardsun-voyager commented 2 months ago

I tried to reproduce the pre-training experiment using the command

python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=1024 \
  dataset.batch_size=1024 \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug=false \
  model=caduceus \
  model.config.d_model=64 \
  model.config.n_layer=1 \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=add \
  model.config.bidirectional_weight_tie=true \
  model.config.rcps=true \
  optimizer.lr="8e-3" \
  train.global_batch_size=8 \
  trainer.max_steps=10000 \
  +trainer.val_check_interval=100 \
  wandb=null

But partway through the first epoch, the run failed with the following error:

 File "/home/users/**/miniforge3/envs/caduceus_env/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:   0%|▏                                                                                               | 245/122081 [02:04<17:07:50,  1.98it/s, loss=1.23]

Can you look into the issue? Thanks!

zhan8855 commented 2 months ago

The same issue!

yair-schiff commented 2 months ago

I haven't seen this error before. It seems like you are randomly hitting an empty batch during the first epoch. Can you do:

export HYDRA_FULL_ERROR=1

and re-run to get a more complete stack trace, then post it here?

zhan8855 commented 2 months ago

For me, the error occurred partway through pre-training.

Validation DataLoader 1:   3%|▎         | 4/121 [00:00<00:20,  5.79it/s]
Epoch 0:  11%|█▏        | 4401/38424 [19:32<2:31:05,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   4%|▍         | 5/121 [00:00<00:19,  5.83it/s]
Epoch 0:  11%|█▏        | 4402/38424 [19:32<2:31:04,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   5%|▍         | 6/121 [00:01<00:19,  5.86it/s]
Epoch 0:  11%|█▏        | 4403/38424 [19:32<2:31:03,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   6%|▌         | 7/121 [00:01<00:19,  5.88it/s]
Epoch 0:  11%|█▏        | 4404/38424 [19:33<2:31:02,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]wandb: Waiting for W&B process to finish... (failed 1).
wandb: 
wandb: Run history:
wandb:               epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           test/loss ▁
wandb:     test/num_tokens ▁
wandb:     test/perplexity ▁
wandb:          timer/step █▁▂▁▁▁▁▂▂▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂
wandb:    timer/validation ▁█
wandb:       trainer/epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:        trainer/loss █▅▅▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▂▁▁▂▂▁▁▁▂▂▂▁▁▂
wandb:      trainer/lr/pg1 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:      trainer/lr/pg2 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:      trainer/lr/pg3 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:            val/loss ▁
wandb:      val/num_tokens ▁
wandb:      val/perplexity ▁
wandb: 
wandb: Run summary:
wandb:               epoch 0
wandb:           test/loss 1.1343
wandb:     test/num_tokens 64487424
wandb:     test/perplexity 3.10899
wandb:          timer/step 0.26093
wandb:    timer/validation 55.69303
wandb:       trainer/epoch 0.0
wandb: trainer/global_step 3999
wandb:        trainer/loss 1.14593
wandb:      trainer/lr/pg1 0.00048
wandb:      trainer/lr/pg2 0.00048
wandb:      trainer/lr/pg3 0.00048
wandb:            val/loss 1.14128
wandb:      val/num_tokens 73400320
wandb:      val/perplexity 3.13077
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4
wandb: Find logs at: ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4/logs
Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=256', 'dataset.mlm=false', 'dataset.mlm_probability=0.0', 'dataset.rc_aug=true', 'model=hyena', 'model.d_model=256', 'model.n_layer=4', 'optimizer.lr=6e-4', 'train.global_batch_size=1024', 'trainer.max_steps=10000', 'trainer.devices=4', '+trainer.val_check_interval=2000', 'wandb.group=pretrain_hg38', 'wandb.name=hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4', 'wandb.mode=offline']
Traceback (most recent call last):
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhan8855/Caduceus/train.py", line 719, in <module>
    main()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/zhan8855/Caduceus/train.py", line 715, in main
    train(config)
  File "/home/zhan8855/Caduceus/train.py", line 680, in train
    trainer.fit(model)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 7.
Original Traceback (most recent call last):
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 121, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1
yair-schiff commented 2 months ago

Not sure why you are encountering an empty batch. I would recommend adding a check / some print statements / breakpoints in the HG38 dataset's __getitem__ here to see if it is returning an empty sequence, and go from there.
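
A minimal sketch of such a check (illustrative only, not the repo's actual code; the variable names are taken from the __getitem__ snippet quoted later in this thread):

# Hypothetical check inside HG38Dataset.__getitem__ in hg38_dataset.py
seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                 i_shift=shift_idx, return_augs=self.return_augs)
if len(seq) == 0:
    # Log the coordinates that produced the empty fetch so the bad index
    # can be reproduced outside the DataLoader worker processes.
    print(f"Empty sequence at {chr_name}:{start}-{end} (shift={shift_idx})")
    breakpoint()  # note: breakpoint() only works with num_workers=0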

richardsun-voyager commented 2 months ago

(quotes zhan8855's comment and stack trace above)

Did you solve this problem?

zhan8855 commented 2 months ago

Not yet... As a workaround, I have currently disabled validation during pre-training.
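
For anyone needing the same workaround: assuming the repo's Hydra trainer group forwards its arguments to pytorch_lightning.Trainer (as the +trainer.val_check_interval overrides in this thread suggest), validation can be disabled from the command line with Lightning's standard limit_val_batches flag, e.g.:

python -m train experiment=hg38/hg38 +trainer.limit_val_batches=0 <other overrides as above>

Setting limit_val_batches=0 makes Lightning skip the validation loop entirely, which sidesteps the failing collate in the validation DataLoader without touching the training data path.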

leannmlindsey commented 2 months ago

I just want to report that I have also seen this problem in some runs, but not all. I am testing on two different clusters, and it only happens on the NCSA DELTA system. I have not yet figured out what is different about the runs where I see this problem.

leannmlindsey commented 2 months ago

@zhan8855 does it only happen in the validation step?

zhan8855 commented 2 months ago

@leannmlindsey For me it is, and it seems to happen most often when the training loader and validation loader are running simultaneously.

GengGengJiuXi commented 1 month ago

The same issue!

I am on aarch64.

smdrnks commented 1 month ago

I am facing the same issue, unfortunately. It happens during eval. Investigating now...

GengGengJiuXi commented 1 month ago

I have solved this issue. The file is '02_caduceus/src/dataloaders/datasets/hg38_dataset.py'. In that file, within the __getitem__ function, the line

seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)

may sometimes return an empty value, even though a result actually exists. A simple retry loop resolves the issue, for example:

i = 0
while len(seq) == 0 and i < 100:
    seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
    i += 1
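
A slightly more defensive variant of the same idea (a sketch, not code from the repo) fails loudly once the retries are exhausted, so an unrecoverable empty sequence raises an informative error at the dataset level instead of surfacing as the collate-time size mismatch above:

# Retry the FASTA fetch on empty results, then raise with the offending
# coordinates rather than returning a zero-length sample to the collate_fn.
retries = 0
while len(seq) == 0 and retries < 100:
    seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                     i_shift=shift_idx, return_augs=self.return_augs)
    retries += 1
if len(seq) == 0:
    raise RuntimeError(
        f"Empty sequence after {retries} retries at {chr_name}:{start}-{end}"
    )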