kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

Unexpected data loader error during pre-training #20

Closed zhan8855 closed 2 months ago

zhan8855 commented 2 months ago

Hi, thank you for your awesome work!

When I ran run_pretrain_hyena.sh, I got an unexpected error during validation. It seems that something is wrong with the data fetcher, which sometimes returns an empty tensor. The full error log is:

Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=256', 'dataset.mlm=false', 'dataset.mlm_probability=0.0', 'dataset.rc_aug=true', 'model=hyena', 'model.d_model=256', 'model.n_layer=4', 'optimizer.lr=6e-4', 'train.global_batch_size=1024', 'trainer.max_steps=10000', 'trainer.devices=4', '+trainer.val_check_interval=2000', 'wandb.group=pretrain_hg38', 'wandb.name=hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4_step-10000', 'wandb.mode=offline']
Traceback (most recent call last):
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhan8855/Caduceus/train.py", line 720, in <module>
    main()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/zhan8855/Caduceus/train.py", line 716, in main
    train(config)
  File "/home/zhan8855/Caduceus/train.py", line 681, in train
    trainer.fit(model)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 25.
Original Traceback (most recent call last):
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 121, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/zhan8855/scratch/caduceus_env/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1
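
The last line of the log says that one sample in the validation batch has shape [0] while another has shape [1024], so the default collate function cannot stack them. A minimal sketch for locating such samples is below. It assumes the dataset supports `len()` and integer indexing and that each item is a tuple whose first element is the input sequence. The helper name `find_bad_samples` and the toy dataset are hypothetical and not part of the Caduceus code base.

```python
def find_bad_samples(dataset, expected_len, limit=10):
    """Return (index, length) pairs for samples whose input length differs from expected_len.

    Assumes dataset[idx] is a tuple whose first element supports len(),
    e.g. a list or a 1-D tensor. Stops after `limit` offending samples.
    """
    bad = []
    for idx in range(len(dataset)):
        inputs = dataset[idx][0]
        if len(inputs) != expected_len:
            bad.append((idx, len(inputs)))
            if len(bad) >= limit:
                break
    return bad


# Toy stand-in for a validation split: the second sample is empty,
# mirroring the [0] vs. [1024] mismatch reported in the traceback.
toy_dataset = [
    ([1] * 1024, [2] * 1024),
    ([], []),
    ([3] * 1024, [4] * 1024),
]

print(find_bad_samples(toy_dataset, expected_len=1024))
```

Running the same scan over the real validation dataset with `expected_len` set to `dataset.max_length` would show which indices produce empty sequences, which may help narrow the problem down to specific genomic intervals or to the data files themselves.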

yair-schiff commented 2 months ago

I have unfortunately not encountered this error before.

Which pytorch-lightning version are you using?

zhan8855 commented 2 months ago

I am using pytorch-lightning 1.8.6, installed via pip as specified in caduceus_env.yml.

yair-schiff commented 2 months ago

Closing this as duplicate of https://github.com/kuleshov-group/caduceus/issues/16