ldzhangyx / instruct-MusicGen

The official implementation of our paper "Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning".
Apache License 2.0

Error when trying to recreate training on moisesDB #4

Open Saltb0xApps opened 1 week ago

Saltb0xApps commented 1 week ago

Hey! I cloned the repo, but I'm facing this error when trying to run training on moisesDB from scratch.

Training command: python3 src/train.py trainer=gpu

Wandb link: https://wandb.ai/akhiltolani/instruct-musicgen/runs/6s3xo4to/logs?nw=nwuserakhiltolani

Main Error:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size
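For context, this error comes from the length check inside torch's `default_collate`: when each sample contains a sequence (e.g. a list of stem tensors) under the same dict key, all of those sequences must have the same length across the batch. A simplified pure-Python sketch of that check (not torch's actual code) shows how unequal-length samples trip it:

```python
def collate_sequences(batch):
    """Simplified sketch of the length check inside torch's default_collate:
    every per-sample sequence in the batch must have equal length before
    the batch can be transposed and stacked."""
    it = iter(batch)
    elem_size = len(next(it))
    if not all(len(elem) == elem_size for elem in it):
        raise RuntimeError("each element in list of batch should be of equal size")
    # Transpose: group the i-th element of every sample together.
    return [list(samples) for samples in zip(*batch)]

# Two samples whose sequences differ in length reproduce the error:
try:
    collate_sequences([[0.0, 0.0], [0.0, 0.0, 0.0]])
except RuntimeError as e:
    print(e)  # each element in list of batch should be of equal size
```

So the failure points at the moisesDB samples themselves: some field of the sample dict varies in size from track to track.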

Complete training logs (including the data-loading part, which is not on wandb):

Loading tracks info from provider moisesdb_v0.1: 100%|██████████| 240/240 [00:00<00:00, 14540.00it/s]
[2024-06-23 02:08:29,154][__main__][INFO] - [rank: 0] Instantiating model <src.models.instructmusicgenadapter_module.InstructMusicGenAdapterLitModule>
/home/akhil/.local/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
[2024-06-23 02:08:37,123][src.audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
load....musicgen bk
lm_bk, here
[2024-06-23 02:08:42,798][__main__][INFO] - [rank: 0] Instantiating callbacks...
[2024-06-23 02:08:42,799][src.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-06-23 02:08:42,804][src.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.EarlyStopping>
[2024-06-23 02:08:42,806][src.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.RichModelSummary>
[2024-06-23 02:08:42,806][src.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.RichProgressBar>
[2024-06-23 02:08:42,807][__main__][INFO] - [rank: 0] Instantiating loggers...
[2024-06-23 02:08:42,807][src.utils.instantiators][INFO] - [rank: 0] Instantiating logger <lightning.pytorch.loggers.wandb.WandbLogger>
[2024-06-23 02:08:42,957][__main__][INFO] - [rank: 0] Instantiating trainer <lightning.pytorch.trainer.Trainer>
/home/akhil/.local/lib/python3.9/site-packages/lightning/fabric/connector.py:571: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-23 02:08:43,109][__main__][INFO] - [rank: 0] Logging hyperparameters!
wandb: Currently logged in as: akhiltolani. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /home/akhil/instruct-MusicGen/logs/train/runs/2024-06-23_02-08-28/wandb/run-20240623_020844-6s3xo4to
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gallant-valley-80
wandb: ⭐️ View project at https://wandb.ai/akhiltolani/instruct-musicgen
wandb: 🚀 View run at https://wandb.ai/akhiltolani/instruct-musicgen/runs/6s3xo4to
[2024-06-23 02:08:44,943][__main__][INFO] - [rank: 0] Starting training!
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃    ┃ Name                              ┃ Type             ┃ Params ┃ Mode  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0  │ model                             │ Instructor       │  4.7 B │ train │
│ 1  │ model.peft_model                  │ CondMusicgen     │  4.5 B │ train │
│ 2  │ model.peft_model.lm               │ LMModel          │  4.5 B │ eval  │
│ 3  │ model.cp_transformer              │ CPTransformer    │  3.3 B │ train │
│ 4  │ model.cp_transformer.merge_linear │ ModuleList       │  201 M │ train │
│ 5  │ model.cp_transformer.layers       │ ModuleList       │  3.0 B │ train │
│ 6  │ criterion_1                       │ CrossEntropyLoss │      0 │ train │
│ 7  │ criterion_2                       │ L1Loss           │      0 │ train │
│ 8  │ train_acc                         │ Perplexity       │      0 │ train │
│ 9  │ val_acc                           │ Perplexity       │      0 │ train │
│ 10 │ test_acc                          │ Perplexity       │      0 │ train │
│ 11 │ train_loss                        │ MeanMetric       │      0 │ train │
│ 12 │ val_loss                          │ MeanMetric       │      0 │ train │
│ 13 │ test_loss                         │ MeanMetric       │      0 │ train │
│ 14 │ val_acc_best                      │ MinMetric        │      0 │ train │
└────┴───────────────────────────────────┴──────────────────┴────────┴───────┘
Trainable params: 251 M
Non-trainable params: 4.5 B
Total params: 4.7 B
Total estimated model params size (MB): 18.9 K

[2024-06-23 02:11:19,114][src.utils.utils][ERROR] - [rank: 0]
Traceback (most recent call last):
  File "/home/akhil/instruct-MusicGen/src/utils/utils.py", line 68, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/akhil/instruct-MusicGen/src/train.py", line 92, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    self._run_sanity_check()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1057, in _run_sanity_check
    val_loop.run()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size

[2024-06-23 02:11:19,124][src.utils.utils][INFO] - [rank: 0] Output dir: /home/akhil/instruct-MusicGen/logs/train/runs/2024-06-23_02-08-28
[2024-06-23 02:11:19,124][src.utils.utils][INFO] - [rank: 0] Closing wandb!
wandb:
wandb: 🚀 View run gallant-valley-80 at: https://wandb.ai/akhiltolani/instruct-musicgen/runs/6s3xo4to
wandb: ⭐️ View project at: https://wandb.ai/akhiltolani/instruct-musicgen
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./logs/train/runs/2024-06-23_02-08-28/wandb/run-20240623_020844-6s3xo4to/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Error executing job with overrides: ['trainer=gpu']
Traceback (most recent call last):
  File "/home/akhil/instruct-MusicGen/src/train.py", line 125, in main
    metric_dict, _ = train(cfg)
  File "/home/akhil/instruct-MusicGen/src/utils/utils.py", line 78, in wrap
    raise ex
  File "/home/akhil/instruct-MusicGen/src/utils/utils.py", line 68, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/akhil/instruct-MusicGen/src/train.py", line 92, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    self._run_sanity_check()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1057, in _run_sanity_check
    val_loop.run()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/akhil/.local/lib/python3.9/site-packages/lightning/pytorch/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/home/akhil/.local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Steps to reproduce:

  1. Clone the repo.
  2. Download moisesDB and unzip it to disk.
  3. Update the hardcoded paths in the repo.
  4. Run the training command.
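A plausible cause is that moisesDB tracks vary in stem count or length per sample, whereas Slakh yields uniformly sized samples, so torch's `default_collate` fails on the ragged batch. One hedged workaround (a sketch, not the repo's actual API; the `"wav"` field name and 1-D layout are assumptions) is to pass a custom `collate_fn` that pads each tensor field to the batch maximum before stacking:

```python
import torch


def pad_collate(batch):
    """Hypothetical collate_fn sketch: pads each tensor field along its
    last dimension to the longest length in the batch, then stacks the
    padded tensors; non-tensor fields are kept as plain lists."""
    out = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        if torch.is_tensor(values[0]):
            max_len = max(v.shape[-1] for v in values)
            values = [
                torch.nn.functional.pad(v, (0, max_len - v.shape[-1]))
                for v in values
            ]
            out[key] = torch.stack(values)
        else:
            out[key] = values
    return out


# Usage (assuming the datamodule exposes its DataLoader kwargs):
# loader = DataLoader(dataset, batch_size=4, collate_fn=pad_collate)
```

Whether padding is the right semantics here depends on how the training code consumes the stems; cropping all samples to a fixed duration in the dataset itself would be the alternative.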
Saltb0xApps commented 1 week ago

Update: training on the Slakh dataset works right out of the box!

ldzhangyx commented 1 week ago

Thanks. I will look into this problem.