lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Bug in CutPairsSampler with CutSet.from_files(): Fails to raise StopIteration at the end of dataset iteration, raises `AttributeError: 'tuple' object has no attribute 'subset'` #1396

Open Aijohc opened 2 months ago

Aijohc commented 2 months ago

Hello, and thank you for the excellent work with Lhotse's data management features!

I encountered a bug when using CutPairsSampler. When I load my source_cuts and target_cuts using CutSet.from_files() (with a list of .jsonl.gz files), the expected StopIteration exception is not raised correctly at the end of the dataset iteration. Instead, I encounter a different error:

W0920 21:49:06.145000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
W0920 21:49:06.146000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
W0920 21:49:06.148000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
Traceback (most recent call last):
  File "[PATH]/trainer.py", line 1186, in <module>
    main()
  File "[PATH]/trainer.py", line 1177, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
    while not context.join():
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "[PATH]/trainer.py", line 1067, in run
    train_one_epoch(
  File "[PATH]/trainer.py", line 822, in train_one_epoch
    valid_info = compute_validation_loss(
  File "[PATH]/trainer.py", line 558, in compute_validation_loss
    for batch_idx, batch in enumerate(valid_dl):
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1368, in _process_data
    self._try_put_index()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1350, in _try_put_index
    index = self._next_index()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "[PATH]/site-packages/lhotse/dataset/sampling/base.py", line 323, in __next__
    combined = combined + combined.subset(first=diff).modify_ids(
AttributeError: 'tuple' object has no attribute 'subset'

CutPairsSampler

train_sampler = CutPairsSampler(
    cuts_train[0],
    cuts_train[1],
    max_target_duration=40,
    shuffle=True,
    drop_last=False,
    seed=self.args.seed,
)
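
For context, the sampler is passed to a standard PyTorch DataLoader with batch_size=None, roughly like below (a sketch of my setup; train_dataset is a placeholder for the actual dataset class):

from torch.utils.data import DataLoader

# Typical lhotse pattern: the sampler emits whole batches (here, pairs of
# CutSets), so the DataLoader must not do any batching of its own.
train_dl = DataLoader(
    train_dataset,   # placeholder: maps the sampled cuts to batch tensors
    sampler=train_sampler,
    batch_size=None,
    num_workers=2,
)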

Cuts

def train_cuts(self) -> Tuple[CutSet, CutSet]:
    logging.info("About to get train cuts")
    prompt_files = list(
        sorted(self.args.manifest_dir.glob("train/*.cuts.prompts.jsonl.gz"))
    )
    target_files = list(
        sorted(self.args.manifest_dir.glob("train/*.cuts.targets.jsonl.gz"))
    )
    prompts = CutSet.from_files(prompt_files + target_files, shuffle_iters=False)
    targets = CutSet.from_files(target_files + prompt_files, shuffle_iters=False)
    return prompts, targets

lhotse version: 1.26.0

I believe this could be an issue with how the end of the dataset is handled when iterating over CutPairsSampler. Could you please investigate this?
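
If it helps, a minimal way to exercise the same end-of-iteration code path without the DataLoader might look like this (a sketch; the manifest paths are placeholders for my actual files):

from lhotse import CutSet
from lhotse.dataset import CutPairsSampler

# Placeholder manifests standing in for the real prompt/target cut files.
prompts = CutSet.from_files(["train/xxx.cuts.prompts.jsonl.gz"], shuffle_iters=False)
targets = CutSet.from_files(["train/xxx.cuts.targets.jsonl.gz"], shuffle_iters=False)

sampler = CutPairsSampler(
    prompts,
    targets,
    max_target_duration=40,
    shuffle=True,
    drop_last=False,
    seed=0,
)

# Exhausting the sampler should end cleanly with StopIteration, but instead
# the AttributeError from base.py appears once the cuts run out.
for source_batch, target_batch in sampler:
    pass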

Thanks again for your hard work!

Additional Question:

I also have a question regarding CutPairsSampler. Is it possible to specify parameters like buffer_size and quadratic_duration, similar to those in DynamicBucketingSampler? These parameters are very important when working with DynamicBucketingSampler, and I noticed they are not directly available in CutPairsSampler. Could you consider supporting them?

Thank you!

pzelasko commented 1 month ago

Regarding the first issue, it looks like I haven't updated CutPairsSampler properly with the latest changes. I'll take a look.

Regarding the other question, you might want to use DynamicCutSampler or DynamicBucketingSampler instead; if you give them more than one CutSet, they act like CutPairsSampler (and support triples, quadruples, and so on as well). In fact, CutPairsSampler should be deprecated at this point.
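
A rough sketch of the replacement (parameter values are only illustrative):

from lhotse.dataset import DynamicBucketingSampler

# Passing two CutSets makes the sampler yield (source, target) CutSet pairs,
# like CutPairsSampler; buffer_size and quadratic_duration are supported here.
train_sampler = DynamicBucketingSampler(
    cuts_train[0],          # prompts
    cuts_train[1],          # targets
    max_duration=40,        # illustrative batch duration budget
    num_buckets=30,         # illustrative
    quadratic_duration=15,  # illustrative
    buffer_size=10000,      # illustrative
    shuffle=True,
    drop_last=False,
    seed=0,
)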

Aijohc commented 1 month ago

Thank you for your answer! So, does that mean DynamicCutSampler can completely replace CutPairsSampler? I will give it a try. Thanks!

pzelasko commented 1 month ago

Yes, it can.