lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Error when using the Lhotse Shar data format for multi-GPU K2 model training #1408

Open pengyizhou opened 1 day ago

pengyizhou commented 1 day ago

Hi! Recently, we have been training on a large-scale dataset (>50M audio segments) with the K2 platform. To reduce I/O operations through NFS, we decided to use the Lhotse Shar format.

I followed the instructions from https://github.com/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb. It worked well when I used one GPU with one or more workers. However, I got an unexpected error when training the model on multiple GPUs.
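For context, my data-loading code follows the notebook's multi-worker pattern, roughly like this (the path, max_duration, and num_workers here are placeholders, not my exact values):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.dataloading import make_worker_init_fn
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

def build_train_dl(rank: int, world_size: int) -> DataLoader:
    # With Shar, shard assignment across nodes/workers is handled by the
    # worker_init_fn, so the sampler itself is configured as if single-process.
    cuts = CutSet.from_shar(in_dir="data/shar", shuffle_shards=True).repeat()
    sampler = DynamicBucketingSampler(cuts, max_duration=100.0, rank=0, world_size=1)
    return DataLoader(
        IterableDatasetWrapper(dataset=K2SpeechRecognitionDataset(), sampler=sampler),
        batch_size=None,
        num_workers=2,
        worker_init_fn=make_worker_init_fn(rank=rank, world_size=world_size),
    )
```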

Here is the error information:

```
Traceback (most recent call last):
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1899, in <module>
    main()
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1890, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1727, in run
    train_one_epoch(
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1282, in train_one_epoch
    for batch_idx, batch in enumerate(train_dl):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
```

It looks like a generator from train_dl is being pickled when the DataLoader spawns its worker processes. I tried to debug it, and the only way I could make it work was by setting num_workers=0.
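For completeness, the working configuration is just this, with sampler built as in the snippet above:

```python
from torch.utils.data import DataLoader
from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# num_workers=0 keeps iteration in the main process, so the dataset and its
# generators are never sent through ForkingPickler at all.
train_dl = DataLoader(
    IterableDatasetWrapper(dataset=K2SpeechRecognitionDataset(), sampler=sampler),
    batch_size=None,
    num_workers=0,
)
```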

I am using the Shar format with the fields "cuts" and "features", and 4 GPUs. Has anyone had similar issues?

pzelasko commented 19 hours ago

You might just get lucky if you set lhotse.set_dill_enabled(True) somewhere in your code (or LHOTSE_DILL_ENABLED=1). Otherwise you'll have to try to debug where this generator object is created. I don't think I've ever run into this issue with Lhotse, so I suspect it may be somewhere in the user code (typically some .map or .filter method called on lhotse objects with a lambda function, etc.).
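Roughly, i.e. something like this early in your script, before the DataLoader is built (note that dill has to be installed first):

```python
# pip install dill
import lhotse

# Switch Lhotse to dill-based serialization, so lambdas and closures that the
# stdlib pickle rejects can cross the process boundary. Must run before the
# DataLoader workers are spawned.
lhotse.set_dill_enabled(True)
```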

I used this snippet in the past to find the unpicklable objects: https://gist.github.com/pzelasko/90c1c13acd86f6c9c0aa4a3fa69dadba
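The idea, in rough sketch form (the gist itself may differ), is to walk the object graph and report which parts fail to pickle:

```python
import pickle

def find_unpicklable(obj, path="root", seen=None):
    """Recursively report sub-objects that stdlib pickle cannot serialize."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return  # avoid cycles
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return  # this subtree pickles fine
    except Exception as e:
        print(f"{path}: {type(obj).__name__} -> {e}")
    # Descend into common containers / attributes to narrow down the culprit.
    if isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple, set)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{k}", v) for k, v in vars(obj).items()]
    else:
        children = []
    for child_path, child in children:
        find_unpicklable(child, child_path, seen)
```

Running it on the objects you hand to the DataLoader (e.g. find_unpicklable(train_dl.dataset)) should point at the offending generator.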

pengyizhou commented 16 hours ago

> You might just get lucky if you set lhotse.set_dill_enabled(True) somewhere in your code (or LHOTSE_DILL_ENABLED=1). Otherwise you'll have to try to debug where this generator object is created. I don't think I've ever run into this issue with Lhotse, so I suspect it may be somewhere in the user code (typically some .map or .filter method called on lhotse objects with a lambda function, etc.).
>
> I used this snippet in the past to find the unpicklable objects: https://gist.github.com/pzelasko/90c1c13acd86f6c9c0aa4a3fa69dadba

Thank you very much for your reply! I saw in the Lhotse codebase that calling set_dill_enabled(True) sets the environment variable LHOTSE_DILL_ENABLED=1, and that it checks whether the dill package is installed. I have not installed the dill package. I tried printing this variable during training, and the output was None, so I believe the error is caused by something else. I will try to debug further.
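Roughly what I checked inside the training loop:

```python
import os

# set_dill_enabled(True) would export this variable; in my run it was never
# called, so the lookup returns None and dill mode was never active.
print("LHOTSE_DILL_ENABLED =", os.environ.get("LHOTSE_DILL_ENABLED"))
```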