kitzeslab / opensoundscape

Open source, scalable software for the analysis of bioacoustic recordings
http://opensoundscape.org
MIT License

Encountered RuntimeError when attempting to train CNN with parallel workers #1040

Closed ofsoundmind closed 1 month ago

ofsoundmind commented 3 months ago

~python3.9/multiprocessing/spawn.py on OSX 14.6.1

https://pytorch.org/docs/stable/notes/windows.html#multiprocessing-error-without-if-clause-protection

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:

   if __name__ == '__main__':
       freeze_support()
       ...

The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

The implementation of multiprocessing is different on Windows, which uses spawn instead of fork. So we have to wrap the code with an if-clause to protect the code from executing multiple times. Refactor your code into the following structure.

    import torch

    def main():
        # dataloader is assumed to be a torch.utils.data.DataLoader created earlier
        for i, data in enumerate(dataloader):
            # do something here
            pass

    if __name__ == '__main__':
        main()

sammlapp commented 3 months ago

Hi @ofsoundmind thanks for letting us know about this issue. Could you please provide some clarifying details to help us debug this issue? From your post, I can't tell what text you have copied from the error message versus from the thread you linked. Can you provide: (1) the lines of code you are running in your python script; (2) the full text of the error message; and (3) the version of python, opensoundscape and torch in your python environment?

Because you are on a Mac, I don't think that the thread about Windows is relevant to your issue.

ofsoundmind commented 3 months ago

Hi @sammlapp. Thanks for your quick reply, and sorry for leaving out those details. I have posted the code I am working on and some test audio files at https://github.com/ofsoundmind/kihikihi.

python 3.9.19, opensoundscape 0.10.2, pytorch 2.3.1

Below is the error message I got when running the code without "if __name__ == '__main__':".

    Training Epoch 0   0%|          | 0/120 [00:00<?, ?it/s]
    Training Epoch 0   0%|          | 0/120 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
        prepare(preparation_data)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
        main_content = runpy.run_path(main_path,
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/runpy.py", line 288, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/Users/marchasenbank/Projects/kihikihi/preprocessing_and_model_train.py", line 67, in <module>
        model.train(
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/opensoundscape/ml/cnn.py", line 915, in train
        train_targets, train_scores = self._train_epoch(
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/opensoundscape/ml/cnn.py", line 612, in _train_epoch
        for batch_idx, samples in enumerate(
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
        for obj in iterable:
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
        return self._get_iterator()
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
        return _MultiProcessingDataLoaderIter(self)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
        w.start()
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/process.py", line 121, in start
        self._popen = self._Popen(self)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
        return Popen(process_obj)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
        _check_not_importing_main()
      File "/opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
        raise RuntimeError('''
    RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

If I include "if __name__ == '__main__':" then the code appears to run but returns the warning below for each worker instance:

    /opt/homebrew/anaconda3/envs/kihikihi_env/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py:222: UserWarning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1716905753886/work/aten/src/ATen/ParallelNative.cpp:228.)
      torch.set_num_threads(1)

sammlapp commented 3 months ago

Thanks for the details. I will try to reproduce the error later today, but I haven't seen anything like this before. It's especially surprising considering that you have set num_workers = 0 (FYI, we typically train with at least num_workers=4, and ideally higher). With 0, it should only be using the root process, so I'm not sure why it's trying to spawn processes with multiprocessing and create the _MultiProcessingDataLoaderIter object.

@ofsoundmind can you confirm the value of num_workers that produced this error?

sammlapp commented 3 months ago

Based on a few other threads, it does seem like wrapping the code that trains the model in if __name__ == '__main__': is the suggested solution here. In general, using this if block is good practice and is required on Windows systems; it is usually applied to the entire main script rather than just a few lines (for details see this post). The idea is to make sure that the code runs only once, in the main process, even if other processes are started (e.g., to avoid an infinite recursive loop of starting new processes).
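
To illustrate why the guard keeps the training code from running more than once: with the spawn start method, each worker process re-executes the main script under the module name __mp_main__ (the traceback earlier in this thread passes through runpy.run_path in multiprocessing/spawn.py for this reason). The sketch below imitates only that step with runpy and starts no processes. The script contents, the file name, and the run_as helper are hypothetical and exist purely for illustration.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical stand-in for a training script with a guarded entry point.
SCRIPT = textwrap.dedent('''
    events = ["import-time code ran"]

    def main():
        events.append("training started")

    if __name__ == "__main__":
        main()
''')


def run_as(run_name):
    """Execute the script under the given module name and return its events."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train_script.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        namespace = runpy.run_path(path, run_name=run_name)
    return namespace["events"]


# The parent process runs the script as __main__, so the guarded block executes.
as_parent = run_as("__main__")
# A spawned worker re-runs the same file as __mp_main__, so the guard is skipped.
as_child = run_as("__mp_main__")

print(as_parent)  # ['import-time code ran', 'training started']
print(as_child)   # ['import-time code ran']
```

Anything left at module level outside the guard still runs in every worker, which is why the guard is usually placed around the whole script body and not just the call that starts training.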

I've inquired here in the PyTorch forums about the need to do this on Mac OS, and will wait for a reply before we make a change to our documentation.
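
As a quick check of whether a given machine is affected, the standard library can report which start method multiprocessing uses by default. Python's documentation states that spawn has been the default on macOS since Python 3.8, and that the main module must be safely importable under spawn and forkserver. A minimal sketch, with hypothetical helper names:

```python
import multiprocessing as mp
import platform
import sys


def default_start_method():
    """Return the start method multiprocessing uses by default on this machine."""
    return mp.get_start_method()


def needs_main_guard(start_method):
    """True for start methods that require a safely importable main module."""
    return start_method in ("spawn", "forkserver")


method = default_start_method()
print(platform.system(), "Python %d.%d" % sys.version_info[:2])
print("default start method:", method)
print("main guard needed:", needs_main_guard(method))
```

If this reports spawn, a script that trains with num_workers of 1 or more would be expected to need the guard, which matches the behavior reported in this thread.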

Thanks for reporting the behavior.

ofsoundmind commented 3 months ago

Thanks for looking into the issue.

In the script I have used num_workers = 0 to allow the code to run without any issues. But changing the number of workers to 1 or above has been causing the errors for me.

I’ll have a look at the PyTorch forum you posted. Thanks again.

sammlapp commented 1 month ago

Since we haven't seen any response from PyTorch, we will consider it a best practice to use the if __name__ == "__main__": block for all scripts, whether on Windows or Mac.