intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Training Interrupted in Multi-Processes Training #56

Open charlieJ107 opened 2 years ago

charlieJ107 commented 2 years ago

The steps to reproduce this issue are as follows:

  1. Prepare the environment following the instructions in the README.
  2. source bigdl-nano-init
  3. Start training with ipex and multi-processes: /root/anaconda3/envs/ipex1.9/bin/python /data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py --data_path=/data/kitti_datasets/ --use_ipex --num_processes=4

After several epochs, the training process will be interrupted suddenly. The error message is as follows:

Epoch 35: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:47<00:00, 26.96s/it, loss=1.03, v_num=81]
Traceback (most recent call last):
  File "/data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py", line 330, in <module>
    main(hparams)
  File "/data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py", line 314, in main
    trainer.fit(model)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/bigdl/nano/pytorch/plugins/ddp_spawn.py", line 129, in start_training
    start_processes_new(self.new_process, **self.mp_spawn_kwargs)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/bigdl/nano/pytorch/plugins/ddp_spawn.py", line 87, in start_processes_new
    while not context.join():
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
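The key detail in the traceback is `terminated with signal SIGKILL`: the worker did not crash on its own, the OS killed it (commonly the OOM killer when the processes exhaust memory). A minimal stdlib sketch, not using the project's actual training code, showing how a killed child surfaces as a negative exit code, which is exactly what torch's `spawn.join()` converts into a `ProcessExitedException`:

```python
import multiprocessing as mp
import os
import signal


def worker():
    # Simulate the OS killing this worker (as the OOM killer would);
    # SIGKILL cannot be caught or handled by the process itself.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.join()
    # A child killed by signal N reports exitcode -N; torch's
    # multiprocessing.spawn raises ProcessExitedException on a
    # nonzero exitcode like this one.
    print(p.exitcode)  # -9 on Linux (SIGKILL)
```

If this is the cause here, checking `dmesg` for oom-killer entries on the training machine after the interruption would confirm it.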
jason-dai commented 2 years ago

What if ipex is not used?

charlieJ107 commented 2 years ago

When ipex is not used, the problem does not appear; with ipex (whether 1.8.0 or 1.9.0), it occurs in multi-process training.

However, because this example does not set the max_epochs parameter, we cannot predict exactly when the SIGKILL will occur. In our tests so far, it only occurs in multi-process training with ipex, always within the first 150 epochs and usually after epoch 35.