Open charlieJ107 opened 2 years ago
What if ipex is not used?
The problem does not appear when ipex is not used. With ipex enabled, it occurs in multi-process training regardless of whether ipex 1.8.0 or 1.9.0 is used.
However, because this example does not set the max_epoch parameter, we cannot say exactly when the SIGKILL will appear. In our tests so far, the SIGKILL occurs only in multi-process training with ipex, within the first 150 epochs and usually after the 35th epoch.
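To confirm that a worker process was ended by SIGKILL rather than exiting on its own, one option is to inspect its exit code from the parent. The sketch below is not part of the semantic_segmentation.py example and assumes a POSIX system; `worker`, `describe_exit`, and `run_demo` are hypothetical names used only for illustration. It relies on the documented behaviour that `multiprocessing.Process.exitcode` is `-N` when the child was terminated by signal `N`.

```python
import multiprocessing as mp
import os
import signal
import time


def worker():
    # Hypothetical stand-in for a training worker; it only sleeps until killed.
    time.sleep(60)


def describe_exit(exitcode):
    """Translate a multiprocessing exit code into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative exit code -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-exitcode).name
    return "exited with code " + str(exitcode)


def run_demo():
    proc = mp.Process(target=worker)
    proc.start()
    # Simulate an external kill, such as the one reported in this issue.
    os.kill(proc.pid, signal.SIGKILL)
    proc.join()
    return describe_exit(proc.exitcode)


if __name__ == "__main__":
    print(run_demo())
```

A check like this only shows that a signal was delivered, not who sent it or why; that would still need to be determined from system logs.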
The steps to reproduce this issue are as follows:
source bigdl-nano-init
/root/anaconda3/envs/ipex1.9/bin/python /data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py --data_path=/data/kitti_datasets/ --use_ipex --num_processes=4
After several epochs, the training process is interrupted suddenly. The error message is as follows: