oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

torch Distributed Data Parallel with ccl backend fails with torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu but works with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu #100

Closed XinyuYe-Intel closed 8 months ago

XinyuYe-Intel commented 8 months ago

[screenshot: error traceback from the failing run]

I use the transformers Trainer to fine-tune an LLM with Distributed Data Parallel using the ccl backend. With torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu it fails as shown in the screenshot above, but with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu it works fine.

The script I used is https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/instruction/finetune_clm.py, and the command is:

```bash
mpirun --host 172.17.0.2,172.17.0.3 -n 2 -ppn 1 -genv OMP_NUM_THREADS=48 \
    python3 finetune_clm.py \
    --model_name_or_path mosaicml/mpt-7b-chat \
    --train_file alpaca_data.json \
    --bf16 False \
    --output_dir ./mpt_peft_finetuned_model \
    --num_train_epochs 1 \
    --max_steps 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft lora \
    --group_by_length True \
    --dataset_concatenation \
    --do_train \
    --trust_remote_code True \
    --tokenizer_name "EleutherAI/gpt-neox-20b" \
    --use_fast_tokenizer True \
    --max_eval_samples 64 \
    --no_cuda \
    --ddp_backend ccl
```
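To help narrow down whether the regression is in the ccl backend itself or in the Trainer integration, a minimal standalone reproducer along the following lines can be launched with the same mpirun hosts. This is only a sketch: it assumes the backend is registered by importing `oneccl_bindings_for_pytorch` (the import name for oneccl-bind-pt 2.x) and that mpirun exports `PMI_RANK`/`PMI_SIZE`; the master address and port below are placeholders.

```python
import os

import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend

# Intel MPI's mpirun exports PMI_RANK/PMI_SIZE; map them to the variables
# torch.distributed expects for env:// initialization.
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "172.17.0.2")  # first host from the mpirun command
os.environ.setdefault("MASTER_PORT", "29500")       # any free port

dist.init_process_group(backend="ccl")
rank = dist.get_rank()

# A single all_reduce is enough to check that the backend initializes and communicates.
t = torch.ones(4) * (rank + 1)
dist.all_reduce(t)
print(f"rank {rank}: {t}")

dist.destroy_process_group()
```

If this small script also fails on torch 2.1.0+cpu / oneccl-bind-pt 2.1.0+cpu but passes on the 2.0.x pair, the problem is likely in the bindings rather than in the Trainer.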

Can you help investigate this issue?