Open oldmikeyang opened 4 months ago
According to our previous benchmark experiments, export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=0
performs better on Xeon CPU + multi-GPU setups. The default value of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
is already 0, so I don't think the scripts need to be modified.
python/llm/example/GPU/Deepspeed-AutoTP/run_qwen_14b_arc_2_card.sh
python/llm/example/GPU/Deepspeed-AutoTP/run_vicuna_33b_arc_2_card.sh
python/llm/dev/benchmark/all-in-one/run-deepspeed-arc.sh
Currently, the following code enables SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS only on Intel Core CPUs. On Intel Xeon CPUs, SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS also needs to be enabled to improve performance.
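For illustration only, here is a minimal sketch of how a run script might extend the check to Xeon. It is not the code from the scripts listed in this issue. The helper name immediate_cmdlists_for_cpu is hypothetical, and the sketch assumes that "enable" means setting the variable to 1 and that the CPU family can be read from the "model name" line of /proc/cpuinfo on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the repository scripts): maps a CPU
# model string to a value for SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS.
# Assumption: "enable" means setting the variable to 1.
immediate_cmdlists_for_cpu() {
    local model="$1"
    case "$model" in
        *Core*|*Xeon*) echo 1 ;;  # Core (current behavior per the issue) plus Xeon (requested)
        *) ;;                     # Any other CPU: print nothing, leave the default alone
    esac
}

# Read the first "model name" line from /proc/cpuinfo (Linux only).
cpu_model="$(grep -m1 'model name' /proc/cpuinfo 2>/dev/null || true)"

value="$(immediate_cmdlists_for_cpu "$cpu_model")"
if [ -n "$value" ]; then
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS="$value"
fi
```

Since the reply in this thread reports that a value of 0 performed better on Xeon CPU + multi-GPU in earlier benchmarks, it would be worth benchmarking both values on the target machine before changing the scripts.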