About multiple-thread attention computation on CPU using zero-inference example.

Hi,

I am trying to test the attention computation on the CPU with zero-interference.

I use the following command to run the script.

BSZ=96
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p  $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} --batch-size ${BSZ} --cpu-offload --pin-memory 1 --offload-dir /tmp/data/dxu --gen-len 32 --pin-memory 1 --kv-offload --async_kv_offload

Duing the exection, I only saw one core is used. Screenshot 2024-04-04 at 3 11 39 PM

However, there are many processes are created for the run.

Are there any other parameters or env I should configure to enable multiple-cores?

microsoft / DeepSpeedExamples

About multiple-thread attention computation on CPU using zero-inference example. #886