[Env] Add XFT_ENGINE env variable.

There are two ways to map GPU cards to MPI processes.

Method 1) Assign sequentially (XFT_ENGINE=GPU): MPI process 0 maps to GPU card 0, MPI process 1 maps to GPU card 1, and so on.

XFT_ENGINE=GPU XFT_PIPELINE_STAGE=1 OMP_NUM_THREADS=12 ENABLE_CAT_MLP=1 mpirun \
    -n 1 numactl --all -C 48-59 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 numactl --all -C 60-71 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 numactl --all -C 72-83 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 numactl --all -C 84-95 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16
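
The sequential rule amounts to "card index = MPI rank". The sketch below illustrates that rule, and also accepts the explicit GPU:<N> form described under Method 2. It only illustrates the semantics described here and is not the xFasterTransformer implementation; the helper name resolve_gpu_index and its error messages are hypothetical.

```shell
# Hypothetical helper, not part of xFasterTransformer.
# Usage: resolve_gpu_index <XFT_ENGINE value> <MPI rank>
resolve_gpu_index() {
    engine="$1"
    rank="$2"
    case "$engine" in
        GPU)
            # Method 1: sequential assignment, card index equals the MPI rank.
            idx="$rank"
            ;;
        GPU:*)
            # Method 2: the user names the card explicitly after the colon.
            idx="${engine#GPU:}"
            ;;
        *)
            echo "unsupported XFT_ENGINE value: $engine" >&2
            return 1
            ;;
    esac
    # Reject empty or non-numeric indices such as "GPU:" or "GPU:x".
    case "$idx" in
        ''|*[!0-9]*)
            echo "invalid GPU index: $idx" >&2
            return 1
            ;;
    esac
    echo "$idx"
}
```

For example, `resolve_gpu_index GPU 2` prints 2 (rank 2 takes card 2), while `resolve_gpu_index GPU:3 0` prints 3 regardless of the rank.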

Method 2) Assign explicitly (-env XFT_ENGINE=GPU:<N>): pass a per-process value so that each MPI process uses GPU card <N>.

XFT_PIPELINE_STAGE=1 OMP_NUM_THREADS=12 ENABLE_CAT_MLP=1 mpirun \
    -n 1 -env XFT_ENGINE=GPU:0 numactl --all -C 48-59 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 -env XFT_ENGINE=GPU:1 numactl --all -C 60-71 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 -env XFT_ENGINE=GPU:2 numactl --all -C 72-83 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16 : \
    -n 1 -env XFT_ENGINE=GPU:3 numactl --all -C 84-95 -m 1 ./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16
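
Because the Method 2 command repeats the same arguments for every rank, it can be generated in a loop. The following is a minimal sketch that only prints the command line; the function name build_xft_cmd and its three parameters are hypothetical, and the model paths and flags are copied from the example above.

```shell
# Hypothetical helper: print an mpirun command line in the Method 2 form.
# Usage: build_xft_cmd <ranks> <first_core> <cores_per_rank>
build_xft_cmd() {
    ranks="$1"        # number of MPI processes, one GPU card each
    first_core="$2"   # first CPU core used by rank 0
    cores="$3"        # CPU cores bound to each rank
    app='./example --model /data/qwen-1_8b-xft/ --token /data/qwen-1_8b-hf/tokenizer_config.json --dtype fp16 --loop 1 --input_len 16 --output_len 16'
    cmd='mpirun'
    i=0
    while [ "$i" -lt "$ranks" ]; do
        lo=$((first_core + i * cores))
        hi=$((lo + cores - 1))
        # mpirun separates per-rank argument sets with a colon.
        if [ "$i" -gt 0 ]; then
            cmd="$cmd :"
        fi
        # Pin rank i to cores lo-hi and to GPU card i.
        cmd="$cmd -n 1 -env XFT_ENGINE=GPU:$i numactl --all -C $lo-$hi -m 1 $app"
        i=$((i + 1))
    done
    printf '%s\n' "$cmd"
}
```

Calling `build_xft_cmd 4 48 12` prints a single-line equivalent of the command above (core ranges 48-59 through 84-95, cards 0 through 3). Inspect the output before running it with the same XFT_PIPELINE_STAGE, OMP_NUM_THREADS, and ENABLE_CAT_MLP settings.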

[Env] Add XFT_ENGINE env variable. #231