hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Llama3.2 3B is extremely slow #5598

Closed · dayuyang1999 closed this issue 1 month ago

dayuyang1999 commented 1 month ago

Reminder

System Info

- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-5.19.0-0_fbk12_zion_11583_g0bef9520ca2b-x86_64-with-glibc2.34
- Python version: 3.12.5
- PyTorch version: 2.4.1+cu121 (GPU)
- Transformers version: 4.44.2
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA H100
- DeepSpeed version: 0.15.1
- Bitsandbytes version: 0.43.3

Reproduction

Command line:

CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_full/my_config.yaml

my_config.yaml

### model
model_name_or_path: /data/users/dayuyang/dotsync-home/saved_models/llama3b

### method
stage: sft
do_predict: true
finetuning_type: full

### dataset
eval_dataset: my_data
template: llama3
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/CRS/prompting/3b/result/
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true

Expected behavior

I compared the running speed of Gemma2 27B and Llama3.2 3B on the same dataset.

Gemma2 27B is far faster: about 1 s per sample, while the 3B model takes about 15 s per sample.

I have also tried Gemma2 2B, Llama3.1 8B, and others with the same config (only the model differs), and their speeds were all normal.

The configuration is identical, so in theory shouldn't the 3B model be much faster? I am completely puzzled. Could there be a bug in the new Llama 3.2 support?
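One way to sanity-check this is to cap the generation length, since per-sample latency is dominated by the number of generated tokens. A minimal sketch of the change, assuming `max_new_tokens` is accepted among the generating arguments as in the repo's example predict configs:

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
max_new_tokens: 512  # cap output length for diagnosis

If the 3B run speeds up dramatically with this cap, the model is not emitting a stop token and is generating up to the length limit on every sample.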

(Three screenshots attached, captured 2024-10-01.)

Others

No response

hiyouga commented 1 month ago

Use the Instruct model.
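This suggests the slow run loaded the base Llama-3.2-3B checkpoint: with the llama3 chat template, a base model typically never emits the <|eot_id|> stop token, so every sample generates until the length limit, which would explain the ~15 s per sample. A minimal sketch of the config change (the Hub ID below is the official Instruct checkpoint; a local copy of the Instruct weights would work the same way):

### model
model_name_or_path: meta-llama/Llama-3.2-3B-Instruct  # use the Instruct weights, not the base model

With the Instruct model, generation stops at <|eot_id|> and per-sample latency should drop to the expected level.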