hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Llama3.2 3B is extremely slow #5598

Closed · dayuyang1999 closed this issue 1 month ago

dayuyang1999 commented 1 month ago

Reminder

System Info

- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-5.19.0-0_fbk12_zion_11583_g0bef9520ca2b-x86_64-with-glibc2.34
- Python version: 3.12.5
- PyTorch version: 2.4.1+cu121 (GPU)
- Transformers version: 4.44.2
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA H100
- DeepSpeed version: 0.15.1
- Bitsandbytes version: 0.43.3

Reproduction

Command line:

CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/train_full/my_config.yaml

my_config.yaml

### model
model_name_or_path: /data/users/dayuyang/dotsync-home/saved_models/llama3b

### method
stage: sft
do_predict: true
finetuning_type: full

### dataset
eval_dataset: my_data
template: llama3
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/CRS/prompting/3b/result/
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true

Expected behavior

I compared the running speed of Gemma2 27B and Llama3.2 3B on the same dataset.

Gemma2 27B is far faster: about 1 s per sample, while the 3B model takes about 15 s per sample.

I have also tried Gemma2 2B, Llama3.1 8B, and others with the same config (only the model differs), and their speeds were all normal.

The configuration is identical, so in theory shouldn't the 3B model be much faster? I am completely puzzled. Could there be a bug in the new Llama 3.2 support?
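One way to sanity-check this is to cap the generation length, since per-sample latency is dominated by the number of generated tokens. A minimal sketch of the change, assuming `max_new_tokens` is accepted among the generating arguments as in the repo's example predict configs:

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
max_new_tokens: 512  # cap output length for diagnosis

If the 3B run speeds up dramatically with this cap, the model is not emitting a stop token and is generating up to the length limit on every sample.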

(Three screenshots attached, captured 2024-10-01.)

Others

No response

hiyouga commented 1 month ago

Use the Instruct model.
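This suggests the slow run loaded the base Llama-3.2-3B checkpoint: with the llama3 chat template, a base model typically never emits the <|eot_id|> stop token, so every sample generates until the length limit, which would explain the ~15 s per sample. A minimal sketch of the config change (the Hub ID below is the official Instruct checkpoint; a local copy of the Instruct weights would work the same way):

### model
model_name_or_path: meta-llama/Llama-3.2-3B-Instruct  # use the Instruct weights, not the base model

With the Instruct model, generation stops at <|eot_id|> and per-sample latency should drop to the expected level.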