Running tokenizer on dataset gradually slows down - Githubissues

hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)

https://arxiv.org/abs/2403.13372

Apache License 2.0

33.55k stars 4.11k forks

Running tokenizer on dataset gradually slows down #5443

Open xuyue1112 opened 1 month ago

xuyue1112 commented 1 month ago

Reminder

[X] I have read the README and searched the existing issues.

System Info

llamafactory version: 0.9.1.dev0
Platform: Linux-5.15.120.bsk.2-amd64-x86_64-with-glibc2.31
Python version: 3.11.2
PyTorch version: 2.4.0 (GPU)
Transformers version: 4.45.0.dev0
Datasets version: 2.21.0
Accelerate version: 0.34.2
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA A800-SXM4-40GB

Reproduction

dataset

dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16

Expected behavior

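During training, the speed of the "Running tokenizer on dataset" step gradually drops from a few hundred samples/s to single digits. Could you advise what might be causing this?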
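One way to narrow this down is to time the preprocessing in fixed-size chunks and record the mean input length next to each chunk's throughput. The following is a minimal sketch, not LLaMA-Factory code: `chunked_throughput` and its `process` argument are hypothetical names, and `process` stands in for whatever per-sample tokenization or preprocessing function is being timed.

```python
import time
from typing import Callable, List, Sequence, Tuple


def chunked_throughput(
    samples: Sequence[str],
    process: Callable[[str], object],
    chunk_size: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[int, float, float]]:
    """Time `process` over consecutive chunks of `samples`.

    Returns one (start_index, samples_per_second, mean_input_chars)
    tuple per chunk, in order. `process` is a hypothetical stand-in
    for the real per-sample preprocessing function.
    """
    report: List[Tuple[int, float, float]] = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        t0 = clock()
        for text in chunk:
            process(text)
        elapsed = clock() - t0
        # Guard against a zero reading from a coarse clock.
        rate = len(chunk) / elapsed if elapsed > 0 else float("inf")
        mean_chars = sum(len(text) for text in chunk) / len(chunk)
        report.append((start, rate, mean_chars))
    return report


if __name__ == "__main__":
    # Toy data: the second half is much longer than the first half.
    data = ["a b c " * n for n in (1, 1, 500, 500)]
    for start, rate, mean_chars in chunked_throughput(data, str.split, chunk_size=2):
        print(f"start={start} rate={rate:.0f}/s mean_chars={mean_chars:.0f}")
```

If throughput falls while the mean input length stays roughly flat, longer samples late in the dataset are unlikely to be the cause, and the preprocessing logic itself is the more likely place to look.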

Others

None

AlongWY commented 1 month ago

Based on my own testing, #5458 should have fixed this issue.

Wiselnn570 commented 6 days ago

@AlongWY I ran into the same problem, but your fix seems to target the packing case. What should be changed if packing is not used?

AlongWY commented 5 days ago

Does it also drop to single digits without packing? In principle it shouldn't.