Closed: zuishusheng closed this issue 2 months ago
I just ran into the same problem. The "Running tokenizer on dataset" step takes too long, so the multi-node communication waits past the timeout (the ddp_timeout parameter) and the job finally dies. The only options seem to be increasing preprocessing_num_workers or preprocessing the data ahead of time; from the issues I've read, that looks like all you can do.
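For reference, the ddp_timeout mentioned above is the distributed collective timeout from transformers' TrainingArguments; below is a minimal sketch of raising it in plain transformers (the value is illustrative, and whether a longer timeout actually helps is exactly what this thread is debating):

from transformers import TrainingArguments

# ddp_timeout is the timeout (in seconds) for distributed collectives; the default is 1800.
# Raising it gives rank 0 more time to finish "Running tokenizer on dataset"
# before the other ranks give up waiting at the barrier.
args = TrainingArguments(
    output_dir="out",   # hypothetical output directory
    ddp_timeout=18000,  # illustrative: 5 hours instead of the default 30 minutes
)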
I used 4 H100 nodes with preprocessing_num_workers set to 512 and still hit this problem. After shrinking the images it works, so the issue seems to be in tokenizing the very long sequences.
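If shrinking the images is what makes it pass, one workaround is to pre-resize them on disk before building the dataset. A minimal Pillow sketch, with hypothetical paths and an arbitrary target size:

from pathlib import Path
from PIL import Image

src_dir = Path("images")          # hypothetical source directory
dst_dir = Path("images_resized")  # hypothetical output directory
dst_dir.mkdir(exist_ok=True)

for path in src_dir.glob("*.jpg"):
    img = Image.open(path)
    img.thumbnail((1024, 1024))   # shrinks in place, keeping the aspect ratio
    img.save(dst_dir / path.name)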
Your images may simply be too large. I trained LLaVA on 8 V100s with about 20k image-text pairs and fairly small images, and even with preprocessing_num_workers set to 128 it was slow; your scale is probably much larger than mine. Note that the tokenizer multiprocessing only uses the CPU, not the GPU. Another option is to set the dataset loading mode to streaming: true, but in practice GPU utilization stays low that way, because the run is basically always loading the current iteration's data. Maybe look at how libraries like xtuner handle this.
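For context, streaming: true maps onto the datasets library's streaming mode, where preprocessing is applied lazily per example instead of in one big up-front pass. A rough sketch of the underlying behaviour (the file name and tokenize_fn are placeholders, not LLaMA-Factory code):

from datasets import load_dataset

# streaming=True returns an IterableDataset: nothing is tokenized up front,
# so the long "Running tokenizer on dataset" phase disappears.
stream = load_dataset("json", data_files="data.json", split="train", streaming=True)

def tokenize_fn(example):
    # placeholder: the real function would build input_ids, labels, pixel values, ...
    return example

# map() on an IterableDataset is lazy; samples are processed only as the dataloader
# pulls them, which is why GPU utilization can drop when CPU-side preprocessing is slow.
stream = stream.map(tokenize_fn)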
I got this problem when I tried to increase max_samples. Any idea how to solve it?
Debugging this further, I found that during multi-process data preprocessing one of the worker processes dies, which makes the map operation time out and fail.
Setting preprocessing_num_workers too high easily exhausts memory and locks up the whole machine, leaving a reboot as the only option. Increasing the processes' timeout does not solve the problem.
+1 same issue.
Try removing the preprocessing_num_workers argument.
@hiyouga I already removed preprocessing_num_workers in the YAML file, and that works. But when I try increasing max_samples, the RAM overflows. I'm using qwen2vl_lora_sft.yaml.
^ Same for me. I tried to make it save to disk from time to time (in Arrow files), but then I realised that even if I do that, the whole dataset would still be expected to sit in RAM for training - is that really needed/necessary?
I'm also trying to fine-tune Qwen2-VL 7B.
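On the question of whether the whole tokenized dataset has to sit in RAM: once it is written out as Arrow files, the datasets library loads it memory-mapped rather than into memory, so only the rows actually read get paged in. A small sketch (the path is hypothetical):

from datasets import load_from_disk

# Arrow-backed datasets are memory-mapped by default (keep_in_memory=False),
# so loading a large tokenized dataset should not by itself blow up RAM.
tokenized = load_from_disk("tokenized_dataset", keep_in_memory=False)
print(tokenized)  # rows are read lazily from disk as they are accessed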
@Mihaiii Do you have any solution for the problem?
Apparently there's a streaming param (https://huggingface.co/docs/datasets/v2.21.0/stream) made for this use case (i.e., tokenize the dataset in chunks, save the chunks to disk as Arrow files, then load them back in streaming mode), but I erased my temp disk with the training data and gave up on my fine-tuning project for the moment, so I can't try it.
@huynhbaobk so I would first try to save in chunks (example generated by ChatGPT):
from datasets import Dataset
import os

# Assume 'raw_datasets' is your original dataset and 'tokenizer' is an already-loaded tokenizer
# Directory to save the tokenized dataset in chunks
output_dir = "tokenized_dataset"

# Create directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

# Process and save the dataset in chunks of 1000 samples
for i in range(0, len(raw_datasets), 1000):
    # Slice the dataset into a chunk of at most 1000 samples
    chunk = raw_datasets.select(range(i, min(i + 1000, len(raw_datasets))))
    # Tokenize the chunk
    tokenized_chunk = chunk.map(tokenize_function, batched=True)
    # Save the tokenized chunk to disk (save_to_disk writes a directory per chunk)
    tokenized_chunk.save_to_disk(os.path.join(output_dir, f"chunk_{i//1000}.arrow"))
Instead of this line (but, of course, keep the old map params): https://github.com/hiyouga/LLaMA-Factory/blob/c87023d539875cd8e622d40212a5627c9c182fb8/src/llamafactory/data/loader.py#L183
And then load the train dataset and eval dataset from disk with stream=True.
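To reuse those chunks afterwards, one option (a sketch, assuming the directory layout produced by the snippet above) is to load each chunk with load_from_disk and concatenate them:

import os
from datasets import load_from_disk, concatenate_datasets

output_dir = "tokenized_dataset"
# Each "chunk_*.arrow" entry is actually a dataset directory written by save_to_disk;
# sorted() gives lexicographic order, which is fine for illustration purposes.
chunks = [
    load_from_disk(os.path.join(output_dir, name))
    for name in sorted(os.listdir(output_dir))
]
train_dataset = concatenate_datasets(chunks)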
@Mihaiii LlamaFactory also supports streaming when you specify streaming: true
There may be a problem here: when streaming mode is used, the dataset type is 'IterableDataset', and the 'map' function has trouble resolving the sharegpt-format JSON file in data/aligner.py.
I solved the problem by setting streaming: true in the config YAML. Also, in aligner.py, the function align_dataset returns with remove_columns = column_names, which removes the images field from the final dataset. So I tried to keep the images column by adding, around line 210:
column_names = list(next(iter(dataset)).keys())
column_names.remove("images")
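An equivalent way to express the same intent without mutating column_names in place is to filter the list passed to remove_columns. A self-contained toy sketch (the dataset and convert_fn stand in for the real objects in aligner.py):

from datasets import Dataset

# Toy stand-in for the aligned sharegpt dataset
dataset = Dataset.from_dict({
    "messages": [["hi"], ["hello"]],
    "images": [["a.jpg"], ["b.jpg"]],
})

def convert_fn(example):
    # placeholder for the real alignment/conversion step
    return {"prompt": example["messages"]}

column_names = list(next(iter(dataset)).keys())
# Drop every original column except "images", so it survives into the final dataset
dataset = dataset.map(
    convert_fn,
    remove_columns=[c for c in column_names if c != "images"],
)
print(dataset.column_names)  # the surviving columns include 'images' plus the new 'prompt'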
When I keep the images column via column_names like that, data processing can continue, but it still times out, and it seems all the data gets tokenized before training starts. Do you have any idea?
I have the same problem as you and still don't know how to fix it. I can run the model with a small sample of around 1,500 examples, but it takes a lot of RAM to load the dataset. If I use streaming, it still gets stuck with this error:
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 318, in forward
q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 204, in apply_rotary_pos_emb_vision
output = (tensor * cos) + (rotate_half(tensor) * sin)
RuntimeError: The size of tensor a (2) must match the size of tensor b (1512) at non-singleton dimension 1
@hiyouga could you help us?
fixed in #5346
Reminder
System Info
Reproduction
Dataset: about 30k image-text pairs in sharegpt format, with 2048*2048 images. With preprocessing_num_workers set to 256 (also tried 128, 64, etc.), the run always stalls at "Running tokenizer on dataset", and after a long wait it fails with "One of the subprocesses has abruptly died during map operation".
Expected behavior
No response
Others
No response