hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Running tokenizer on dataset hangs, then "One of the subprocesses has abruptly died during map operation" #5308

Closed · zuishusheng closed this 2 months ago

zuishusheng commented 2 months ago

Reminder

System Info

[screenshot: Snipaste_2024-08-30_01-20-17]

Reproduction

The dataset is about 30k image-text pairs in sharegpt format, with 2048×2048 images. With preprocessing_num_workers set to 256 (or 128, 64, etc.), the job always stalls at "Running tokenizer on dataset", and after a long wait it fails with "One of the subprocesses has abruptly died during map operation".

Expected behavior

No response

Others

No response

wwwbq commented 2 months ago

I ran into the same problem. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication wait (the ddp_timeout parameter) eventually expires and the job dies. The only options are probably to increase preprocessing_num_workers or to preprocess the data in advance; after reading a few related issues, that seems to be the conclusion.
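If the root cause really is the other ranks timing out while one rank tokenizes, one stopgap is to raise the DDP timeout. A minimal sketch, assuming a recent transformers where TrainingArguments exposes ddp_timeout (the same key mentioned above):

# Sketch only: ddp_timeout (seconds, default 1800) is forwarded to
# torch.distributed.init_process_group, so a larger value lets the other ranks
# keep waiting while rank 0 finishes "Running tokenizer on dataset".
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    ddp_timeout=18000,  # e.g. 5 hours instead of the 30-minute default
)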

zuishusheng commented 2 months ago

I ran into the same problem. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication wait (the ddp_timeout parameter) eventually expires and the job dies. The only options are probably to increase preprocessing_num_workers or to preprocess the data in advance; after reading a few related issues, that seems to be the conclusion.

I used 4 H100 nodes with preprocessing_num_workers set to 512 and still hit this problem. After shrinking the images it works, so the failure seems to happen when tokenizing very long sequences.

wwwbq commented 2 months ago

I ran into the same problem. It seems "Running tokenizer on dataset" takes too long, so the multi-node communication wait (the ddp_timeout parameter) eventually expires and the job dies. The only options are probably to increase preprocessing_num_workers or to preprocess the data in advance; after reading a few related issues, that seems to be the conclusion.

I used 4 H100 nodes with preprocessing_num_workers set to 512 and still hit this problem. After shrinking the images it works, so the failure seems to happen when tokenizing very long sequences.

Your images are probably too large. I trained LLaVA on 8 V100s with 20k+ image-text pairs and fairly small images, and even with preprocessing_num_workers set to 128 it was slow; your scale is much bigger than mine. Note that the tokenizer's multiprocessing only uses the CPU, not the GPU. Another option is to set the dataset loading mode to streaming: true, but in practice GPU utilization stays low that way because each iteration is mostly spent loading the current batch of data. Maybe the only thing left is to look at how libraries like xtuner handle this.
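Since shrinking the images helped above, another practical workaround is to downscale the raw images offline before preprocessing. A rough sketch, where the image directory and target size are placeholders you would adapt:

# Sketch only: resize oversized images in place so the processor never sees
# 2048x2048 inputs. The path and MAX_SIDE value below are assumptions.
from pathlib import Path
from PIL import Image

MAX_SIDE = 768  # hypothetical target; pick what your vision encoder expects

for path in Path("data/images").glob("*.jpg"):
    img = Image.open(path)
    if max(img.size) > MAX_SIDE:
        img.thumbnail((MAX_SIDE, MAX_SIDE))  # resizes in place, keeps aspect ratio
        img.save(path)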

huynhbaobk commented 2 months ago

I got the problem when I tried to increase max_samples. Any idea how to solve this?

zuishusheng commented 2 months ago

Debugging this further: during multiprocess data preprocessing, one of the worker processes dies, which makes the map operation time out and fail. [screenshot: Snipaste_2024-08-31_04-33-48]

Setting preprocessing_num_workers too high easily exhausts the machine's memory and hangs the whole machine, which then needs a reboot. Increasing the process timeout does not solve the problem.
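Because each map worker holds decoded samples in memory, sizing the worker count from available RAM instead of CPU count may avoid these OOM-induced worker deaths. A rough heuristic sketch, not from the repo; the per-worker budget is a guess to tune:

# Sketch only: derive preprocessing_num_workers from free RAM.
# PER_WORKER_GB is an assumed memory budget per map() worker, not a measured value.
import os
import psutil

PER_WORKER_GB = 4.0

def pick_num_workers() -> int:
    avail_gb = psutil.virtual_memory().available / 1024 ** 3
    return max(1, min(os.cpu_count() or 1, int(avail_gb // PER_WORKER_GB)))

print(pick_num_workers())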

Mihaiii commented 2 months ago

+1 same issue.

hiyouga commented 2 months ago

try removing the preprocessing_num_workers argument

huynhbaobk commented 2 months ago

@hiyouga I already removed preprocessing_num_workers from the YAML file, and that works. But when I try increasing max_samples, the RAM overflows. I'm using qwen2vl_lora_sft.yaml.

Mihaiii commented 2 months ago

^ Same for me. I tried to make it save to disk from time to time (in Arrow files), but then I realised that even if I did that, the whole dataset would still be expected to sit in RAM for training - is that really needed/necessary?

I'm also trying to finetune Qwen2-VL 7B.

huynhbaobk commented 2 months ago

@Mihaiii Do you have any solution for the problem?

Mihaiii commented 2 months ago

@Mihaiii Do you have any solution for the problem?

Apparently there's a streaming param (https://huggingface.co/docs/datasets/v2.21.0/stream) that is made for this use case (i.e. tokenizing the dataset in chunks, saving it to disk as Arrow files, and then loading it lazily), but I erased my temp disk with the training data and gave up on my fine-tune project for the moment, so I can't try it.
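For reference, a minimal sketch of what that streaming mode looks like at the datasets level; the file name is a placeholder and the map call is a no-op stand-in for real preprocessing:

# Sketch only: with streaming=True, load_dataset returns an IterableDataset and
# map() is applied lazily while iterating, so the full dataset never sits in RAM.
from datasets import load_dataset

ds = load_dataset("json", data_files="sharegpt_data.json", split="train", streaming=True)
ds = ds.map(lambda batch: batch, batched=True)  # placeholder for real tokenization
for example in ds.take(2):
    print(list(example.keys()))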

Mihaiii commented 2 months ago

@huynhbaobk so I would first try to save in chunks (example generated by ChatGPT):

import os

# Assume 'raw_datasets' is your original Dataset and 'tokenizer' is already loaded

# Directory to save the tokenized dataset in chunks
output_dir = "tokenized_dataset"

# Create directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

# Process and save the dataset in chunks of 1000 samples
for i in range(0, len(raw_datasets), 1000):
    # Slice the dataset into a chunk of up to 1000 samples
    chunk = raw_datasets.select(range(i, min(i + 1000, len(raw_datasets))))

    # Tokenize the chunk
    tokenized_chunk = chunk.map(tokenize_function, batched=True)

    # Save the tokenized chunk to disk (save_to_disk writes a directory of Arrow files)
    tokenized_chunk.save_to_disk(os.path.join(output_dir, f"chunk_{i // 1000}"))

Instead of this line (but, of course, keep the old map params): https://github.com/hiyouga/LLaMA-Factory/blob/c87023d539875cd8e622d40212a5627c9c182fb8/src/llamafactory/data/loader.py#L183

And then load the train dataset and eval dataset back from disk with streaming enabled instead of keeping everything in RAM.
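One possible way to do that loading step, as a sketch; it assumes the tokenized_dataset/chunk_* directories written above and a datasets version that provides Dataset.to_iterable_dataset (2.11+):

# Sketch only: reload the saved chunks (memory-mapped Arrow files) and expose
# them as a lazy IterableDataset so the tokenized data never fully sits in RAM.
import glob
from datasets import concatenate_datasets, load_from_disk

chunk_dirs = sorted(glob.glob("tokenized_dataset/chunk_*"))
full = concatenate_datasets([load_from_disk(d) for d in chunk_dirs])
train_dataset = full.to_iterable_dataset(num_shards=8)  # shard count is arbitrary here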

hiyouga commented 2 months ago

@Mihaiii LlamaFactory also supports streaming when you specify streaming: true

zuishusheng commented 2 months ago

@Mihaiii LlamaFactory also supports streaming when you specify streaming: true

There may still be a problem: when streaming mode is used, the dataset type is 'IterableDataset', and the 'map' function then fails to handle the sharegpt-format JSON file (see data/aligner.py). [screenshot]

huynhbaobk commented 2 months ago

I solved the problem by setting streaming: true in the config YAML. Also, in aligner.py the function align_dataset passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by changing line 210 to:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # remove() mutates the list in place and returns None

zuishusheng commented 2 months ago

I solved the problem by setting streaming: true in the config YAML. Also, in aligner.py the function align_dataset passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by changing line 210 to:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # remove() mutates the list in place and returns None

When I set column_names that way, data processing can continue, but it still times out, and it seems that all the data has to be tokenized before training starts. [screenshot: Snipaste_2024-09-02_16-17-25] Do you have any idea?

huynhbaobk commented 2 months ago

I solved the problem by setting streaming: true in the config YAML. Also, in aligner.py the function align_dataset passes remove_columns=column_names, which removes the images field from the final dataset. So I tried to keep the images column by changing line 210 to:

column_names = list(next(iter(dataset)).keys())
column_names.remove("images")  # remove() mutates the list in place and returns None

When I set column_names that way, data processing can continue, but it still times out, and it seems that all the data has to be tokenized before training starts. [screenshot: Snipaste_2024-09-02_16-17-25] Do you have any idea?

I have the same problem as you and still don't know how to fix it. I can run the model with a small sample of around 1500 examples, but it takes a lot of RAM to load the dataset. If I use streaming, it still gets stuck with this error:

File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 318, in forward
    q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 204, in apply_rotary_pos_emb_vision
    output = (tensor * cos) + (rotate_half(tensor) * sin)
RuntimeError: The size of tensor a (2) must match the size of tensor b (1512) at non-singleton dimension 1

@hiyouga could you help us?

hiyouga commented 2 months ago

fixed in #5346