Here are my package versions:
torch 2.4.0
tqdm 4.66.4
tokenizers 0.19.1
trl 0.9.6.dev0
transformers 4.41.2
python 3.10.14
I found that if a model processor is used inside the function passed to map, setting dataset_num_proc > 1 causes the program to hang.
see the fix in #1914
@kashif thank you for your reply! Can you tell me why this happens and how to fix it? Thanks!
it should be fixed now... can you kindly test?
Should I update to the newest version?
yes
I updated the trl package to the latest version and the problem still occurs. It may have nothing to do with DPOTrainer: the program gets stuck whenever the function passed to map preprocesses the data with the processor.
just like this:
import os

import torch
from datasets import load_dataset
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model_name = "./model_para/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = LlavaProcessor.from_pretrained(model_name)

def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]
    # the processor call itself is enough to trigger the hang; its output is not used below
    inputs = tokenizer(prompts, images=images)
    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected,
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
)
I don't know if this happens only with the llava processor; I haven't tested it on any other model.
can you try:
model_name = "./model_para/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = LlavaProcessor.from_pretrained(model_name)

def preprocess_function(examples, tokenizer):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]
    inputs = tokenizer(prompts, images=images)
    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected,
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    # pass the processor explicitly so it is handed to each worker process
    fn_kwargs={"tokenizer": tokenizer},
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
)
Also note that some vision-language model processors do not support batched processing.
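If batching is the culprit, a minimal per-example variant of the same map call might look like the sketch below. This is only a sketch: preprocess_single is a hypothetical helper, not from the original snippets, and it assumes the processor (here still named tokenizer) and dataset defined above.

def preprocess_single(example, tokenizer):
    # handle one example per call; some processors only accept a single text/image pair
    image = example["image"].convert("RGB")
    prompt = f"<image>\nQuestion: {example['question']}\nAnswer:"
    inputs = tokenizer(prompt, images=image)  # same processor call as before, single input
    return {
        "images": image,
        "prompt": prompt,
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

train_dataset = dataset.map(
    preprocess_single,
    fn_kwargs={"tokenizer": tokenizer},
    batched=False,  # one example at a time instead of batches of 128
    num_proc=1,     # single process, to separate the batching question from the multiprocessing hang
)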
It didn't seem to work, but thanks to your reminder I found the real cause: the llava processor. Everything works fine when I replace it with the CLIP processor.
When I set just dataset_num_proc=2 in DPOTrainer, it seems to stall completely at the step where the trainer maps the dataset during initialization, even though my dataset only has two data points and my CPU and memory utilization don't seem to increase at all. My code:
The log is:
It doesn't seem to start the map no matter how long I wait.
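For what it's worth, a minimal sketch of the workaround I would try while this is investigated, assuming a trl 0.9.x-style API where DPOConfig exposes dataset_num_proc and DPOTrainer still accepts tokenizer= (the model, processor, and train_dataset are the ones from the snippets above; the output_dir is hypothetical):

from trl import DPOConfig, DPOTrainer

# keep the dataset mapping single-process so the processor is never called
# inside a forked worker, which is where the hang appears
training_args = DPOConfig(
    output_dir="./dpo-llava-output",  # hypothetical output directory
    dataset_num_proc=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # the LlavaProcessor loaded above
)
trainer.train()

This avoids the hang only by side-stepping multiprocessing; it is not the fix referenced in #1914.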