pretbc opened this issue 2 weeks ago
It looks like you're working on a multi-turn conversation dataset and hit an error about the number of image tags not matching the number of images. Let's break down the issue and see how to resolve it.
The error message indicates that the number of unique image tags does not match the number of images provided: you have 1 image tag but 2 images, and this discrepancy is what makes the assertion fail.
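For illustration, here is a minimal sketch of the failing condition using dummy images (the prompt text is made up; the final check mirrors the kind of assertion the processor performs):

    from PIL import Image

    # two images, but only one <|image_k|> tag in the prompt text
    images = [Image.new('RGB', (32, 32)), Image.new('RGB', (32, 32))]
    prompt = '<|image_1|>\nWhat action is shown in these frames?'

    num_tags = sum(f'<|image_{i + 1}|>' in prompt for i in range(len(images)))
    # 1 != 2, so this assertion fails, just like the one raised in your traceback
    assert num_tags == len(images), f'{num_tags} image tag(s) but {len(images)} image(s)'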
1. Check image tag generation: ensure that the number of image tags generated matches the number of images. In your _get_inputs method, image_tag_text should contain as many tags as there are images.
2. Verify input data: make sure that each conversation turn in your dataset has the correct number of images and corresponding tags (a pre-flight scan is sketched after this list).
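One easy way to trip this assertion is user text that already embeds <|image_k|> tags, which get counted on top of the ones _get_inputs prepends. This is a hedged sketch of such a scan: train.jsonl and the conversations/user field names are assumptions matching the dataset class further down.

    import json
    import re

    with open('train.jsonl') as f:  # hypothetical dataset file
        for line_no, line in enumerate(f, 1):
            example = json.loads(line)
            for turn_idx, turn in enumerate(example['conversations']):
                embedded = re.findall(r'<\|image_\d+\|>', turn['user'])
                if embedded:
                    print(f'line {line_no}, turn {turn_idx}: user text already has {embedded}')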
Here's a modified version of your _get_inputs method that ensures the number of image tags matches the number of images:
def _get_inputs(self, user_text, image_paths):
    images = [Image.open(self.image_dir / image_path) for image_path in image_paths]
    image_tag_text = ''.join([f'<|image_{i+1}|>' for i in range(len(images))])
    prompt_message = {'role': 'user', 'content': f'{image_tag_text}\n{user_text}'}
    prompt = self.processor.tokenizer.apply_chat_template(
        [prompt_message], tokenize=False, add_generation_prompt=True
    )
    inputs = self.processor(prompt, images, return_tensors='pt')
    return inputs
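As a quick check that the generated prefix always matches the image count (assuming user_text itself contains no tags):

    image_paths = ['frame_00.jpg', 'frame_01.jpg']  # hypothetical paths
    image_tag_text = ''.join(f'<|image_{i + 1}|>' for i in range(len(image_paths)))
    assert image_tag_text == '<|image_1|><|image_2|>'  # one tag per image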
To work with multi-turn conversations in your dataset, you need to ensure that each turn is processed correctly and that the inputs and outputs are aligned. Here's a general approach:
1. Dataset structure: each example should contain multiple conversation turns, with each turn holding the user input, the assistant response, and any associated images (an example line is shown after this list).
2. Processing each turn: for each turn, generate the appropriate input IDs and labels, then concatenate these across all turns in an example.
3. Handling images: ensure that the number of image tags matches the number of images for each turn.
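For instance, one line of the JSONL file could look like this (hypothetical content; the field names match the class below):

    example = {
        'conversations': [
            {
                'user': 'What action is shown in these frames?',
                'images': ['clip_0001/frame_00.jpg', 'clip_0001/frame_01.jpg'],
                'assistant': 'A person doing push-ups.',
            },
            {
                'user': 'How many repetitions are visible?',
                'images': [],  # later turns may carry no new images
                'assistant': 'Three full repetitions.',
            },
        ]
    }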
Here's a simplified version of your dataset class with the necessary adjustments:
import copy
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset

IGNORE_INDEX = -100  # label value the loss function ignores

class Phi3VDataset(Dataset):
    def __init__(self, jsonl_file: str, image_dir: str, processor):
        self.image_dir = Path(image_dir)
        with open(jsonl_file) as f:
            self.examples = [json.loads(line) for line in f]
        self.processor = processor

    def __len__(self):
        return len(self.examples)

    def shard(self, num_shards, shard_id):
        num_data = len(self.examples)
        sharded = copy.deepcopy(self)
        sharded.examples = [self.examples[i] for i in range(shard_id, num_data, num_shards)]
        return sharded

    def _get_inputs(self, user_text, image_paths):
        images = [Image.open(self.image_dir / image_path) for image_path in image_paths]
        image_tag_text = ''.join([f'<|image_{i+1}|>' for i in range(len(images))])
        prompt_message = {'role': 'user', 'content': f'{image_tag_text}\n{user_text}'}
        prompt = self.processor.tokenizer.apply_chat_template(
            [prompt_message], tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(prompt, images, return_tensors='pt')
        return inputs

    def __getitem__(self, idx):
        example = self.examples[idx]
        all_input_ids = []
        all_labels = []
        all_pixel_values = []
        all_image_sizes = []
        for turn in example['conversations']:
            inputs = self._get_inputs(turn['user'], turn['images'])
            prompt_input_ids = inputs['input_ids']
            assistant_text = turn['assistant']
            response = f'{assistant_text}<|end|>\n<|endoftext|>'
            # Do not add bos token to answer
            response_input_ids = self.processor.tokenizer(
                response, add_special_tokens=False, return_tensors='pt'
            )['input_ids']
            input_ids = torch.cat([prompt_input_ids, response_input_ids], dim=1).squeeze(0)
            # mask prompt tokens so the loss only covers the response
            labels = torch.cat(
                [torch.tensor([IGNORE_INDEX] * len(prompt_input_ids[0])),
                 response_input_ids.squeeze(0)],
                dim=0,
            )
            all_input_ids.append(input_ids)
            all_labels.append(labels)
            all_pixel_values.append(inputs['pixel_values'])
            all_image_sizes.append(inputs['image_sizes'])
        # assumed aggregation: concatenate all turns into single tensors
        return {'input_ids': torch.cat(all_input_ids, dim=0),
                'labels': torch.cat(all_labels, dim=0),
                'pixel_values': torch.cat(all_pixel_values, dim=0),
                'image_sizes': torch.cat(all_image_sizes, dim=0)}
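A minimal usage sketch, assuming the Hugging Face Phi-3.5 Vision checkpoint (train.jsonl and images/ are placeholder paths, and the returned keys follow the assumed aggregation above):

    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(
        'microsoft/Phi-3.5-vision-instruct', trust_remote_code=True
    )
    dataset = Phi3VDataset('train.jsonl', 'images', processor)
    item = dataset[0]
    print(item['input_ids'].shape, item['labels'].shape)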
Thanks for your answer @leestott.
I think the main misunderstanding is in the for loop:
for turn in example['conversations']:
    inputs = self._get_inputs(turn['user'], turn['images'])
    prompt_input_ids = inputs['input_ids']
    assistant_text = turn['assistant']
    response = f'{assistant_text}<|end|>\n<|endoftext|>'
    # Do not add bos token to answer
    response_input_ids = self.processor.tokenizer(
        response, add_special_tokens=False, return_tensors='pt'
    )['input_ids']
    input_ids = torch.cat([prompt_input_ids, response_input_ids], dim=1).squeeze(0)
    labels = torch.cat(
        [
            torch.tensor([IGNORE_INDEX] * len(prompt_input_ids[0])),
            response_input_ids.squeeze(0),
        ],
        dim=0,
    )
    all_input_ids.append(input_ids)
    all_labels.append(labels)
    all_pixel_values.append(inputs['pixel_values'])
    all_image_sizes.append(inputs['image_sizes'])
Should each turn's response include the <|endoftext|> token, i.e. response = f'{assistant_text}<|end|>\n<|endoftext|>', or only the final turn?
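If <|endoftext|> should close only the whole conversation, one possible variant (my sketch, not from the sample) appends it on the last turn only:

    def build_response(assistant_text: str, is_last_turn: bool) -> str:
        # every turn ends with <|end|>; <|endoftext|> only after the final turn
        suffix = '<|endoftext|>' if is_last_turn else ''
        return f'{assistant_text}<|end|>\n{suffix}'

    turns = ['A person doing push-ups.', 'Three repetitions.']  # hypothetical answers
    responses = [build_response(t, i == len(turns) - 1) for i, t in enumerate(turns)]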
And more importantly, can I build my data as below:
<|image_1|>\n<|user|>\n<|end|>\n<|assistant|>\n<|end|>\n<|user|>\n<|end|>\n<|assistant|>\n<|end|>\n<|endoftext|>
In this example the conversations list uses only one image in the first turn, 'images': [image], and the next turn has an empty list, 'images': [].
How should the entire for loop work then?
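One hedged way to support image-less turns is to skip tag generation and pass no images for them. This is a sketch under the assumption that the processor accepts a call without images, which is worth verifying:

    def _get_inputs(self, user_text, image_paths):
        # turns without images get no <|image_k|> tags and no pixel values
        images = [Image.open(self.image_dir / p) for p in image_paths]
        image_tag_text = ''.join(f'<|image_{i + 1}|>' for i in range(len(images)))
        content = f'{image_tag_text}\n{user_text}' if images else user_text
        prompt = self.processor.tokenizer.apply_chat_template(
            [{'role': 'user', 'content': content}], tokenize=False, add_generation_prompt=True
        )
        # assumption: passing images=None for a text-only turn; verify with your processor
        return self.processor(prompt, images if images else None, return_tensors='pt')

Two caveats: for text-only turns the returned inputs will not contain 'pixel_values' or 'image_sizes', so the loop should guard those appends; and if a later turn does add images, the <|image_k|> numbering may need to continue across turns rather than restart at 1, depending on how pixel values are concatenated for the model.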
Minimal steps to reproduce
An example of the ucf101 dataset converted following the fine-tune Phi-3.5 Vision sample. I'm using the same code as was provided for the ucf101 dataset and got an error message saying the number of image tags does not match the number of images. Which is OK to understand, but how do I change the __getitem__ loop to handle multiple images?