pretbc opened this issue 2 weeks ago
It looks like you're working on a multi-turn conversation dataset and hit an error about the number of image tags not matching the number of images. Let's break down the issue and see how to resolve it.
The error message indicates that the number of unique image tags does not match the number of images provided: you have 1 image tag but 2 images, and this discrepancy is what makes the assertion fail.
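For illustration, here is a minimal sketch of the failing condition using dummy images (the prompt text is made up; the final check mirrors the kind of assertion the processor performs):

    from PIL import Image

    # two images, but only one <|image_k|> tag in the prompt text
    images = [Image.new('RGB', (32, 32)), Image.new('RGB', (32, 32))]
    prompt = '<|image_1|>\nWhat action is shown in these frames?'

    num_tags = sum(f'<|image_{i + 1}|>' in prompt for i in range(len(images)))
    # 1 != 2, so this assertion fails, just like the one raised in your traceback
    assert num_tags == len(images), f'{num_tags} image tag(s) but {len(images)} image(s)'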
1. Check image tag generation: ensure that the number of image tags generated matches the number of images. In your _get_inputs method, image_tag_text should contain as many tags as there are images.
2. Verify input data: make sure that each conversation turn in your dataset has the correct number of images and corresponding tags (a pre-flight scan is sketched after this list).
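One easy way to trip this assertion is user text that already embeds <|image_k|> tags, which get counted on top of the ones _get_inputs prepends. This is a hedged sketch of such a scan: train.jsonl and the conversations/user field names are assumptions matching the dataset class further down.

    import json
    import re

    with open('train.jsonl') as f:  # hypothetical dataset file
        for line_no, line in enumerate(f, 1):
            example = json.loads(line)
            for turn_idx, turn in enumerate(example['conversations']):
                embedded = re.findall(r'<\|image_\d+\|>', turn['user'])
                if embedded:
                    print(f'line {line_no}, turn {turn_idx}: user text already has {embedded}')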
Here's a modified version of your _get_inputs method that ensures the number of image tags matches the number of images:
def _get_inputs(self, user_text, image_paths):
    images = [Image.open(self.image_dir / image_path) for image_path in image_paths]
    image_tag_text = ''.join([f'<|image_{i+1}|>' for i in range(len(images))])
    prompt_message = {'role': 'user', 'content': f'{image_tag_text}\n{user_text}'}
    prompt = self.processor.tokenizer.apply_chat_template(
        [prompt_message], tokenize=False, add_generation_prompt=True
    )
    inputs = self.processor(prompt, images, return_tensors='pt')
    return inputs
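As a quick check that the generated prefix always matches the image count (assuming user_text itself contains no tags):

    image_paths = ['frame_00.jpg', 'frame_01.jpg']  # hypothetical paths
    image_tag_text = ''.join(f'<|image_{i + 1}|>' for i in range(len(image_paths)))
    assert image_tag_text == '<|image_1|><|image_2|>'  # one tag per image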
To work with multi-turn conversations in your dataset, you need to ensure that each turn is processed correctly and that the inputs and outputs are aligned. Here's a general approach:
1. Dataset structure: each example should contain multiple conversation turns, with each turn holding the user input, the assistant response, and any associated images (an example line is shown after this list).
2. Processing each turn: for each turn, generate the appropriate input IDs and labels, then concatenate these across all turns in an example.
3. Handling images: ensure that the number of image tags matches the number of images for each turn.
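For instance, one line of the JSONL file could look like this (hypothetical content; the field names match the class below):

    example = {
        'conversations': [
            {
                'user': 'What action is shown in these frames?',
                'images': ['clip_0001/frame_00.jpg', 'clip_0001/frame_01.jpg'],
                'assistant': 'A person doing push-ups.',
            },
            {
                'user': 'How many repetitions are visible?',
                'images': [],  # later turns may carry no new images
                'assistant': 'Three full repetitions.',
            },
        ]
    }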
Here's a simplified version of your dataset class with the necessary adjustments:
import copy
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset

IGNORE_INDEX = -100  # label value the loss function ignores

class Phi3VDataset(Dataset):
    def __init__(self, jsonl_file: str, image_dir: str, processor):
        self.image_dir = Path(image_dir)
        with open(jsonl_file) as f:
            self.examples = [json.loads(line) for line in f]
        self.processor = processor

    def __len__(self):
        return len(self.examples)

    def shard(self, num_shards, shard_id):
        num_data = len(self.examples)
        sharded = copy.deepcopy(self)
        sharded.examples = [self.examples[i] for i in range(shard_id, num_data, num_shards)]
        return sharded

    def _get_inputs(self, user_text, image_paths):
        images = [Image.open(self.image_dir / image_path) for image_path in image_paths]
        image_tag_text = ''.join([f'<|image_{i+1}|>' for i in range(len(images))])
        prompt_message = {'role': 'user', 'content': f'{image_tag_text}\n{user_text}'}
        prompt = self.processor.tokenizer.apply_chat_template(
            [prompt_message], tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(prompt, images, return_tensors='pt')
        return inputs

    def __getitem__(self, idx):
        example = self.examples[idx]
        all_input_ids = []
        all_labels = []
        all_pixel_values = []
        all_image_sizes = []
        for turn in example['conversations']:
            inputs = self._get_inputs(turn['user'], turn['images'])
            prompt_input_ids = inputs['input_ids']
            assistant_text = turn['assistant']
            response = f'{assistant_text}<|end|>\n<|endoftext|>'
            # Do not add bos token to answer
            response_input_ids = self.processor.tokenizer(
                response, add_special_tokens=False, return_tensors='pt'
            )['input_ids']
            input_ids = torch.cat([prompt_input_ids, response_input_ids], dim=1).squeeze(0)
            # mask prompt tokens so the loss only covers the response
            labels = torch.cat(
                [torch.tensor([IGNORE_INDEX] * len(prompt_input_ids[0])),
                 response_input_ids.squeeze(0)],
                dim=0,
            )
            all_input_ids.append(input_ids)
            all_labels.append(labels)
            all_pixel_values.append(inputs['pixel_values'])
            all_image_sizes.append(inputs['image_sizes'])
        # assumed aggregation: concatenate all turns into single tensors
        return {'input_ids': torch.cat(all_input_ids, dim=0),
                'labels': torch.cat(all_labels, dim=0),
                'pixel_values': torch.cat(all_pixel_values, dim=0),
                'image_sizes': torch.cat(all_image_sizes, dim=0)}
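A minimal usage sketch, assuming the Hugging Face Phi-3.5 Vision checkpoint (train.jsonl and images/ are placeholder paths, and the returned keys follow the assumed aggregation above):

    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(
        'microsoft/Phi-3.5-vision-instruct', trust_remote_code=True
    )
    dataset = Phi3VDataset('train.jsonl', 'images', processor)
    item = dataset[0]
    print(item['input_ids'].shape, item['labels'].shape)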
Thanks for your answer @leestott.
I think the main misunderstanding is in the for loop:
for turn in example['conversations']:
    inputs = self._get_inputs(turn['user'], turn['images'])
    prompt_input_ids = inputs['input_ids']
    assistant_text = turn['assistant']
    response = f'{assistant_text}<|end|>\n<|endoftext|>'
    # Do not add bos token to answer
    response_input_ids = self.processor.tokenizer(
        response, add_special_tokens=False, return_tensors='pt'
    )['input_ids']
    input_ids = torch.cat([prompt_input_ids, response_input_ids], dim=1).squeeze(0)
    labels = torch.cat(
        [
            torch.tensor([IGNORE_INDEX] * len(prompt_input_ids[0])),
            response_input_ids.squeeze(0),
        ],
        dim=0,
    )
    all_input_ids.append(input_ids)
    all_labels.append(labels)
    all_pixel_values.append(inputs['pixel_values'])
    all_image_sizes.append(inputs['image_sizes'])
Should each turn's response include the <|endoftext|> token, i.e. response = f'{assistant_text}<|end|>\n<|endoftext|>', or only the final turn?
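If <|endoftext|> should close only the whole conversation, one possible variant (my sketch, not from the sample) appends it on the last turn only:

    def build_response(assistant_text: str, is_last_turn: bool) -> str:
        # every turn ends with <|end|>; <|endoftext|> only after the final turn
        suffix = '<|endoftext|>' if is_last_turn else ''
        return f'{assistant_text}<|end|>\n{suffix}'

    turns = ['A person doing push-ups.', 'Three repetitions.']  # hypothetical answers
    responses = [build_response(t, i == len(turns) - 1) for i, t in enumerate(turns)]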
And more importantly, can I build my data as below:
<|image_1|>\n<|user|>\n<|end|>\n<|assistant|>\n<|end|>\n<|user|>\n<|end|>\n<|assistant|>\n<|end|>\n<|endoftext|>
In this example the conversations list uses only one image in the first turn, 'images': [image], and the next turn has an empty list, 'images': [].
How should the entire for loop work then?
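One hedged way to support image-less turns is to skip tag generation and pass no images for them. This is a sketch under the assumption that the processor accepts a call without images, which is worth verifying:

    def _get_inputs(self, user_text, image_paths):
        # turns without images get no <|image_k|> tags and no pixel values
        images = [Image.open(self.image_dir / p) for p in image_paths]
        image_tag_text = ''.join(f'<|image_{i + 1}|>' for i in range(len(images)))
        content = f'{image_tag_text}\n{user_text}' if images else user_text
        prompt = self.processor.tokenizer.apply_chat_template(
            [{'role': 'user', 'content': content}], tokenize=False, add_generation_prompt=True
        )
        # assumption: passing images=None for a text-only turn; verify with your processor
        return self.processor(prompt, images if images else None, return_tensors='pt')

Two caveats: for text-only turns the returned inputs will not contain 'pixel_values' or 'image_sizes', so the loop should guard those appends; and if a later turn does add images, the <|image_k|> numbering may need to continue across turns rather than restart at 1, depending on how pixel values are concatenated for the model.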
Minimal steps to reproduce
An example of the ucf101 dataset converted following the fine-tune Phi-3.5 Vision sample. I'm using the same code as was provided for the ucf101 dataset and got an error message saying the number of image tags does not match the number of images. Which is OK to understand, but how do I change the __getitem__ loop to handle multiple images?