meta-llama / llama-models

Utilities intended for use with Llama models.

llama3.2 training loss is always zero #198

Open hessaAlawwad opened 3 weeks ago

hessaAlawwad commented 3 weeks ago

Hello,

I am trying to SFT-train the Llama 3.2 11B Vision Instruct model on a dataset where each sample answers a question about an image using a context (which may contain more than one image). My code is:

def format_data(sample):
    # Load images from the sample
    images = load_images(sample.get("image", []))

    # Extract images as needed
    q_image = images[0] 

    # Extract the question and the answer from the conversation turns
    # (assumption: the question is the "human" turn and the answer is the "gpt" turn)
    question = next((conv["value"] for conv in sample.get("conversations", []) if conv.get("from") == "human"), "")
    answer = next((conv["value"] for conv in sample.get("conversations", []) if conv.get("from") == "gpt"), "no answer")

    # Context passage accompanying the question
    # (assumption: the passage is stored on the sample under "context"; adjust to your dataset schema)
    context = sample.get("context", "")

    # Define an initial model prompt describing the task and giving the model the context passage
    instruction_prompt_template = '''
    You are a helpful assistant tasked with answering questions from a given multimodal context (images and texts). Please infer the answer from the context and respond.

    Context: {context}'''

    # Prepare the messages array for the model input
    messages = [{"role": "user", "content": []}]
    messages[0]['content'].append({"type": "text", "text": instruction_prompt_template.format(context=context)})
    messages[0]["content"].append({"type": "image", "image": q_image})
    messages[0]["content"].append({"type": "text", "text": question})

    sample_conversation = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": sample_conversation, "messages": messages, "answer": answer}

I am also trying to define a collator function for the SFT trainer, and I have two questions.

  1. My first question: when I prepare the text column, should I format it like this:

<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful assistant tasked with answering questions from a given context. Please infer the answer from the context and respond. Context: Most fossils are preserved by one of five processes outlined below (Figure 1.1): 1. What is the traditional definition of gravity? 2. Identify factors that influence the strength of gravity between two objects. Despite these problems, there is a rich fossil record. How does an organism become fossilized? <|image|> How many actions are depicted in the diagram?<|eot_id|> <|start_header_id|>assistant<|end_header_id|>7<|eot_id|>

     i.e., with the <|image|> placeholder in the text? Or should I insert the actual image, or its path?

  2. My second question: I am not sure how to define the collator function. I am getting an all-zero training loss, and I think this is because the loss is being calculated over the whole response. How can I define this collator? (A possible sketch is included below.)
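
For reference, here is a rough sketch of one way such a collator could look. It is only a sketch under assumptions: it assumes the Hugging Face MllamaProcessor is available as processor with a pad token set, and that each example still carries the messages and answer fields produced by format_data above. The idea is to append the ground-truth answer as an assistant turn and set the labels of everything except the answer tokens to -100, so the loss is computed only on the answer.

def collate_fn(examples):
    texts, images = [], []
    for ex in examples:
        # Append the ground-truth answer as an assistant turn so it becomes part of the training text
        convo = ex["messages"] + [
            {"role": "assistant", "content": [{"type": "text", "text": ex["answer"]}]}
        ]
        texts.append(processor.apply_chat_template(convo))
        # Collect the image(s) referenced in the user turn
        images.append([c["image"] for c in ex["messages"][0]["content"] if c["type"] == "image"])

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    labels = batch["input_ids"].clone()
    # Do not compute loss on padding or on the <|image|> token
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == processor.tokenizer.convert_tokens_to_ids("<|image|>")] = -100

    # Mask everything up to and including the last assistant header so that only
    # the answer tokens contribute to the loss
    header_ids = processor.tokenizer.encode(
        "<|start_header_id|>assistant<|end_header_id|>", add_special_tokens=False
    )
    for row in range(labels.size(0)):
        ids = batch["input_ids"][row].tolist()
        for i in range(len(ids) - len(header_ids), -1, -1):
            if ids[i:i + len(header_ids)] == header_ids:
                labels[row, : i + len(header_ids)] = -100
                break

    batch["labels"] = labels
    return batch

Such a collator would then typically be passed to the trainer as data_collator=collate_fn, with remove_unused_columns=False so the raw messages and answer columns actually reach the collator.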

Thank you in advance.

varunfb commented 1 week ago

Please check out the documentation in the prompt format guide here. The placement of the <|image|> tag is important: the text prompt should always come after the image tag, not before it. The image itself is not part of the text prompt. You can learn more about how images are handled in Llama here.
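
To illustrate that ordering, here is a minimal sketch assuming the Hugging Face MllamaProcessor for Llama-3.2-11B-Vision-Instruct: the image entry comes before the text in the user message, the chat template then emits the <|image|> tag, and the actual image is passed to the processor separately rather than being pasted into the prompt string (the file name below is just a placeholder).

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

image = Image.open("diagram.jpg")  # placeholder path for the question image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},  # image entry first ...
        {"type": "text", "text": "How many actions are depicted in the diagram?"},  # ... text after it
    ],
}]

# The template inserts the <|image|> placeholder into the text prompt
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# The image itself is passed separately; it never appears inside the prompt string
inputs = processor(images=image, text=prompt, return_tensors="pt")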