unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

How to fine-tune LLaMA 3.2 11B Vision using LoRA with the recent update? #1319

Open yukiarimo opened 12 hours ago

yukiarimo commented 12 hours ago

I saw you used something like this:

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision part
    finetune_language_layers   = True, # False if not finetuning language part
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

But for the (non-vision) LLaMA 3.1 8B I used something like this:

model = FastLanguageModel.get_peft_model(
    model,
    r = 256, # 128 or 256
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"
                      ],
    lora_alpha = 256, # 128 or 256
    lora_dropout = 0.1,
    bias = "all", # "all"
    use_gradient_checkpointing = "unsloth", # True - don't use False
    random_state = 42,
    use_rslora = True,
    loftq_config = None, # And LoftQ
)

So, can I do the same here, and what do these new options mean?

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

Also, my (raw text only) dataset looks like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Who is Elon Musk?</yuki>
<yuna>He's a cool guy</yuna>

So, how do I do that for the image? Can I make something like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Please describe how I look. <data>{image_tokens}</data></yuki>
<yuna>You're adorable!</yuna>

Note: <yuki>, </yuki>, <yuna>, </yuna>, <data>, </data>, <kanojo>, </kanojo>, and <dialog> are custom special tokens that I added to the vocabulary!
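
For context, tokens like these are usually registered with the standard transformers API along the following lines (a generic sketch, not Unsloth-specific code; the checkpoint name is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; use whatever base model is being finetuned.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

special_tokens = ["<dialog>", "<kanojo>", "</kanojo>", "<yuki>", "</yuki>",
                  "<yuna>", "</yuna>", "<data>", "</data>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))  # new rows in embed_tokens / lm_head

This is also why the LoRA config above targets embed_tokens and lm_head: the freshly added embedding rows start out untrained.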

danielhanchen commented 7 hours ago

Hey! Oh hmm, for now you need to have 1 image paired with text during finetuning - I'm working on allowing (text only) + (text + image) finetuning, but for now that'll require a custom data collator.
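
For reference, a minimal sketch of what such a custom collator could look like, assuming each dataset row carries a raw "text" string plus an optional "image" (a PIL image or None). The checkpoint name, the mixed_collate_fn name, and the label masking are illustrative assumptions, not Unsloth's exact recipe:

from transformers import AutoProcessor

# Illustrative checkpoint name; any Llama 3.2 Vision processor would do.
processor = AutoProcessor.from_pretrained("unsloth/Llama-3.2-11B-Vision-Instruct")

def mixed_collate_fn(examples):
    texts  = [ex["text"] for ex in examples]
    images = [ex.get("image") for ex in examples]

    if all(img is None for img in images):
        # Pure text batch: skip image preprocessing entirely.
        batch = processor(text=texts, return_tensors="pt", padding=True)
    else:
        # Image batch: every row here must actually carry an image; mixing
        # text-only and image rows inside one batch is the unsupported part.
        batch = processor(text=texts, images=images,
                          return_tensors="pt", padding=True)

    # Standard causal-LM labels: predict the inputs, ignore padding positions.
    labels = batch["input_ids"].clone()
    pad_id = processor.tokenizer.pad_token_id
    if pad_id is not None:
        labels[labels == pad_id] = -100
    batch["labels"] = labels
    return batch

A collator along these lines would then be passed to the trainer via its data_collator argument.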

yukiarimo commented 6 hours ago

I see. My data loading and formatting code is:

from datasets import load_dataset

# The dataset already stores fully formatted raw text, so the formatting
# function just passes the "text" column through.
def formatting_prompts_func(examples):
    texts = examples["text"]
    return {"text": texts}

dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")

I would like to know how to put the image (image tokens) inside, so I can maybe hard-code it and drop it into the raw dataset as I did before. Also, I would like to be able to leave the image out at the beginning and maybe use multiple images. Any suggestions?
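
One possible shape for that, purely as a sketch: keep the raw-text format, add a hypothetical "image" column holding a file path (or null for text-only rows), and put the <|image|> placeholder that the Llama 3.2 Vision processor expects where the <data>{image_tokens}</data> span currently sits. The load_optional_images helper and the column name are assumptions, not an existing Unsloth interface:

from datasets import load_dataset
from PIL import Image

# Hypothetical JSONL rows:
# {"text": "<|begin_of_text|>\n<dialog>\n<kanojo>...</kanojo>\n<yuki>Please describe how I look. <|image|></yuki>\n<yuna>You're adorable!</yuna>", "image": "selfie.jpg"}
# {"text": "<|begin_of_text|>\n<dialog>...", "image": null}
dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")["train"]

def load_optional_images(batch):
    # Decode paths into PIL images on the fly; text-only rows keep None so a
    # collator like the one sketched above can branch on them.
    batch["image"] = [Image.open(p).convert("RGB") if p else None
                      for p in batch["image"]]
    return batch

# Applied lazily at access time, so nothing is rewritten on disk.
dataset = dataset.with_transform(load_optional_images)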