microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[Kosmos-2] Fine-tune your checkpoint model on my downstream task #1429

Open basteran opened 8 months ago

basteran commented 8 months ago

Hello everyone, thank you very much for your contribution. I appreciate the effort and consistency in uploading the code for so many models and maintaining this repository.

I saw Kosmos-2 and quickly thought I could fine-tune it on my downstream task, but I couldn't find any example of how to do it. I see there is a short "guide" here for training the model, but I can't tell whether it refers to pre-training or to further fine-tuning; I'm interested in the latter.

So I tried to implement it myself using the transformers library, but I'm getting errors about the data.

Here is my environment: ``` accelerate==0.25.0 ai2thor==5.0.0 aiofiles==23.2.1 aiohttp==3.9.1 aiosignal==1.3.1 altair==5.2.0 annotated-types==0.6.0 antlr4-python3-runtime==4.8 anyio==4.2.0 apex @ file:///home/user/unilm/kosmos-2/apex async-timeout==4.0.3 attrs==23.2.0 aws-requests-auth==0.4.3 bitarray==2.9.2 blinker==1.7.0 blis==0.7.11 botocore==1.34.12 canonicaljson==2.0.0 catalogue==2.0.10 certifi==2023.11.17 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 confection==0.1.4 contourpy==1.2.0 cryptography==41.0.7 cycler==0.12.1 cymem==2.0.8 Cython==3.0.7 datasets==2.16.1 decorator==4.4.2 deepspeed @ git+https://github.com/microsoft/DeepSpeed.git@165739a508431c9d05a456ca68535edf599cc51f Deprecated==1.2.14 dill==0.3.7 exceptiongroup==1.2.0 fairscale==0.4.0 fairseq @ file:///home/user/unilm/kosmos-2/fairseq fastapi==0.108.0 ffmpy==0.3.1 filelock==3.13.1 Flask==3.0.0 fonttools==4.47.0 frozenlist==1.4.1 fsspec==2023.10.0 ftfy==6.1.3 gradio==3.37.0 gradio_client==0.8.0 h11==0.14.0 httpcore==0.17.3 httpx==0.25.1 huggingface-hub==0.20.2 hydra-core==1.0.7 idna==3.6 imageio==2.33.1 imageio-ffmpeg==0.4.9 infinibatch @ file:///home/user/unilm/kosmos-2/infinibatch itsdangerous==2.1.2 Jinja2==3.1.2 jmespath==1.0.1 jsonschema==4.20.0 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 langcodes==3.3.0 linkify-it-py==2.0.2 lxml==5.1.0 markdown-it-py==2.2.0 MarkupSafe==2.1.3 matplotlib==3.8.2 mdit-py-plugins==0.3.3 mdurl==0.1.2 moviepy==1.0.3 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 murmurhash==1.0.10 natsort==8.4.0 networkx==3.2.1 ninja==1.11.1.1 numpy==1.23.0 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 omegaconf==2.0.6 open-clip-torch @ file:///home/user/unilm/kosmos-2/open_clip opencv-python==4.9.0.80 opencv-python-headless==4.8.0.74 orjson==3.9.10 packaging==23.2 pandas==2.1.4 pathy==0.10.3 pillow==10.2.0 portalocker==2.8.2 preshed==3.0.9 prior==1.0.3 proglog==0.1.10 progressbar2==4.3.2 protobuf==3.20.3 psutil==5.9.7 pyarrow==14.0.2 pyarrow-hotfix==0.6 pycparser==2.21 pydantic==1.10.11 pydantic_core==2.14.6 pydub==0.25.1 PyGithub==2.1.1 PyJWT==2.8.0 PyNaCl==1.5.0 pyparsing==3.1.1 python-dateutil==2.8.2 python-dotenv==1.0.0 python-fcl==0.7.0.5 python-multipart==0.0.6 python-sat==0.1.8.dev12 python-utils==3.8.1 python-xlib==0.33 pytz==2023.3.post1 PyYAML==6.0.1 referencing==0.32.1 regex==2023.12.25 requests==2.31.0 rpds-py==0.16.2 sacrebleu==2.4.0 safetensors==0.4.1 scipy==1.8.0 semantic-version==2.10.0 sentencepiece==0.1.99 shapely==2.0.2 six==1.16.0 smart-open==6.4.0 sniffio==1.3.0 spacy==3.6.0 spacy-legacy==3.0.12 spacy-loggers==1.0.5 srsly==2.4.8 starlette==0.32.0.post1 tabulate==0.9.0 tensorboardX==1.8 thinc==8.1.10 tiktoken==0.5.2 timm==0.4.12 tokenizers==0.15.0 toolz==0.12.0 torch==1.13.0 torchscale @ file:///home/user/unilm/kosmos-2/torchscale torchvision==0.14.0 tqdm==4.66.1 transformers==4.36.2 trimesh==4.0.8 triton==2.1.0 typer==0.9.0 typing_extensions==4.9.0 tzdata==2023.4 uc-micro-py==1.0.2 urllib3==2.0.7 uvicorn==0.25.0 wasabi==1.1.2 wcwidth==0.2.13 websockets==11.0.3 Werkzeug==3.0.1 wrapt==1.16.0 xxhash==3.4.1 yarl==1.9.4 ```

Here is my code:

from PIL import Image
from datasets import load_dataset, Dataset
from transformers import AutoModelForVision2Seq, AutoProcessor, Trainer, TrainingArguments

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", device_map="auto")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224", device_map="auto")

# load dummy dataset from json file
train_data = load_dataset("json", data_files=tmp_train_file_name)
val_data = load_dataset("json", data_files=tmp_val_file_name)

# process the inputs, i.e. images and texts
def kosmos2_collate_fn(examples):
    images, texts = [], []
    for example in examples:
        image = Image.open(example['image_path'])
        images.append(image)
        texts.append(example['input_text'])

    inputs = processor(text=texts, images=images, return_tensors="pt").to(model.device)
    return Dataset.from_dict(inputs)

new_train_data = kosmos2_collate_fn(train_data)
new_val_data = kosmos2_collate_fn(val_data)

training_arguments = TrainingArguments(
    remove_unused_columns=False, 
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_ratio=0,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    logging_strategy="steps",
    logging_steps=1,
    optim="adamw_torch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir=OUTPUT_DIR,
    save_total_limit=1,
    load_best_model_at_end=True,
    label_names=["labels"]
)

trainer = Trainer(
    model=model,
    train_dataset=new_train_data,
    eval_dataset=new_val_data,
    args=training_arguments,
)

trainer.train()

and the resulting errors:

Generating train split: 40 examples [00:00, 8627.15 examples/s]
Generating train split: 6 examples [00:00, 2428.20 examples/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
  0%|          | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/user/kosmos2/train.py", line 193, in <module>
    trainer.train()
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in compute_loss
    raise ValueError(
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.
  0%|          | 0/10 [00:03<?, ?it/s] 

I can't figure out the issue. It says that the model did not return a loss, which means it didn't compute it.

Can anyone help me? I tried to look for solutions online, but I found nothing useful. @ydshieh @donglixp @pengzhiliang @pineking

Thank you in advance.

piperino11 commented 8 months ago

Any news?

basteran commented 8 months ago

No news, still waiting for someone to answer. It looks like the processor did not return any labels and the Trainer could not compute the loss...

ydshieh commented 8 months ago

Hi! The (HF implementation) model can compute and return the loss if labels is passed to the model, but in your own training script you need to prepare the labels yourself. Kosmos2Processor does not create a labels field, but you can use it to prepare input_ids and pass a copy of them as labels (in your own training script). You will have to change some positions to -100 (the ignore_index of torch's CrossEntropyLoss): the padded positions plus the first tokens that are used as the prompt.

I hope this gives you some direction.
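As a concrete illustration, here is a minimal sketch (mine, not from the repo or the thread) of a collate function that builds such labels: it copies input_ids and sets the padded positions to -100, reusing the image_path / input_text fields from the script above. Masking the prompt tokens is left out here and shown further down.

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

def collate_fn(examples):
    images = [Image.open(e["image_path"]) for e in examples]
    texts = [e["input_text"] for e in examples]
    # Pad to the longest sequence in the batch and return tensors
    inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Labels are a copy of input_ids with padded positions set to -100,
    # the ignore_index of torch's CrossEntropyLoss
    labels = inputs["input_ids"].clone()
    labels[inputs["attention_mask"] == 0] = -100
    inputs["labels"] = labels
    return inputs
```

With something like this, you would pass data_collator=collate_fn to Trainer and keep the raw JSON dataset (with its image_path / input_text columns) as train_dataset, instead of pre-processing everything up front.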

basteran commented 8 months ago

Thank you for the tip. I think I had already figured that out and I'm trying to prepare the labels before training, but I hadn't thought about masking the input prompt. I will let you know if that works!

Do you have any other information about the input format? I cannot find anything useful in this repository.
How should I format the input sentences before passing them to the Processor? My concern is: should I add the <grounding>, <object>, <patch_> and other tags before processing the sentence? For both the prompt and the labels?

ydshieh commented 8 months ago

Hi again.

Kosmos2Processor will behave differently depending on the inputs you pass to it.

> should I add the <grounding>, <object>, <patch_> and other tags.

If you don't pass bboxes to Kosmos2Processor, then yes, you need to add <object>, <patch_>, and delimiter_of_multi_objects yourself. Otherwise, Kosmos2Processor can create them by looking at the bboxes argument.

You can go either way, but in both cases you need to have the bboxes available: the difference is whether you pass them as an argument or add them directly into the text (and don't pass them to the processor). If you decide to add them into the text manually, see this example:

https://github.com/huggingface/transformers/blob/b2748a6efd045dd771f8fd48e8b309cbc061c618/src/transformers/models/kosmos2/modeling_kosmos2.py#L1795-L1801

IMPORTANT: pay attention to add_eos_token: it defaults to False (which is what you want for inference), but for training you need to set it to True (so you get the eos token id in the input).
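To make the two options concrete, here is a rough sketch (my own; the image path, phrase, box coordinates, and patch indices are made up) of calling the processor with bboxes versus writing the tags into the text yourself:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
image = Image.open("example.jpg")  # placeholder image

# Option 1: pass bboxes (normalized x1, y1, x2, y2, one per <phrase>) and let
# the processor insert the <object>/<patch_index_*> tokens for you.
text = "<grounding> Go to <phrase> the red car</phrase>"
inputs_with_bboxes = processor(
    text=text,
    images=image,
    bboxes=[(0.35, 0.40, 0.70, 0.85)],
    add_eos_token=True,  # needed for training so the sequence ends with the eos token
    return_tensors="pt",
)

# Option 2: write the patch-index tokens into the text manually and do not
# pass bboxes (see the modeling_kosmos2.py docstring linked above for the format).
text = (
    "<grounding> Go to <phrase> the red car</phrase>"
    "<object><patch_index_0044><patch_index_0863></object>"
)
inputs_manual = processor(text=text, images=image, add_eos_token=True, return_tensors="pt")
```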

basteran commented 8 months ago

Thank you @ydshieh for your invaluable help!

As an example, what I want to do is image-grounded command interpretation, i.e. I have an image and a command from a human to a robot. The robot needs to interpret the command and ground any relevant objects in the image by producing the bounding box and the reference through the image patches.

The input would be '<grounding> Go to <phrase>the red car</phrase>' coupled with an image where there's a red car. The output should be the interpretation with the patch indices that refer to the bounding boxes of the car, something like 'Motion(Goal(<object><patch_indexes_here></object>))'.

Now, given your directions, here is my problem: I have stored the bounding boxes of the car in the image, but the patch indices should appear in both the prompt (since I am specifying the <phrase>) and the labels (for the <patches>)! Is there any way to do this with the Kosmos Processor?

ydshieh commented 8 months ago

The Kosmos2Processor processing logic requires the <object><patch_indexes_here></object> part to come immediately after <phrase> xxx </phrase>, so if you want to have extra text like Motion(Goal(, you will have to make some modifications (I know, it's not easy).

In terms of prompt and labels, you don't really need to separate them. Just use the whole text (well, make a copy of the input_ids field) and change the positions corresponding to non-label tokens to -100.
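For example, a rough sketch (my own; the texts, image path, and patch indices below are placeholders, not a real Kosmos-2 example) of masking everything up to and including the prompt with -100:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
image = Image.open("example.jpg")  # placeholder image

prompt_text = "<grounding> Go to <phrase> the red car</phrase>"
full_text = prompt_text + " Motion(Goal(<object><patch_index_0044><patch_index_0863></object>))"

# Encode the whole text (with eos for training) and the prompt alone
inputs = processor(text=full_text, images=image, add_eos_token=True, return_tensors="pt")
prompt_ids = processor(text=prompt_text, images=image, return_tensors="pt")["input_ids"]

# Labels = copy of input_ids, with the image placeholders + prompt tokens
# (the leading prompt_ids.shape[1] positions) and any padding set to -100
labels = inputs["input_ids"].clone()
labels[:, : prompt_ids.shape[1]] = -100
labels[inputs["attention_mask"] == 0] = -100
inputs["labels"] = labels
```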

basteran commented 7 months ago

Hi @ydshieh, I figured out how to change the Kosmos Processor to add those extra tokens, and I started fine-tuning the model on my downstream task. But there is a problem with the bounding boxes the model generates: they are totally off!

So I was wondering: is the Visual Model fine-tuned as well during this procedure, or just the Language Model? The text the model generates is consistent with what I expect, i.e. the ground truth, but the Visual Encoder does not seem aligned.

Do you have any considerations? Answers? Tips?

Thank you again.

ydshieh commented 7 months ago

Hi @basteran

The whole model is fine-tuned unless you do something specific to freeze the vision encoder part of it (although, I would say, in your case it might make more sense to freeze the vision encoder).
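If you go the freezing route, a minimal sketch (assuming the vision_model attribute name of the HF Kosmos2ForConditionalGeneration; double-check it on your version) would be:

```python
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

# Freeze the vision encoder so only the text model and projection are updated
for param in model.vision_model.parameters():
    param.requires_grad = False
```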

You can use a small subset of (your) dataset, say 100 or 1,000 examples, fine-tune the model (on the data prepared by your modified Kosmos2 processor), try to fit it as closely as possible, and see whether the model gives the expected results during inference on the trained examples. If the model can't even reproduce the desired results, then something is wrong.

This question is best discussed on the HF forum. From my side, I don't have a clear answer other than the above suggestion.

basteran commented 7 months ago

> The whole model is fine-tuned unless you do something specific to freeze the vision encoder part of it (although, I would say, in your case it might make more sense to freeze the vision encoder).

That's what I thought, but it doesn't make sense: the bounding boxes are totally off or land on blank spaces.

> You can use a small subset of (your) dataset, say 100 or 1,000 examples, fine-tune the model (on the data prepared by your modified Kosmos2 processor), try to fit it as closely as possible, and see whether the model gives the expected results during inference on the trained examples. If the model can't even reproduce the desired results, then something is wrong.

Thanks for the suggestion, I already tried that and the model didn't give the desired results. This made me think that maybe the Visual Encoder wasn't doing any fine-tuning.

> This question is best discussed on the HF forum. From my side, I don't have a clear answer other than the above suggestion.

Well, thank you for your support until now. Do you have any plans on releasing a working example for fine-tuning the Kosmos-2 model on downstream tasks?

ydshieh commented 7 months ago

> the Visual Encoder wasn't doing any fine-tuning

If you would like, you can actually compare the original checkpoint weights vs. the fine-tuned checkpoint weights. That way, you will know for sure.
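For instance, a quick sanity check along those lines (my own sketch; "path/to/finetuned" is a placeholder for your output directory, and the "vision_model." prefix assumes the HF module names):

```python
import torch
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
tuned = AutoModelForVision2Seq.from_pretrained("path/to/finetuned")  # placeholder path

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()

# Count how many vision-encoder tensors actually changed during fine-tuning
changed = [
    name
    for name in base_sd
    if name.startswith("vision_model.") and not torch.equal(base_sd[name], tuned_sd[name])
]
print(f"{len(changed)} vision_model tensors differ from the original checkpoint")
```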

> Do you have any plans on releasing a working example for fine-tuning the Kosmos-2 model on downstream tasks?

Not in the (short-term) plan. That being said, if you can provide a Google Colab showing your training (of course, with a very small dataset you prepared; even 10 examples is enough), I would be happy to take a look.

basteran commented 7 months ago

@ydshieh I added you to a private repository containing my notebook and a sample of data to run the code. Let me know what you think and how we can communicate!

Ericodencoder commented 7 months ago

> Kosmos2Processor

Hi basteran, I am also working on fine-tuning KOSMOS-2 and running into the same issues as you. I think one of the reasons is that KOSMOS-2 is a causal model, so you don't need to specify the 'labels' for the model; the trainer will build the mask automatically. I am also looking forward to a tutorial for fine-tuning KOSMOS-2, @ydshieh, if that would be convenient. Thanks a lot!

basteran commented 7 months ago

> Kosmos2Processor
>
> Hi basteran, I am also working on fine-tuning KOSMOS-2 and running into the same issues as you. I think one of the reasons is that KOSMOS-2 is a causal model, so you don't need to specify the 'labels' for the model; the trainer will build the mask automatically. I am also looking forward to a tutorial for fine-tuning KOSMOS-2, @ydshieh, if that would be convenient. Thanks a lot!

Hi, that's true, but causal masking is used during the unsupervised (pre-)training of the model. If you want to perform classical supervised fine-tuning (i.e. on input-output pairs), you need to specify the labels and set the masking yourself; otherwise the model will generate part of the input as well.

mit1280 commented 6 months ago

Hi @basteran, can you share your fine-tuning script? I am getting the same error as you:

ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.

I am trying to fine-tune the model using Kosmos2ForConditionalGeneration.

If you are having issues with the bbox offsets, refer to https://discuss.huggingface.co/t/issue-with-kosmos-2-encoding-and-decoding/70019/2

basteran commented 6 months ago

Hi @mit1280, @ydshieh, I am sharing here with you a Notebook for the training. I tried to explain my task and what I changed in the original code:

Notebook

Unfortunately, I cannot add any examples of the data, so the code will not run as-is. Let me know your thoughts, @ydshieh.

Thank you again.

mit1280 commented 6 months ago

Thanks @basteran for sharing your work. I think it would be great if we moved this to a Hugging Face discussion; I think we will get more input there.

The code looks almost identical to mine except for the dataset creation. I think we need to create a custom loss function like this:

import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        logits = outputs.logits

        # Logits at position i predict the token at position i + 1, so drop the last one
        predicted_logits = logits[:, :-1].contiguous()

        # Flatten logits and labels for loss computation
        logits_flat = predicted_logits.view(-1, predicted_logits.size(-1))
        labels_flat = inputs["input_ids"][:, 1:].contiguous().view(-1)  # assuming input_ids contains the target sequence

        # Calculate the cross-entropy loss
        loss = F.cross_entropy(logits_flat, labels_flat)

        return (loss, outputs) if return_outputs else loss
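
For reference, this would simply replace the stock Trainer in the earlier script, with the other arguments unchanged:

```python
trainer = CustomTrainer(
    model=model,
    train_dataset=new_train_data,
    eval_dataset=new_val_data,
    args=training_arguments,
)
trainer.train()
```
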
basteran commented 6 months ago

@mit1280 please mention me in the Hugging Face discussion or send me a link.

mit1280 commented 6 months ago

@basteran there is none right now. I will create one and tag you there.

mit1280 commented 6 months ago

@basteran here you go https://discuss.huggingface.co/t/kosmos-fine-tuning/75691

ydshieh commented 6 months ago

See my response on https://discuss.huggingface.co/t/kosmos-fine-tuning/75691 (essentially, I need a code snippet or notebook that could run easily to see the issue)

basteran commented 6 months ago

> See my response on https://discuss.huggingface.co/t/kosmos-fine-tuning/75691 (essentially, I need a code snippet or notebook that could run easily to see the issue)

I can add you to a private repository with the Notebook and the data (I don't know how to share it with you otherwise)