basteran opened this issue 10 months ago
Any news?
No news, still waiting for someone to answer. It looks like the processor did not return any `labels` and the `Trainer` could not compute the loss...
Hi! The (HF implementation) model can compute and return the loss if `labels` is sent to the model, but in your own training script, you need to prepare the labels yourself. It's not `Kosmos2Processor`'s job to create the `labels` field, but you can use it to prepare `input_ids` and pass that as `labels` (in your own training script). You will have to change some positions to `-100` (the `ignore_index` of torch's `CrossEntropyLoss`): the padded positions plus the first tokens that are used as the prompt.
I hope this gives you some direction.
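A minimal sketch of what this could look like, assuming you know the prompt length per example (`prompt_length` is a placeholder here, not a library argument):

```python
def build_labels(input_ids, attention_mask, prompt_length):
    # input_ids / attention_mask are the torch tensors returned by the processor.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100    # ignore padded positions
    labels[:, :prompt_length] = -100      # ignore the prompt tokens
    return labels
```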
Thank you for the tip. I think I had already figured that out and I'm trying to prepare the labels before training, but I didn't think about masking the input prompt. I will update you in case that works!
Do you have any other information about the input? In this repository I cannot find anything useful.
How should I format the sentences in input before passing them to the `Processor`? My concern is: should I add the `<grounding>`, `<object>`, `<patch_>` and other tags before processing the sentence? For both the prompt and the labels?
Hi again.
`Kosmos2Processor` will behave differently depending on the inputs you pass to it.

> should I add the `<grounding>`, `<object>`, `<patch_>` and other tags

If you don't pass `bboxes` to `Kosmos2Processor`, then yes, you need to add `<object>`, `<patch_>` and the `delimiter_of_multi_objects` token to the text yourself. Otherwise `Kosmos2Processor` can create them by looking at the `bboxes` argument.
You can choose either way to go. But in both cases you need to have the `bboxes` available: the difference is whether you pass them as an argument or add them directly into the text (and then don't pass them to the processor). If you decide to add them into the text manually, see the sketch below.
Tags like `<grounding>` and `</phrase>` should always be specified by the user.
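For illustration only, a manually tagged input text could look roughly like this; the patch index values are made up and just follow the `<patch_index_XXXX>` pattern shown on the Kosmos-2 model card:

```python
# Phrase and location tokens written into the text by hand
# (the patch indices here are placeholders, not real coordinates):
text = (
    "<grounding> Go to<phrase> the red car</phrase>"
    "<object><patch_index_0044><patch_index_0863></object>"
)
```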
The `images` argument is used to prepare the image inputs (as a tensor), but it also adds the leading token ids to `input_ids` that act as placeholders marking the positions reserved for images. It's best to pass this argument and let `Kosmos2Processor` do this part for you.
IMPORTANT: pay attention to `add_eos_token`: it defaults to `False` (which is what you want for inference), but for training you need to set it to `True` (so you get the eos token id in the input).
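A rough sketch of a processor call along these lines, letting the processor insert the location tokens from `bboxes` (the image path and box coordinates are placeholders; the box is a made-up normalized `(x1, y1, x2, y2)`):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("red_car.jpg")                        # placeholder image
text = "<grounding> Go to<phrase> the red car</phrase>"  # phrase tags written by the user

inputs = processor(
    images=image,
    text=text,
    bboxes=[(0.25, 0.40, 0.60, 0.85)],  # made-up normalized box for "the red car"
    add_eos_token=True,                 # needed for training, as noted above
    return_tensors="pt",
)
```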
Thank you @ydshieh for your invaluable help!
As an example, what I want to do is Image Grounded Command Interpretation, i.e. I have an image and a command from a human to a robot. The robot needs to interpret the command and ground any relevant objects in the image by producing the bounding box and referring to it through the image patch tokens.
The input would be '`<grounding> Go to<phrase>the red car</phrase>`' coupled with an image where there's a red car. The output should be the interpretation with the patches that refer to the bounding boxes of the car, something like '`Motion(Goal(<object><patch_indexes_here></object>))`'.
Now, given your directions, I should format the target as '`Motion(Goal(<object><patch_indexes_here></object>))`'. I have stored the bounding boxes of the car in the image, but the patch tokens should appear in both the prompt (as I am specifying the `<phrase>`) and the labels (for the patch indexes)! Is there any way to do this with the `Processor` of Kosmos?
The `Kosmos2Processor` processing logic requires the `<object><patch_indexes_here></object>` part to come immediately after `<phrase> xxx </phrase>`, so if you want to have extra stuff like `Motion(Goal(`, you will have to make some modifications (I know, it's not easy).
In terms of prompt and labels, you don't really need to separate them. Just use the whole text (well, make a copy of the `input_ids` field) and set the positions corresponding to non-labels to `-100`.
Hi @ydshieh I figured out how to change the Kosmos processor in order to add those extra tokens, and I started fine-tuning the model on my downstream task. But there is a problem with the bounding boxes the model generates: they are totally off!
So I was wondering: is the Visual Model fine-tuned as well during this procedure, or just the Language Model? The text the model generates is consistent with what I expect, i.e. the ground truth, but the Visual Encoder doesn't seem aligned.
Do you have any considerations? Answers? Tips?
Thank you again.
Hi @basteran
The whole model is finetuned unless you do something particular to freeze the vision encoder part of it. (although, I would say, in your case, it might make more sense to freeze the vision encoder part)
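If you decide to freeze it, a rough sketch (assuming the vision tower is exposed as `vision_model`, which may differ across transformers versions):

```python
from transformers import Kosmos2ForConditionalGeneration

model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

# Freeze the vision encoder so only the language side (and projection) gets updated
for param in model.vision_model.parameters():
    param.requires_grad = False
```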
You can use a small subset of (your) dataset, say 100 or 1000 examples, finetune the model (on the data prepared by your modified Kosmos2 processor), try to fit as much as possible, and see if the model can give the expected results during inference on the trained examples. If the model can't even give the desired results, then there is something wrong.
This question is best discussed on the HF forum. From my side, I don't have a clear answer other than the above suggestion.
> The whole model is finetuned unless you do something particular to freeze the vision encoder part of it. (although, I would say, in your case, it might make more sense to freeze the vision encoder part)
That's what I thought, but it doesn't make sense: the bounding boxes are totally off or land on blank areas...
> You can use a small subset of (your) dataset, say 100 or 1000 examples, finetune the model (on the data prepared by your modified Kosmos2 processor), try to fit as much as possible, and see if the model can give the expected results during inference on the trained examples. If the model can't even give the desired results, then there is something wrong.
Thanks for the suggestion, I already tried that and the model didn't give the desired results. This made me think that maybe the Visual Encoder wasn't being fine-tuned at all.
> This question is best discussed on the HF forum. From my side, I don't have a clear answer other than the above suggestion.
Well, thank you for your support until now. Do you have any plans on releasing a working example for fine-tuning the Kosmos-2 model on downstream tasks?
> the Visual Encoder wasn't being fine-tuned at all
If you would like, you can actually compare the original checkpoint weights vs. the fine-tuned checkpoint. This way, you will know for sure.
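A quick sketch of such a comparison (the fine-tuned checkpoint path is a placeholder):

```python
import torch
from transformers import Kosmos2ForConditionalGeneration

base = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")
tuned = Kosmos2ForConditionalGeneration.from_pretrained("path/to/your/finetuned/checkpoint")

base_sd = base.state_dict()
tuned_sd = tuned.state_dict()

# Print which vision-encoder parameters actually changed during fine-tuning
for name, p_base in base_sd.items():
    if "vision" in name and not torch.allclose(p_base, tuned_sd[name]):
        print(f"{name} changed")
```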
> Do you have any plans on releasing a working example for fine-tuning the Kosmos-2 model on downstream tasks?
Not in the (short-term) plan. That being said, if you can provide a Google Colab showing your training (of course, with a very small dataset you prepared; even 10 examples is enough), I would be happy to take a look.
@ydshieh I added you to a private repository containing my notebook and a sample of data to run the code. Let me know what you think and how we can communicate!
Hi basteran, I am also working on fine-tuning KOSMOS-2 and running into the same issues as you. I think one reason is that KOSMOS-2 is a causal model, so you don't need to specify the 'labels' for the model; the trainer will make the mask automatically. I am also looking forward to a tutorial for fine-tuning KOSMOS-2, @ydshieh, if it would be convenient. Thanks a lot!
Hi, that's true, but the causal masking is used during the unsupervised (pre-)training of the model. If you want to perform classical supervised fine-tuning (i.e. with input-output pairs), you need to specify the labels and set the masking; otherwise, the model will learn to generate part of the input as well.
Hi @basteran, could you share your fine-tuning script, if possible? I am getting the same error as you:

```
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.
```

I am trying to fine-tune the model using `Kosmos2ForConditionalGeneration`.
If you are having an issue with bbox offsets, refer to https://discuss.huggingface.co/t/issue-with-kosmos-2-encoding-and-decoding/70019/2
Hi @mit1280, @ydshieh, I am sharing here with you a Notebook for the training. I tried to explain my task and what I changed in the original code:
Unfortunately, I cannot add any example of the data and thus the code will not run correctly. Let me know your thoughts @ydshieh
Thank you again.
Thanks @basteran for sharing your work. I think it would be great if we moved to a Hugging Face discussion; I think we will get more input there.
The code looks almost identical to mine except for the dataset creation. I think we need to create a custom loss function like this:
```python
import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        logits = outputs.logits
        # Shift logits so that position t predicts token t + 1
        predicted_logits = logits[:, :-1].contiguous()
        # Flatten logits and shifted labels for loss computation
        logits_flat = predicted_logits.view(-1, predicted_logits.size(-1))
        labels_flat = inputs["input_ids"][:, 1:].contiguous().view(-1)  # assuming input_ids contains the target sequence
        # Cross-entropy loss over the vocabulary
        loss = F.cross_entropy(logits_flat, labels_flat)
        return (loss, outputs) if return_outputs else loss
```
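A hypothetical way to plug it in (the `model`, `training_args`, `train_dataset` and `data_collator` variables are placeholders for your own setup):

```python
trainer = CustomTrainer(
    model=model,                  # e.g. a Kosmos2ForConditionalGeneration instance
    args=training_args,           # your TrainingArguments
    train_dataset=train_dataset,  # dataset prepared with the (modified) Kosmos2Processor
    data_collator=data_collator,  # collator that pads/stacks the processor outputs
)
trainer.train()
```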
@mit1280 please mention me on the Hugging Face discussion or send me a link.
@basteran there is none right now. I will create one and tag you there.
@basteran here you go https://discuss.huggingface.co/t/kosmos-fine-tuning/75691
See my response on https://discuss.huggingface.co/t/kosmos-fine-tuning/75691 (essentially, I need a code snippet or notebook that could run easily to see the issue)
> See my response on https://discuss.huggingface.co/t/kosmos-fine-tuning/75691 (essentially, I need a code snippet or notebook that could run easily to see the issue)
I can add you to a private repository with the Notebook and the data (I don't know how to share it with you otherwise)
Hello everyone, thank you very much for your contribution. I appreciate the effort and consistency in uploading the code for so many models and maintaining this repository.
I saw Kosmos-2 and quickly thought I could fine-tune it on my downstream task, but I couldn't find any example of how to do it. I see there is a little "guide" here for training the model, but I don't know whether it refers to pre-training or further fine-tuning; I'm interested in the latter.
So I tried to implement it myself using the `transformers` library, but I'm getting errors about the data. Here is my environment:
```
accelerate==0.25.0 ai2thor==5.0.0 aiofiles==23.2.1 aiohttp==3.9.1 aiosignal==1.3.1 altair==5.2.0 annotated-types==0.6.0 antlr4-python3-runtime==4.8 anyio==4.2.0 apex @ file:///home/user/unilm/kosmos-2/apex async-timeout==4.0.3 attrs==23.2.0 aws-requests-auth==0.4.3 bitarray==2.9.2 blinker==1.7.0 blis==0.7.11 botocore==1.34.12 canonicaljson==2.0.0 catalogue==2.0.10 certifi==2023.11.17 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 confection==0.1.4 contourpy==1.2.0 cryptography==41.0.7 cycler==0.12.1 cymem==2.0.8 Cython==3.0.7 datasets==2.16.1 decorator==4.4.2 deepspeed @ git+https://github.com/microsoft/DeepSpeed.git@165739a508431c9d05a456ca68535edf599cc51f Deprecated==1.2.14 dill==0.3.7 exceptiongroup==1.2.0 fairscale==0.4.0 fairseq @ file:///home/user/unilm/kosmos-2/fairseq fastapi==0.108.0 ffmpy==0.3.1 filelock==3.13.1 Flask==3.0.0 fonttools==4.47.0 frozenlist==1.4.1 fsspec==2023.10.0 ftfy==6.1.3 gradio==3.37.0 gradio_client==0.8.0 h11==0.14.0 httpcore==0.17.3 httpx==0.25.1 huggingface-hub==0.20.2 hydra-core==1.0.7 idna==3.6 imageio==2.33.1 imageio-ffmpeg==0.4.9 infinibatch @ file:///home/user/unilm/kosmos-2/infinibatch itsdangerous==2.1.2 Jinja2==3.1.2 jmespath==1.0.1 jsonschema==4.20.0 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 langcodes==3.3.0 linkify-it-py==2.0.2 lxml==5.1.0 markdown-it-py==2.2.0 MarkupSafe==2.1.3 matplotlib==3.8.2 mdit-py-plugins==0.3.3 mdurl==0.1.2 moviepy==1.0.3 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 murmurhash==1.0.10 natsort==8.4.0 networkx==3.2.1 ninja==1.11.1.1 numpy==1.23.0 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 omegaconf==2.0.6 open-clip-torch @ file:///home/user/unilm/kosmos-2/open_clip opencv-python==4.9.0.80 opencv-python-headless==4.8.0.74 orjson==3.9.10 packaging==23.2 pandas==2.1.4 pathy==0.10.3 pillow==10.2.0 portalocker==2.8.2 preshed==3.0.9 prior==1.0.3 proglog==0.1.10 progressbar2==4.3.2 protobuf==3.20.3 psutil==5.9.7 pyarrow==14.0.2 pyarrow-hotfix==0.6 pycparser==2.21 pydantic==1.10.11 pydantic_core==2.14.6 pydub==0.25.1 PyGithub==2.1.1 PyJWT==2.8.0 PyNaCl==1.5.0 pyparsing==3.1.1 python-dateutil==2.8.2 python-dotenv==1.0.0 python-fcl==0.7.0.5 python-multipart==0.0.6 python-sat==0.1.8.dev12 python-utils==3.8.1 python-xlib==0.33 pytz==2023.3.post1 PyYAML==6.0.1 referencing==0.32.1 regex==2023.12.25 requests==2.31.0 rpds-py==0.16.2 sacrebleu==2.4.0 safetensors==0.4.1 scipy==1.8.0 semantic-version==2.10.0 sentencepiece==0.1.99 shapely==2.0.2 six==1.16.0 smart-open==6.4.0 sniffio==1.3.0 spacy==3.6.0 spacy-legacy==3.0.12 spacy-loggers==1.0.5 srsly==2.4.8 starlette==0.32.0.post1 tabulate==0.9.0 tensorboardX==1.8 thinc==8.1.10 tiktoken==0.5.2 timm==0.4.12 tokenizers==0.15.0 toolz==0.12.0 torch==1.13.0 torchscale @ file:///home/user/unilm/kosmos-2/torchscale torchvision==0.14.0 tqdm==4.66.1 transformers==4.36.2 trimesh==4.0.8 triton==2.1.0 typer==0.9.0 typing_extensions==4.9.0 tzdata==2023.4 uc-micro-py==1.0.2 urllib3==2.0.7 uvicorn==0.25.0 wasabi==1.1.2 wcwidth==0.2.13 websockets==11.0.3 Werkzeug==3.0.1 wrapt==1.16.0 xxhash==3.4.1 yarl==1.9.4
```
I paste here my code:
and the resulting errors:
I can't figure out the issue. It says that the model did not return a loss, which means it didn't compute it.
Can anyone help me? I tried to look for solutions online, but I found nothing useful. @ydshieh @donglixp @pengzhiliang @pineking
Thank you in advance.