Is there any plan to add Kosmos-2 to transformers?

BIGBALLON commented 1 year ago

Model description

Kosmos-2 is a grounded multimodal large language model, which integrates grounding and referring capabilities compared with Kosmos-1. The model can accept image regions selected by the user using bounding boxes as input, provide visual answers (i.e., bounding boxes), and ground the text output to the visual world.

Is there any plan to add this model to transformers?

Open source status

[X] The model implementation is available
[X] The model weights are available

Provide useful links for the implementation

Code: https://github.com/microsoft/unilm/tree/master/kosmos-2
Paper: https://arxiv.org/abs/2306.14824
Weight: the checkpoint can be downloaded from here
VQA demo: here

ydshieh commented 1 year ago

Thank you for mentioning this :-). There is some early discussion within the team. I will come back to you once we have some decision.

ydshieh commented 1 year ago

This is tracked in PR #24709. (so far empty, but I will try to 🚀 )

BIGBALLON commented 1 year ago

@ydshieh I'm very excited to hear this news. I sincerely appreciate your efforts.

Rajmehta123 commented 1 year ago

any updates?

ydshieh commented 1 year ago

Still on it (slowly) 🤗

Rajmehta123 commented 1 year ago

Sure. Thank you. Appreciate those efforts.

yolandalalala commented 1 year ago

I just want to say a big thank you for your effort @ydshieh! Looking forward to it.

BIGBALLON commented 1 year ago

@Rajmehta123 @yolandalalala @vanpelt

This project is available for everyone to try. I hope it helps.

ydshieh commented 1 year ago

Very nice! @BIGBALLON Thanks a lot!

BIGBALLON commented 1 year ago

@ydshieh Thank you again for your great contribution!

hujunchao commented 1 year ago

Amazing! @BIGBALLON Thanks a lot!

ydshieh commented 1 year ago

Just want to give an update: I am almost done with the coding - I just need to put everything together to finalize.

(The model might end up as custom code on the Hub instead of being directly available in transformers - I am not sure)

BIGBALLON commented 1 year ago

Hi, @ydshieh , is there any update? 😄

ydshieh commented 1 year ago

Hi @BIGBALLON

Sorry for the delay, which was due to some internal tasks! I will make it available on the Hub this week 🔥 . (And we will see later whether it goes directly into transformers.)

ydshieh commented 1 year ago

Hi @BIGBALLON @yolandalalala @Rajmehta123

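As promised, I put the code in this Hugging Face Hub repository: ydshieh/kosmos-2-patch14-224. You can use it as shown in the code snippet at the end. It will give something like the following (when specifying cleanup_and_extract=False to post_process_generation):

<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>.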

This means:

A text description: An image of a snowman warming himself by a fire.

and 2 objects:

a snowman: position 44-863
a fire: position 5-911
(positions are given as patch indices)
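
To make the mapping from the raw generation to "phrase + patch indices" concrete, here is a minimal sketch using a regular expression. It is not the processor's actual implementation: the names `ENTITY_PATTERN` and `extract_phrases` are made up for illustration, and the pattern only handles the case of one box per phrase, as in the snowman example.

```python
import re

# Raw generation copied from the example in this thread.
RAW = (
    "<grounding> An image of<phrase> a snowman</phrase>"
    "<object><patch_index_0044><patch_index_0863></object>"
    " warming himself by<phrase> a fire</phrase>"
    "<object><patch_index_0005><patch_index_0911></object>."
)

# Hypothetical pattern: one phrase followed by one pair of patch indices
# (upper-left patch, lower-right patch).
ENTITY_PATTERN = re.compile(
    r"<phrase>(.*?)</phrase>"
    r"<object><patch_index_(\d+)><patch_index_(\d+)></object>"
)


def extract_phrases(raw_text):
    """Return (phrase, upper_left_index, lower_right_index) tuples."""
    return [
        (phrase.strip(), int(ul), int(lr))
        for phrase, ul, lr in ENTITY_PATTERN.findall(raw_text)
    ]


print(extract_phrases(RAW))
# [('a snowman', 44, 863), ('a fire', 5, 911)]
```

In practice `post_process_generation` already does this extraction; the sketch is only meant to show where the numbers 44-863 and 5-911 come from.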

This information is given (with the default value cleanup_and_extract=True for post_process_generation) as:

clean text: An image of a snowman warming himself by a fire.
entities: [('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])] (the patch indices are converted to coordinates)
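
For readers wondering how a pair of patch indices becomes a box, the sketch below reproduces the numbers above. It rests on two assumptions inferred from the values in this comment rather than taken from the model code: the image is divided into a 32 x 32 grid of patches numbered row by row, and each corner is placed at the center of its patch. The function name `patch_pair_to_box` is hypothetical, and the real processor may treat edge cases differently.

```python
def patch_pair_to_box(ul_idx, lr_idx, patches_per_side=32):
    """Map an (upper-left, lower-right) patch index pair to (x1, y1, x2, y2).

    Assumes a square grid numbered row by row, and uses the center of each
    patch as the corner coordinate (normalized to the range 0..1).
    """

    def center(idx):
        row, col = divmod(idx, patches_per_side)
        return (col + 0.5) / patches_per_side, (row + 0.5) / patches_per_side

    x1, y1 = center(ul_idx)
    x2, y2 = center(lr_idx)
    return (x1, y1, x2, y2)


print(patch_pair_to_box(44, 863))  # (0.390625, 0.046875, 0.984375, 0.828125)
print(patch_pair_to_box(5, 911))   # (0.171875, 0.015625, 0.484375, 0.890625)
```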

~~I will provide a more complete post-processing function though~~ .

Please share your feedback on this (remote) model 🙏 ❤️ !

Note that if this model is added to the transformers codebase, there might be some changes, and I cannot guarantee that they won't break the current behavior.

Example

import requests

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ydshieh/kosmos-2-patch14-224", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("ydshieh/kosmos-2-patch14-224", trust_remote_code=True)

prompt = "<grounding>An image of"

url = "https://huggingface.co/ydshieh/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The original Kosmos-2 demo saves the image first and then reloads it. For some images, this gives a slightly different image input and changes the generation outputs.
# Uncomment the following 2 lines if you want to match the original demo's outputs.
# (One example is the `two_dogs.jpg` from the demo)
# image.save("new_image.jpg")
# image = Image.open("new_image.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"][:, :-1],
    attention_mask=inputs["attention_mask"][:, :-1],
    img_features=None,
    img_attn_mask=inputs["img_attn_mask"][:, :-1],
    use_cache=True,
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Specify `cleanup_and_extract=False` in order to see the raw model generation.
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)

print(processed_text)
# `<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>.`

# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)

print(processed_text)
# `An image of a snowman warming himself by a fire.`

print(entities)
# `[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])]`

Draw the bounding boxes of the entities on the image

Once you have the entities, you can use this helper function to draw their bounding boxes on the image.
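
The linked helper is not reproduced here. As a rough sketch of the step any such helper needs, the code below scales the normalized boxes in `entities` to pixel coordinates. The function name `entities_to_pixel_boxes` and the 640 x 640 image size are assumptions for illustration only.

```python
def entities_to_pixel_boxes(entities, width, height):
    """Convert normalized (x1, y1, x2, y2) boxes to integer pixel boxes."""
    pixel_boxes = []
    for name, _span, boxes in entities:
        for x1, y1, x2, y2 in boxes:
            pixel_boxes.append(
                (
                    name,
                    (
                        round(x1 * width),
                        round(y1 * height),
                        round(x2 * width),
                        round(y2 * height),
                    ),
                )
            )
    return pixel_boxes


# Entities copied from the example output in this thread.
entities = [
    ("a snowman", (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]),
    ("a fire", (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)]),
]

# Hypothetical image size; use image.size for a real PIL image.
for name, box in entities_to_pixel_boxes(entities, width=640, height=640):
    print(name, box)
# a snowman (250, 30, 630, 530)
# a fire (110, 10, 310, 570)
```

With Pillow, each pixel box could then be passed to `ImageDraw.Draw(image).rectangle(box, outline="red")` to draw it.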

[Image: annotated_snowman]

Rajmehta123 commented 1 year ago

Amazing work. Thank you.

Rajmehta123 commented 1 year ago

Can this model be used for Q&A?

ydshieh commented 1 year ago

I am also trying to see what this model can do - in the paper, it can do more things than what the demo demonstrates

BIGBALLON commented 1 year ago

@ydshieh thanks again for your effort!!!

@Rajmehta123 VQA is supported; you only need to change the prompt.

ydshieh commented 1 year ago

@BIGBALLON

Yes, this model seems to be capable of doing quite different things, but it's challenging to show this in a demo. I am still looking at what I can add, but please share your ideas too if you have any 🙏

BIGBALLON commented 1 year ago

Hi, @ydshieh ,

Here are some suggestions: for the Gradio app demo, we can provide two outputs, Text and Image.

For the Visual Grounding task: use <grounding> with your grounding question as the prompt, and show the output image -> Image
For the VQA task: do not use <grounding>, and only describe your question, then show the output text -> Text

The key point is the prompt. Check this for more details: https://github.com/BIGBALLON/kosmos-2-gd
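
The prompt switch described above can be sketched as a tiny helper. This is only an illustration: `build_prompt` is a hypothetical name, and the "Question: ... Answer:" wording of the VQA prompt is an assumption, not something confirmed in this thread.

```python
GROUNDING_TAG = "<grounding>"


def build_prompt(text, grounding=True):
    """Prefix the prompt with <grounding> when grounded output is wanted."""
    return f"{GROUNDING_TAG}{text}" if grounding else text


# Grounded captioning, same prompt as in the example earlier in this thread.
print(build_prompt("An image of"))
# <grounding>An image of

# Plain VQA-style prompt without grounding (question format is an assumption).
print(build_prompt("Question: What is in the image? Answer:", grounding=False))
# Question: What is in the image? Answer:
```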

ydshieh commented 1 year ago

Hi everyone!

I put a Tasks section in the README.md file

https://huggingface.co/ydshieh/kosmos-2-patch14-224/blob/main/README.md#tasks

@BIGBALLON For VQA, I also use <grounding>, just like the official demo for image captioning uses <grounding>. It works well, though.

rabiulcste commented 11 months ago

@ydshieh Any plan for supporting beam search in text-generation?

ydshieh commented 11 months ago

Beam search has been supported in text-generation for a long time. For this model, the default beam size is 3.

rabiulcste commented 11 months ago

I get this error while trying to use num_beams = 3. If I set use_cache=False, the error goes away, but generation becomes 50x slower.

NotImplementedError: Make sure that a _reorder_cache function is correctly implemented in transformers_modules.ydshieh.kosmos-2-patch14-224.48e3edebaeb02dc9fe105f40e85a43a3b440dc72.modeling_kosmos2 to enable beam search for <class 'transformers_modules.ydshieh.kosmos-2-patch14-224.48e3edebaeb02dc9fe105f40e85a43a3b440dc72.modeling_kosmos2.Kosmos2TextForCausalLM'>

The full args for generation below.

if model_name == "kosmos2":
    generated_ids = model.generate(
        pixel_values=batch["pixel_values"].to("cuda"),
        input_ids=batch["input_ids"][:, :-1].to("cuda"),
        attention_mask=batch["attention_mask"][:, :-1].to("cuda"),
        img_features=None,
        img_attn_mask=batch["img_attn_mask"][:, :-1].to("cuda"),
        max_new_tokens=args.max_length,
        length_penalty=args.length_penalty,
        num_beams=args.num_beams,
    )

rabiulcste commented 11 months ago

I copied and pasted this snippet from llama into kosmos2. It now works fine.

https://github.com/huggingface/transformers/blob/62b20c9ecd6c9d2295265187a51ba0ea74ce046c/src/transformers/models/llama/modeling_llama.py#L901
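
The linked llama method is not reproduced here. As a rough sketch of what a `_reorder_cache` hook generally has to do during beam search, the code below reorders every cached state along the batch/beam dimension according to `beam_idx`. It uses plain Python lists as stand-ins for tensors so that it runs without torch; with real tensors, the inner list comprehension would typically be something like `state.index_select(0, beam_idx)`. The function name and the toy cache are illustrative only.

```python
def reorder_cache(past_key_values, beam_idx):
    """Reorder every cached state along the batch/beam dimension.

    past_key_values: tuple of layers, each a tuple of states (e.g. key, value),
    where each state is indexed by beam along its first dimension.
    beam_idx: for each new beam, the index of the old beam it continues from.
    """
    return tuple(
        tuple([state[i] for i in beam_idx] for state in layer_past)
        for layer_past in past_key_values
    )


# Toy cache: 1 layer, (key, value), 3 beams, one string per beam.
past = ((["k0", "k1", "k2"], ["v0", "v1", "v2"]),)

# New beams 0 and 1 both continue old beam 2; new beam 2 continues old beam 0.
print(reorder_cache(past, [2, 2, 0]))
# ((['k2', 'k2', 'k0'], ['v2', 'v2', 'v0']),)
```

Without this reordering, a cached key/value row could be paired with the wrong beam after the beams are re-ranked, which is why generation refuses to run beam search with `use_cache=True` when the hook is missing.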

ydshieh commented 11 months ago

Thank you for reporting. I will take a look, kinda strange here. Note there will be an official port in transformers soon, and my personal code on the Hub won't be the best place to use this model.

rabiulcste commented 11 months ago

Thanks, I will wait for the official support then!

rabiulcste commented 10 months ago

I'm getting an error on batch inference.

model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, torch_dtype=autocast_dtype, device_map="auto")
processor = AutoProcessor.from_pretrained(model_name_or_path)
processor.tokenizer.padding_side = "left"

----
encoding = processor(
    images=processed_batch["image"],
    text=processed_batch["prompted_question"],
    padding=True,
    return_tensors="pt",
)

  File "/path/site-packages/transformers/tokenization_utils_base.py", line 720, in as_tensor
    return torch.tensor(value)
ValueError: expected sequence of length 90 at dim 1 (got 89)

The above exception was the direct cause of the following exception:
-------
 raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
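
The traceback says that rows of different lengths (90 vs 89) could not be stacked into one tensor. This does not diagnose why padding was not applied in this case, but as a minimal sketch of what left-padding a batch is expected to produce, consider the following. The function name `left_pad` and the pad id of 0 are assumptions for illustration.

```python
def left_pad(sequences, pad_id):
    """Left-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes on the left so the real tokens end at the same position.
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

Every per-token field returned by the processor would need to be padded to the same length this way before the batch can be converted to tensors.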

ydshieh commented 10 months ago

Hi @rabiulcste Thank you for reporting.

Could you provide a complete code snippet? There are missing variable definitions above and I can't run it directly. Thank you!
