Real image input example

kyegomez / ScreenAI

Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding"

MIT License

Real image input example #4

Status: Closed (JamshedAlamQaderi closed 5 months ago)

JamshedAlamQaderi commented 5 months ago

Hello @kyegomez,

Thank you so much for this awesome repo. I'm very excited to test this project. I tried the example code, but it gives me the error below:

SyntaxError: Non-UTF-8 code starting with '\xff' in file C:\Users\alamj\Downloads\screenai.py on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details

Could you provide a real example that takes an input image and text, converts them to tensors, and feeds them to the model? I really want to try it out.

Thank you!
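
A note on the error itself: the byte '\xff' at the start of a file is the first byte of a UTF-16 little-endian byte order mark, so one likely cause (an assumption, since the thread never confirms it) is that screenai.py was saved as UTF-16, for example by redirecting output in Windows PowerShell. A minimal sketch of re-saving such a file as UTF-8 follows; the function name `reencode_to_utf8` is hypothetical.

```python
import codecs
from pathlib import Path


def reencode_to_utf8(path: str) -> bool:
    """Rewrite a UTF-16 file (detected by its BOM) as UTF-8.

    Returns True if the file was rewritten, False if no UTF-16 BOM was found.
    """
    file = Path(path)
    raw = file.read_bytes()
    # UTF-16 files normally start with a byte order mark: FF FE (LE) or FE FF (BE).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        text = raw.decode("utf-16")
        file.write_bytes(text.encode("utf-8"))
        return True
    return False
```

Calling `reencode_to_utf8("screenai.py")` once and then rerunning the script should be enough if the encoding was the only problem.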

Yingrjimsch commented 5 months ago

Hello,

I was able to run it with an actual image using the following code:

import torch
from torchvision.io import read_image
from screenai.main import ScreenAI

# Load the image as a (C, H, W) tensor, add a batch dimension, and cast to float32
image = read_image('test.png').unsqueeze(0).to(torch.float32)
# Random token IDs as a placeholder for tokenized text, shape (1, 1028)
text = torch.randint(0, 20000, (1, 1028))

# Create an instance of the ScreenAI model with specified parameters
model = ScreenAI(
    num_tokens=20000,
    max_seq_len=1028,
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_ff_mult=4,
)

# Perform forward pass of the model with the given text and image tensors
out = model(text, image)

# Print the output tensor
print(out)

and a test image, which needs to be 224 x 224 pixels to match image_size=224.

Maybe this helps.
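
For images that are not already 224 x 224, a resize step is needed before the forward pass. The following is a minimal sketch, assuming the image has already been loaded as a (C, H, W) tensor (for example with `read_image`); the helper name `to_model_input` is hypothetical and not part of the repo. It stretches the image rather than preserving the aspect ratio, and it does not normalize pixel values.

```python
import torch
import torch.nn.functional as F


def to_model_input(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Turn a (C, H, W) image tensor into a (1, 3, size, size) float32 batch."""
    if image.shape[0] == 1:
        # Grayscale: repeat the single channel three times
        image = image.repeat(3, 1, 1)
    elif image.shape[0] == 4:
        # RGBA: drop the alpha channel
        image = image[:3]
    # Add a batch dimension and cast, since interpolate expects a float 4D tensor
    batch = image.unsqueeze(0).to(torch.float32)
    # Bilinear resize to the square size the model was configured with
    return F.interpolate(batch, size=(size, size), mode="bilinear", align_corners=False)
```

With this helper, `image = to_model_input(read_image('test.png'))` would replace the image-loading line in the code above.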

JamshedAlamQaderi commented 5 months ago

@Yingrjimsch thank you so much for the help. Can you also tell me how to encode prompt text into a tensor, and how to decode the output tensor?

Yingrjimsch commented 5 months ago

Hi @JamshedAlamQaderi, I haven't had time to try that yet, but I would suggest using the Hugging Face Transformers library to find a tokenizer. Run the tokenizer on your input text and set num_tokens and max_seq_len to match the tokenizer's specs. If I have time, I'll try it as well and keep you updated.
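
To illustrate the encode, pad, and decode steps without depending on a particular pretrained tokenizer, here is a minimal sketch using a toy byte-level tokenizer. It is a stand-in, not the Hugging Face API: the class `ByteTokenizer` and its attributes are hypothetical, and a real tokenizer would supply its own vocabulary size and padding rules.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte, plus a pad token."""

    pad_id = 256
    num_tokens = 257  # 256 possible byte values + 1 pad token

    def encode(self, text: str, max_seq_len: int) -> list[int]:
        # Truncate to max_seq_len, then right-pad so every sequence has the same length
        ids = list(text.encode("utf-8"))[:max_seq_len]
        return ids + [self.pad_id] * (max_seq_len - len(ids))

    def decode(self, ids: list[int]) -> str:
        # Drop padding and turn the remaining byte values back into text
        data = bytes(i for i in ids if i != self.pad_id)
        return data.decode("utf-8", errors="replace")
```

Wrapping the result as `torch.tensor([tokenizer.encode(prompt, max_seq_len)])` gives a tensor of shape (1, max_seq_len), and the model would then be built with `num_tokens=ByteTokenizer.num_tokens`. How to turn the model output back into token IDs depends on the output shape, which this thread does not state; taking an argmax over a vocabulary dimension would be one assumption to test.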

Barney-Steven commented 5 months ago

Hi @JamshedAlamQaderi, this repo is not the official implementation. If you look at the definition behind "from screenai.main import ScreenAI", you can see it is a very simple structure. ScreenAI is not open source for now. I found something similar on Hugging Face: try moondream2.

JamshedAlamQaderi commented 5 months ago

Thank you guys for helping me.