microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Phi3 Vision models feedback and questions #571

Open AshD opened 1 month ago

AshD commented 1 month ago

The Phi3 vision model is excellent and does a great job at extracting text. I am using the CPU version of the model via the C# DirectML package.

  1. What is the max image file size in KB that can be sent to the model? I saw an image resolution of 1366x1366 mentioned, but could not find the max file size anywhere.

  2. We are waiting on the DirectML version of the ONNX model. The CPU model is slow even when using the DirectML library on a 4090.

  3. We have a client Windows app. Is it possible to have a single NuGet package with CUDA, DirectML, and CPU support? That way, whichever version of the model the user has, the library can work with it.

  4. Does the onnxruntime_genai.models.builder support vision models now?

Thanks, Ash

natke commented 1 month ago

Hi @AshD! Great questions - we will get back to you with answers soon

baijumeswani commented 1 month ago
  1. What is the max image file size in KB that can be sent to the model? I saw an image resolution of 1366x1366 mentioned, but could not find the max file size anywhere.

From what I can tell, there shouldn't be any limitation on the size of the image you can pass in. The images passed in are represented as placeholder tokens in the input_ids, so the bigger the image, the longer the input_ids. This can be a limiting factor when running the model, as the size of the input_ids must not exceed the context length of the model.

But there are no size limitations enforced by the code in ort-genai.
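For readers following along, here is a minimal C# sketch of how an image and prompt flow through the multimodal processor, loosely modeled on the Phi-3 vision C# sample in this repo. The model path, image path, prompt text, and max_length value are placeholder assumptions, and exact API names can differ between releases:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class VisionSample
{
    static void Main()
    {
        // Placeholder paths: point these at the downloaded Phi-3 vision ONNX model and an image.
        string modelPath = @"C:\models\phi-3-vision-cpu";
        string imagePath = @"C:\images\invoice.png";

        using var model = new Model(modelPath);
        using var processor = new MultiModalProcessor(model);
        using var tokenizerStream = processor.CreateStream();

        // The <|image_1|> placeholder is expanded by the processor into image tokens,
        // so a bigger image means longer input_ids and more of the context budget used.
        string prompt = "<|user|>\n<|image_1|>\nWhat text is in this image?<|end|>\n<|assistant|>\n";

        var images = Images.Load(imagePath);
        var inputs = processor.ProcessImages(prompt, images);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 3072); // must stay within the model's context length
        generatorParams.SetInputs(inputs);

        using var generator = new Generator(model, generatorParams);
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
            var sequence = generator.GetSequence(0);
            Console.Write(tokenizerStream.Decode(sequence[^1]));
        }
    }
}
```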

  2. We are waiting on the DirectML version of the ONNX model. The CPU model is slow even when using the DirectML library on a 4090.

We hope to have this published soon. We encountered some memory issues when running the DML model for phi3-vision and needed to spend some more time ironing them out. But I am hopeful that the models and the ort-genai package can be published sometime next week.

  3. We have a client Windows app. Is it possible to have a single NuGet package with CUDA, DirectML, and CPU support? That way, whichever version of the model the user has, the library can work with it.

This is a difficult problem to solve, and for now we do not have this support. There are several problems that we would need to iron out first.

I am not sure if and when we can achieve this. But I'll add it to our wish list.

  4. Does the onnxruntime_genai.models.builder support vision models now?

As of now, we had to manually create export scripts to convert the model to ONNX; the three ONNX components (the vision encoder, the text embedding, and the text decoder) are each exported separately.

We are looking to provide a way to automate this process. I am not sure whether it will live in the model builder or in another script we provide to users, but this is something we plan to do.

AshD commented 1 month ago

Thanks @baijumeswani

  1. Our app, Fusion Quill, reduces the size of the image to 100 KB, and we found that worked well except in cases where there was fine print on the image and extraction failed. Once we have the DML model, we can do some benchmarking on image size versus the time taken to extract information.
  2. I noticed the DirectML library has issues with the full 128K context of the Phi-3 model. Would love to see this fixed by falling back to CPU memory when GPU memory is not enough.
  3. Our app also uses llama.cpp, and the way we resolve this is by detecting which variant (CPU-AVX, CUDA) of llama.dll is needed and calling the Windows LoadLibrary function with that DLL path (a rough sketch of this approach is below).
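For anyone facing the same multi-backend packaging problem, here is a minimal C# sketch of that LoadLibrary approach. The folder layout and the CUDA-detection heuristic (checking for nvcuda.dll) are illustrative assumptions, not anything shipped by ort-genai or llama.cpp:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class NativeBackendLoader
{
    // Standard Win32 LoadLibrary via P/Invoke; returns IntPtr.Zero on failure.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    // Picks which native DLL variant to load at runtime.
    // The directory names ("cuda", "cpu-avx") are placeholders for however
    // the app lays out its bundled binaries.
    public static void LoadLlamaDll(string baseDir)
    {
        // Crude CUDA check: the NVIDIA driver installs nvcuda.dll into System32.
        bool hasCuda = File.Exists(Path.Combine(Environment.SystemDirectory, "nvcuda.dll"));
        string variant = hasCuda ? "cuda" : "cpu-avx";
        string dllPath = Path.Combine(baseDir, variant, "llama.dll");

        if (LoadLibrary(dllPath) == IntPtr.Zero)
        {
            throw new DllNotFoundException(
                $"Failed to load {dllPath} (Win32 error {Marshal.GetLastWin32Error()}).");
        }
    }
}
```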
natke commented 4 weeks ago

Hi @AshD, does that answer your questions for now?

AshD commented 3 weeks ago

Found one major issue with the Phi-3 vision model. When I send a simple message without an image, it returns a </s>.

<|user|>
 Hola <|end|>
<|assistant|>

I expected the vision model to behave like the regular phi-3-mini if no image is sent with the prompt.
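For clarity, the two prompt shapes being compared are roughly the following (illustrative C# strings; the <|image_1|> placeholder is only present when an image is attached and is expanded into image tokens by the multimodal processor):

```csharp
// Text-only prompt, as used above.
string textOnlyPrompt = "<|user|>\nHola<|end|>\n<|assistant|>\n";

// Prompt with an image attached; the processor replaces <|image_1|> with image tokens.
string promptWithImage = "<|user|>\n<|image_1|>\nWhat is shown in this image?<|end|>\n<|assistant|>\n";
```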

baijumeswani commented 3 weeks ago

@AshD are you using phi3v.py? If you do not provide any image, it should just behave like the phi3 language model. What prompt did you use?

AshD commented 3 weeks ago

I am using the .NET DirectML NuGet 0.3.0-rc2 and the CPU phi-3 vision model. This works with the phi-3 mini model.

The prompt was:

<|user|>
 Hola <|end|>
<|assistant|>
AshD commented 3 weeks ago

The weird thing is that if I change the prompt to make it generate JSON, it generates some JSON:

<|user|>
You are chat bot that responds in JSON format.
 Hola <|end|>
<|assistant|>
baijumeswani commented 3 weeks ago

@AshD are you using phi3v.py? If you do not provide any image, it should just behave like the phi3 language model. What prompt did you use?

I need to make a small correction here. Although the text model is the same as the phi3 text model when no image is provided, the processor uses a different tokenizer. I believe the response discrepancy comes from the tokenizer being different between the phi3 mini language model and the phi3 vision model.

AshD commented 3 weeks ago

Can this be fixed? We like this model :-)

baijumeswani commented 3 weeks ago

Can this be fixed? We like this model :-)

OK, I looked into it a bit. It seems like the text model in phi3-vision has different weights compared to the text model in phi3 mini 128k. This is where the difference comes from. Since the model weights come from the original Hugging Face PyTorch model, there isn't much we can do to address this.

AshD commented 3 weeks ago

Is this a problem with the transformers version of the model too? If so, can you report it to them? The expectation is that a multi-modal model would handle both text and images, or text only.

Thanks, Ash

SloDamn commented 6 days ago

Hello @baijumeswani, any updates on DirectML support for Phi-3 Vision?

baijumeswani commented 5 days ago

@SloDamn our DML package already supports the phi3 vision model, but we have not made the DML variant of the model public since it has not been RAI validated yet. We hope to make the DML phi3 vision model public this week.

sridhar21111976 commented 2 days ago

I am trying the phi3 vision ONNX model for local inference with image and prompt input (no GPU hardware). The response generation takes a long time. I am on an i7 10-core HP laptop with 16 GB of memory.

I tried with different image resolutions and better hardware, with no change. Any recommendations? Each response takes more than 2-3 minutes.