microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Phi3 Vision models feedback and questions #571

Open AshD opened 1 month ago

AshD commented 1 month ago

The Phi3 vision model is excellent and does a great job at extracting text. I am using the CPU version of the model via the C# DirectML package.

  1. What is the max image file size in KB that can be sent to the model? I saw an image resolution of 1366x1366 mentioned, but could not find the max file size anywhere.

  2. We are waiting on the DirectML version of the ONNX model. The CPU model is slow even when using the DirectML library on a 4090.

  3. We have a client Windows app. Is it possible to have a single NuGet package with CUDA, DirectML, and CPU support? That way, whichever version of the model the user has, the library can work with it.

  4. Does the onnxruntime_genai.models.builder support vision models now?

Thanks, Ash

natke commented 1 month ago

Hi @AshD! Great questions - we will get back to you with answers soon

baijumeswani commented 1 month ago
  1. What is the max image file size in KB that can be sent to the model? I saw an image resolution of 1366x1366 mentioned, but could not find the max file size anywhere.

From what I can tell, there shouldn't be any limitation on the size of the image you can pass in. The images passed in are represented as placeholder tokens in the input_ids, so the bigger the image, the longer the input_ids. This can be a limiting factor when running the model, as the size of the input_ids must not exceed the context length of the model.

But there are no size limitations enforced by the code in ort-genai.
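For readers following along, here is a minimal C# sketch of how an image and prompt flow through the multimodal processor, loosely modeled on the Phi-3 vision C# sample in this repo. The model path, image path, prompt text, and max_length value are placeholder assumptions, and exact API names can differ between releases:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class VisionSample
{
    static void Main()
    {
        // Placeholder paths: point these at the downloaded Phi-3 vision ONNX model and an image.
        string modelPath = @"C:\models\phi-3-vision-cpu";
        string imagePath = @"C:\images\invoice.png";

        using var model = new Model(modelPath);
        using var processor = new MultiModalProcessor(model);
        using var tokenizerStream = processor.CreateStream();

        // The <|image_1|> placeholder is expanded by the processor into image tokens,
        // so a bigger image means longer input_ids and more of the context budget used.
        string prompt = "<|user|>\n<|image_1|>\nWhat text is in this image?<|end|>\n<|assistant|>\n";

        var images = Images.Load(imagePath);
        var inputs = processor.ProcessImages(prompt, images);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 3072); // must stay within the model's context length
        generatorParams.SetInputs(inputs);

        using var generator = new Generator(model, generatorParams);
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
            var sequence = generator.GetSequence(0);
            Console.Write(tokenizerStream.Decode(sequence[^1]));
        }
    }
}
```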

  2. We are waiting on the DirectML version of the ONNX model. The CPU model is slow even when using the DirectML library on a 4090.

We hope to have this published soon. We encountered some memory issues when running the DML model for phi3-vision and needed to spend some more time ironing them out. But I am hopeful that the models and the ort-genai package can be published sometime next week.

  3. We have a client Windows app. Is it possible to have a single NuGet package with CUDA, DirectML, and CPU support? That way, whichever version of the model the user has, the library can work with it.

This is a difficult problem to solve, and for now we do not have this support. There are several problems that we would need to iron out first.

I am not sure if and when we can achieve this. But I'll add it to our wish list.

  4. Does the onnxruntime_genai.models.builder support vision models now?

As of now, we had to manually create export scripts to convert the model to ONNX; the three ONNX components (the vision encoder, the text embedding, and the text decoder) are each exported separately.

We are looking to provide a way to automate this process. I am not sure whether it will live in the model builder or in another script we provide to users, but this is something we plan to do.

AshD commented 1 month ago

Thanks @baijumeswani

  1. Our app, Fusion Quill, reduces the size of the image to 100 KB, and we found that worked well except in cases where there was fine print on the image and extraction failed. Once we have the DML model, we can do some benchmarking on image size versus the time taken to extract information.
  2. I noticed the DirectML library has issues with the full 128K context of the Phi-3 model. Would love to see this fixed by falling back to CPU memory when GPU memory is not enough.
  3. Our app also uses llama.cpp, and the way we resolve this is by detecting which variant (CPU-AVX, CUDA) of llama.dll is needed and calling the Windows LoadLibrary function with that DLL path (a rough sketch of this approach is below).
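For anyone facing the same multi-backend packaging problem, here is a minimal C# sketch of that LoadLibrary approach. The folder layout and the CUDA-detection heuristic (checking for nvcuda.dll) are illustrative assumptions, not anything shipped by ort-genai or llama.cpp:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class NativeBackendLoader
{
    // Standard Win32 LoadLibrary via P/Invoke; returns IntPtr.Zero on failure.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    // Picks which native DLL variant to load at runtime.
    // The directory names ("cuda", "cpu-avx") are placeholders for however
    // the app lays out its bundled binaries.
    public static void LoadLlamaDll(string baseDir)
    {
        // Crude CUDA check: the NVIDIA driver installs nvcuda.dll into System32.
        bool hasCuda = File.Exists(Path.Combine(Environment.SystemDirectory, "nvcuda.dll"));
        string variant = hasCuda ? "cuda" : "cpu-avx";
        string dllPath = Path.Combine(baseDir, variant, "llama.dll");

        if (LoadLibrary(dllPath) == IntPtr.Zero)
        {
            throw new DllNotFoundException(
                $"Failed to load {dllPath} (Win32 error {Marshal.GetLastWin32Error()}).");
        }
    }
}
```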
natke commented 4 weeks ago

Hi @AshD, does that answer your questions for now?

AshD commented 3 weeks ago

Found one major issue with the Phi-3 vision model. When I send a simple message without an image, it returns a </s>.

<|user|>
 Hola <|end|>
<|assistant|>

I expected the vision model to behave like the regular phi-3-mini if no image is sent with the prompt.
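For clarity, the two prompt shapes being compared are roughly the following (illustrative C# strings; the <|image_1|> placeholder is only present when an image is attached and is expanded into image tokens by the multimodal processor):

```csharp
// Text-only prompt, as used above.
string textOnlyPrompt = "<|user|>\nHola<|end|>\n<|assistant|>\n";

// Prompt with an image attached; the processor replaces <|image_1|> with image tokens.
string promptWithImage = "<|user|>\n<|image_1|>\nWhat is shown in this image?<|end|>\n<|assistant|>\n";
```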

baijumeswani commented 3 weeks ago

@AshD are you using phi3v.py? If you do not provide any image, it should just behave like the phi3 language model. What prompt did you use?

AshD commented 3 weeks ago

I am using the .NET DirectML NuGet 0.3.0-rc2 and the CPU phi-3 vision model. This works with the phi-3 mini model.

The prompt was:

<|user|>
 Hola <|end|>
<|assistant|>
AshD commented 3 weeks ago

The weird thing is that if I change the prompt to make it generate JSON, it generates some JSON:

<|user|>
You are chat bot that responds in JSON format.
 Hola <|end|>
<|assistant|>
baijumeswani commented 3 weeks ago

@AshD are you using phi3v.py? If you do not provide any image, it should just behave like the phi3 language model. What prompt did you use?

I need to make a small correction here. Although the text model is the same as the phi3 text model when no image is provided, the processor uses a different tokenizer. I believe the response discrepancy comes from the tokenizer being different between the phi3 mini language model and the phi3 vision model.

AshD commented 3 weeks ago

Can this be fixed? We like this model :-)

baijumeswani commented 3 weeks ago

Can this be fixed? We like this model :-)

OK, I looked into it a bit. It seems like the text model in phi3-vision has different weights compared to the text model in phi3 mini 128k. This is where the difference comes from. Since the model weights come from the original Hugging Face PyTorch model, there isn't much we can do to address this.

AshD commented 3 weeks ago

Is this a problem with the transformers version of the model too? If so, can you report it to them? The expectation is that a multi-modal model would handle both text and images, or text only.

Thanks, Ash

SloDamn commented 6 days ago

Hello @baijumeswani, any updates on DirectML support for Phi-3 Vision?

baijumeswani commented 5 days ago

@SloDamn our DML package already supports the phi3 vision model, but we have not made the DML variant of the model public since it has not been RAI validated yet. We hope to make the DML phi3 vision model public this week.

sridhar21111976 commented 2 days ago

I am trying the phi3 vision ONNX model for local inference with image and prompt input (no GPU hardware). The response generation takes a long time. I am on an i7 10-core HP laptop with 16 GB of memory.

I tried with different image resolutions and better hardware, with no change. Any recommendations? Each response takes more than 2-3 minutes.