unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

add PreProcessor for VLM #57

Open wnma3mz opened 6 months ago

ashvardanian commented 6 months ago

Thank you for the contribution, @wnma3mz! Detaching the preprocessing code is probably the right thing to do. Give us a couple of days to merge it 🤗

VoVoR commented 6 months ago

@wnma3mz hey,

We appreciate your work on the PR!

I wanted to ask you to remove the changes from the src/ dir and keep all the updates in the scripts that are useful for the onnx/coreml runtimes. We use src/ together with our pre-training code, so we don't want to update it frequently. We agree it would be great to separate preprocessing and modeling into different classes, and we have already done so, albeit in a slightly different way. You can expect it in the next release in a few weeks.
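
For context, a minimal sketch of what separating preprocessing from modeling could look like (the class name, method, and normalization stats below are illustrative assumptions, not the actual uform API): the preprocessor owns only the image transform and emits plain arrays, so the same object can feed PyTorch, ONNX, or CoreML backends.

```python
# Hypothetical sketch of a backend-agnostic image preprocessor; not the uform implementation.
from dataclasses import dataclass

import numpy as np
from PIL import Image


@dataclass
class ImagePreprocessor:
    # CLIP-style normalization stats; assumed here for illustration only.
    size: int = 224
    mean: tuple = (0.48145466, 0.4578275, 0.40821073)
    std: tuple = (0.26862954, 0.26130258, 0.27577711)

    def __call__(self, image: Image.Image) -> np.ndarray:
        # Resize to a square, scale to [0, 1], normalize, move channels first, add a batch dim.
        image = image.convert("RGB").resize((self.size, self.size), Image.BICUBIC)
        array = np.asarray(image, dtype=np.float32) / 255.0
        array = (array - np.asarray(self.mean, dtype=np.float32)) / np.asarray(self.std, dtype=np.float32)
        return array.transpose(2, 0, 1)[None]
```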

Also, as far as I understand, you tested your script with model_fpath = "unum-cloud/uform-coreml-onnx", correct?

wnma3mz commented 6 months ago

Thanks for your reply; I have deleted the changes in the src/ directory.

As you said, I tested it with scripts/example.py, so that part of the code will be affected. When you push the new preprocessing, feel free to remind me to update scripts/example.py so it keeps working correctly.

VoVoR commented 6 months ago

@wnma3mz hi, I've tested the example.py script with model_fpath = 'unum-cloud/uform-coreml-onnx' and it didn't work. And it shouldn't, because get_model won't work with our coreml/onnx HF model card. How exactly did you run the script? Can you push the working version, by any chance, so I can check it?
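
For reference, one way a script could sidestep get_model is to pull the model card locally with snapshot_download and open the exported encoders directly with onnxruntime; this is only a sketch, and the file names are an assumption based on the directory tree wnma3mz posted below.

```python
# Sketch only: loading the exported ONNX encoders directly, bypassing get_model.
# The file names are assumptions based on the layout shown further below.
import os

import onnxruntime as ort
from huggingface_hub import snapshot_download

local_dir = snapshot_download("unum-cloud/uform-coreml-onnx")
image_encoder = ort.InferenceSession(os.path.join(local_dir, "multilingual.image-encoder.onnx"))
text_encoder = ort.InferenceSession(os.path.join(local_dir, "multilingual.text-encoder.onnx"))
```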

wnma3mz commented 6 months ago

@VoVoR

I'm sorry for the trouble. For the convenience of testing, I downloaded all the model files locally in advance. The file structure is as follows:

├── multilingual-v2.image-encoder.mlpackage
│   ├── Data
│   │   └── com.apple.CoreML
│   │       ├── model.mlmodel
│   │       └── weights
│   │           └── weight.bin
│   └── Manifest.json
├── multilingual-v2.image-encoder.mlpackage.zip
├── multilingual-v2.text-encoder.mlpackage
│   ├── Data
│   │   └── com.apple.CoreML
│   │       ├── model.mlmodel
│   │       └── weights
│   │           └── weight.bin
│   └── Manifest.json
├── multilingual-v2.text-encoder.mlpackage.zip
├── multilingual.image-encoder.onnx
├── multilingual.text-encoder.onnx
├── tokenizer.json
├── torch_config.json
└── torch_weight.pt

The current 'snapshot_download' call can interfere with testing for network reasons, so I added a 'get_local_model' function to make the script easy to run.
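
A rough sketch of what such a helper might look like (the actual get_local_model in the PR may differ; the file names match the tree above, and the tokenizer loading assumes the tokenizers library):

```python
# Sketch of a get_local_model-style helper; the real PR code may differ.
from pathlib import Path

import onnxruntime as ort
from tokenizers import Tokenizer


def get_local_model(model_dir: str):
    """Load the tokenizer and both ONNX encoders from an already-downloaded directory."""
    root = Path(model_dir)
    tokenizer = Tokenizer.from_file(str(root / "tokenizer.json"))
    image_encoder = ort.InferenceSession(str(root / "multilingual.image-encoder.onnx"))
    text_encoder = ort.InferenceSession(str(root / "multilingual.text-encoder.onnx"))
    return tokenizer, image_encoder, text_encoder
```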

If you have any other questions, please feel free to let me know.