josephrocca / openai-clip-js

OpenAI's CLIP model ported to JavaScript using the ONNX web runtime
MIT License
129 stars 10 forks

Currently converting VIT-L/14 model #3

Closed ghost closed 2 years ago

ghost commented 2 years ago

Hi, I am currently trying to convert the ViT-L/14 model, but I'm running into a memory issue when I try to load the model in the ONNX web runtime. Do you have any ideas? I might just have to wait for it to be quantized to INT8.

Thanks,

josephrocca commented 2 years ago

Are you seeing a runtime error like Uncaught (in promise) 1991937888? If so, and if the number is close to 2 billion like in that example, then you're likely running into this issue: https://github.com/microsoft/onnxruntime/issues/10957

Apparently it's possible to build ORT Web with the max memory limit set to 4GB instead of 2GB, as mentioned here: https://github.com/microsoft/onnxruntime/issues/10957#issuecomment-1074397486

Please comment on that issue, briefly explaining your use case to help motivate the ORT Web authors to raise the limit to 4GB.

I think we're waiting for this wasm proposal to allow more than 4GB of memory.
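
For context, here's a rough sketch of where that error shows up - just an illustration, not code from this repo's demos, and the model URL and the wasm execution provider option are assumptions:

// Sketch only: assumes `ort` is the global from the onnxruntime-web script.
async function createSession(modelUrl) {
  try {
    // Fetches the .onnx file and allocates wasm memory for the weights.
    return await ort.InferenceSession.create(modelUrl, { executionProviders: ['wasm'] });
  } catch (e) {
    // In the affected builds, hitting the 2GB wasm memory limit rejects the
    // promise with a raw number rather than an Error object.
    console.error('Session creation failed (possibly out of wasm memory):', e);
    throw e;
  }
}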

Also note that I did briefly try to quantize the models using the ONNX tooling, but for some reason it wasn't working. From the README of this repo:

The model files are about 4x larger than they actually need to be - params are float32 instead of uint8. If you're using CLIP in a "real" web app, you should probably quantize it. @minimaxir has done it (1, 2), and that model worked first try with ORT Web (which is amazing), but it outputs a 768 element vector instead of 512, which I think is because @minimaxir's model is missing the final projection head that puts image embeddings into the same-sized space as the text embeddings. I had a quick attempt at it in the ONNX export notebook (see the cell after the ONNX conversion), but it doesn't seem to be working. If you investigate this and get it working, please open an issue. Thanks to @congraIiIso on Twitter for bringing the uint8 quantization to my attention!

The code to quantize is actually really simple:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: store the model's float32 weights as uint8.
quantize_dynamic("clip-image-vit-32.onnx", "clip-image-vit-32-uint8.onnx", weight_type=QuantType.QUInt8, extra_options={"MatMulConstBOnly": False})

Again, I only briefly tried to get it to work - if you have time to experiment then I'd be interested to see whether you manage to get quantization working.

ghost commented 2 years ago

Thanks, I will try out quantization, and will also reach out to ONNX to see if it's supported.

ghost commented 2 years ago

I'm getting Uncaught (in promise) 621137752 with the quantized model. The imgbedding model without the final projection head works fine without any memory issue, so I doubt it has anything to do with memory. Maybe it's an older version of ONNX producing different ops, etc.

ghost commented 2 years ago

I now think this is likely an ONNX export bug, since the uploaded imgbedding model works fine but if I try to re-export it with the same config it fails. I'm going to try downgrading the ONNX exporter and see if that works.

ghost commented 2 years ago

Your quantization code is correct; the error is in the jsbin. You just need to use the latest version of ORT Web:

https://cdn.jsdelivr.net/npm/onnxruntime-web@1.12.0/dist/ort.js

I was able to build a 291.9 MB quantized ViT-L/14 model that produces correct embeddings. Thanks again for your effort on openai-clip-js and the clip_sorter repo 🎉.
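
For anyone reproducing this, here's a rough sketch of running the quantized image model against the newer ort.js build linked above - the 224x224 input shape and the preprocessing are assumptions on my part, not details from this thread:

// Sketch only: `pixels` is a Float32Array of preprocessed image data (1x3x224x224).
async function embedImage(modelUrl, pixels) {
  const session = await ort.InferenceSession.create(modelUrl);
  const feeds = {};
  // Use whatever input name the exported graph declares rather than hard-coding one.
  feeds[session.inputNames[0]] = new ort.Tensor('float32', pixels, [1, 3, 224, 224]);
  const results = await session.run(feeds);
  // The image embedding comes back as a flat Float32Array.
  return results[session.outputNames[0]].data;
}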

josephrocca commented 2 years ago

@SuperVisualApp Great to hear you got quantization working! Thanks for sharing your work/progress on this. I am a bit confused though: which jsbin are you referring to? I might be misremembering, but I thought the jsbins that I linked were working fine (using minimaxir's models), but the problem was that they output a 768-dim vector (instead of 512) because they were missing the projection head. And when I tried to do the conversion in this notebook, it produced a model that has the same file size as the original?

ghost commented 2 years ago

I used the quantized weights produced by your notebook. The only change was in the JSBin that I used to verify if it was correct.

ghost commented 2 years ago

[screenshot]

ghost commented 2 years ago

The final model is definitely smaller. The unquantized version is 580MB.

josephrocca commented 2 years ago

@SuperVisualApp Oh, weird. I just opened up the notebook, clicked "Runtime > Run all" without making any changes, and then checked the output, and it doesn't reduce the file size for me (both files are ~167MB). I also tried switching to ViT-L/14 (from ViT-B/32) and it has the same problem - the uint8 output is ~580MB. Could you share the notebook that you're using? Perhaps you made some slight changes?

ghost commented 2 years ago

Ah, it turns out that it was not the model exported using your notebook, but rather the Jina AI CLIP as a Service model:

https://clip-as-service.s3.us-east-2.amazonaws.com/models-436c69702d61732d53657276696365/onnx/ViT-L-14/visual.onnx

If you quantize that one, it works correctly and produces a 292MB model.

josephrocca commented 2 years ago

@SuperVisualApp Ah okay, thanks!

josephrocca commented 2 years ago

Okay, very strange, this works (using the original onnx files rather than producing new ones with the Export_CLIP_to_ONNX_tflite_tfjs_tf_saved_model.ipynb notebook in this repo):

!pip install onnxruntime
!pip install onnx
!wget https://huggingface.co/rocca/openai-clip-js/resolve/main/clip-image-vit-32-float32.onnx

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the already-exported float32 model straight from this repo's Hugging Face files.
quantize_dynamic("clip-image-vit-32-float32.onnx", "clip-image-vit-32-uint8.onnx", weight_type=QuantType.QUInt8)

So here are the quantized models:

I'm guessing something changed in ONNX/PyTorch since I last exported. Oh, actually, I just realised while typing this comment: It's probably related to the post-conversion float16-to-float32 stuff that I had to do, mentioned in a bullet-point in the readme.

So the conversion now works without any errors, and the file sizes are a quarter of the original, as expected, but the embeddings seem to be inaccurate - the results are noticeably worse than with the normal models when testing with the clip-image-sorter demo. I've added these quantized models to the demos in this repo and to the clip-image-sorter for testing in any case.
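
One quick way to quantify the drop (just a sketch, not something that's in the repo or demos) is to embed the same image with the float32 and uint8 models and compare the two embeddings, e.g. with cosine similarity:

// Sketch only: `a` and `b` are embeddings of the same image from the two models.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Values close to 1.0 mean the quantized model is faithful; noticeably lower values
// would line up with the worse sorting results seen in the clip-image-sorter demo.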

I'll also link this comment from the readme in case anyone wants to explore this further.

ghost commented 2 years ago

You could potentially compare it with the quantized Jina AI ONNX-exported model to see if it's better.

ghost commented 2 years ago

You can try out what I am building with CLIP running in the browser here: https://www.supervisual.app/

josephrocca commented 2 years ago

Very cool!