Hey @shortcipher3! I haven't really played around with the text model yet, sorry, but it's handy that you've documented this so that when I or others get around to testing we can compare against your results. I've been meaning to finish off the image model (it should just require normalisation, as mentioned in the readme todo), but it might be worth testing that in Python too, since it could be that something is wrong with the conversion process in general (not just in the text case).
I may have misled you (and maybe others) into thinking that this repo is in more of a "ready to use" state than it is, so I've added "WIP" to the repo title, which will hopefully help a little. If you make any progress on this, or run other tests, please do share your findings here. It might also be worth testing the ONNX and TF SavedModel outputs to see where they diverge from the original PyTorch models. The tflite model is last in the conversion chain, so the problem could be introduced at an earlier point.
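For whoever picks this up first, a sketch of that stage-by-stage check might look like the following; the file names, the serving_default signature, and the NCHW input layout are all assumptions about the converted artifacts:

import numpy as np
import onnxruntime as ort
import tensorflow as tf

# Feed one dummy image batch through both intermediate models.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# ONNX stage.
sess = ort.InferenceSession('clip-image-vit-32.onnx')
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x})[0]

# TF SavedModel stage.
saved = tf.saved_model.load('clip-image-vit-32-saved-model')
outputs = saved.signatures['serving_default'](tf.constant(x))
tf_out = next(iter(outputs.values())).numpy()

print('max abs diff, ONNX vs SavedModel:', np.abs(onnx_out - tf_out).max())

If those two agree but the tflite output doesn't, the problem is in the final TFLite conversion step.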
Thanks for the reply.
FYI: I have tested the image output and I am getting the same results. Here is my test code for your reference:
import os

import numpy as np
import skimage
import tensorflow as tf
from PIL import Image

image_model_path = 'clip-image-vit-32.tflite'
# Load TFLite model and allocate tensors.
image_interpreter = tf.lite.Interpreter(model_path=image_model_path)
image_interpreter.allocate_tensors()
# Get input and output tensor details.
image_input_details = image_interpreter.get_input_details()
image_output_details = image_interpreter.get_output_details()
# Images in skimage to use and their textual descriptions.
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse",
    "coffee": "a cup of coffee on a saucer",
}
# Apples-to-apples comparison with the original CLIP model.
import clip

# "ViT-B/32" is an assumption here; use whichever checkpoint was converted to tflite.
model, preprocess = clip.load("ViT-B/32", device="cpu")

pytorch_images = []
for filename in [f for f in os.listdir(skimage.data_dir) if f.endswith((".png", ".jpg"))]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue
    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
    pytorch_images.append(preprocess(image))
# Use only the first image, since the tflite model only supports a batch size of 1.
image_input = pytorch_images[0].unsqueeze(0)
pytorch_image_features = model.encode_image(image_input).float()
print(f'pt image features {pytorch_image_features[0, :10]}')

# The converted model takes float16 input.
tf_image_input = image_input.numpy().astype(np.float16)
image_interpreter.set_tensor(image_input_details[0]['index'], tf_image_input)
image_interpreter.invoke()
tf_image_features = image_interpreter.get_tensor(image_output_details[0]['index'])[0]
print(f'tf image features {tf_image_features[:10]}')
The output is:
pt image features tensor([ 0.1388, 0.5618, -0.1767, -0.1225, 0.2445, 0.0420, 0.8171, 0.4761,
-0.1947, -0.7773], grad_fn=<SliceBackward0>)
tf image features [ 0.139 0.561 -0.1775 -0.12384 0.2452 0.0402 0.817 0.4783
-0.1969 -0.778 ]
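The remaining differences are small, presumably from the float16 conversion. To quantify the agreement rather than eyeballing it, something like this can be appended to the script above (reusing its variables):

# Compare the two embeddings numerically.
pt_feat = pytorch_image_features[0].detach().numpy().astype(np.float32)
tf_feat = tf_image_features.astype(np.float32)
print('max abs diff:', np.abs(pt_feat - tf_feat).max())
cos = np.dot(pt_feat, tf_feat) / (np.linalg.norm(pt_feat) * np.linalg.norm(tf_feat))
print('cosine similarity:', cos)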
LGTM!
I also just tested with the ResNet-50 CLIP model; both the text and image results match the PyTorch results!
For the image model conversion, I had to change it to a 32-bit model to get the conversion tool to accept it.
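In case it saves someone else time, a plain float32 TFLite conversion (i.e. no float16 optimisation) might look like the sketch below; the SavedModel path and output file name are assumptions:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('clip-image-rn50-saved-model')
# Leaving converter.optimizations and converter.target_spec at their defaults
# keeps weights and activations in float32.
tflite_model = converter.convert()
with open('clip-image-rn50-fp32.tflite', 'wb') as f:
    f.write(tflite_model)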
Just adding a note here that I've added the proper normalisation to the image encoder model, and confirmed that it produces the same embedding as PyTorch when using ONNX Runtime Web (wasm backend, as in the demo). This doesn't directly help with the tflite+ViT issue, but it's at least another data point to help others debug the conversion problems.
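For reference, CLIP's preprocess normalises images with fixed per-channel constants, so a model that bakes the normalisation in should apply the equivalent of the sketch below (assuming float32 RGB input scaled to [0, 1]):

import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def normalise(img):
    # img: (H, W, 3) float32 array with values in [0, 1]
    return (img - CLIP_MEAN) / CLIP_STD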
For the text model I'm blocked by some conversion issues which I haven't had time to sit down and properly debug yet.
Closing this for now. Both the text and image models are now working properly with the ONNX Runtime.
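For anyone landing here later, a minimal sketch of running the converted text encoder with the Python onnxruntime package; the file name, single-output assumption, and input dtype are guesses to adapt:

import numpy as np
import onnxruntime as ort
import clip

sess = ort.InferenceSession('clip-text-vit-32.onnx')
# clip.tokenize pads/truncates to CLIP's 77-token context length.
tokens = clip.tokenize(['a facial photo of a tabby cat']).numpy()
# The int64 cast is an assumption; the exported graph may expect int32.
(text_features,) = sess.run(None, {sess.get_inputs()[0].name: tokens.astype(np.int64)})
print(text_features.shape)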
Awesome project!
I'm trying to use the tflite model that comes out of the conversion, but its output doesn't match the original model's.
After converting to tflite, I use the data provided in the OpenAI PyTorch example:
which gives output of:
For comparison, the tutorial notebook https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb runs the same calculation:
which gives output of: