qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

clip-ViT-B-32-multilingual-v1 support, ps: I can contribute. #70

Open · yaman opened this issue 11 months ago

yaman commented 11 months ago

I exported clip-ViT-B-32-multilingual-v1 to ONNX with some modifications (no effect on the output embedding).

The HF Optimum ONNX export can export this model with its (0) Transformer and (1) Pooling modules, but it cannot extend the export with the provided (2) Dense layer. What I did was create a model that combines the three layers as follows:

CombinedModel

from sentence_transformers import SentenceTransformer
from sentence_transformers import models
import torch
import torch.nn as nn
import onnx
import numpy as np

class CombinedModel(nn.Module):
    def __init__(self, transformer_model, dense_model):
        super(CombinedModel, self).__init__()
        self.transformer = transformer_model
        self.dense = dense_model

    def forward(self, input_ids, attention_mask):
        # run the SentenceTransformer forward pass and take the per-token embeddings
        outputs = self.transformer({'input_ids': input_ids, 'attention_mask': attention_mask})
        token_embeddings = outputs['token_embeddings']
        # apply the Dense projection (768 -> 512) to every token embedding
        dense_output = self.dense({'sentence_embedding': token_embeddings})
        dense_output_tensor = dense_output['sentence_embedding']

        ### important: the original pipeline mean-pools the token embeddings before the Dense layer;
        ### since the Dense layer is linear (no bias, Identity activation), taking the mean of the
        ### per-token dense outputs is equivalent and reproduces the original sentence embedding
        mean_output = torch.mean(dense_output_tensor, dim=1)
        flattened_output = mean_output.squeeze(0)
        return flattened_output

Combine the dense layer with the original model

transformer_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1', cache_folder='model_pytorch')
tokenizer = transformer_model.tokenizer

### this is from dense model configuration
dense_model = models.Dense(
    in_features=768,
    out_features=512,
    bias=False,
    activation_function= nn.Identity()
)

### load the weights from dense model binary
state_dict = torch.load('model_pytorch/sentence-transformers_clip-ViT-B-32-multilingual-v1/2_Dense/pytorch_model.bin')
dense_model.load_state_dict(state_dict)

model = CombinedModel(transformer_model, dense_model)
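As an optional sanity check, printing the loaded SentenceTransformer shows its module list; assuming the standard layout for this model, it should list the (0) Transformer, (1) Pooling and (2) Dense modules, with the Dense layer going from 768 to 512 features as used above.

### optional sanity check: the printed structure should show the
### (0) Transformer, (1) Pooling and (2) Dense(768 -> 512) modules
print(transformer_model)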

Export the combined model to ONNX

model.eval()

input_text = "This is a multi-lingual version of the OpenAI CLIP-ViT-B32 model. You can map text (in 50+ languages) and images to a common dense vector space such that images and the matching texts are close."

inputs = tokenizer(input_text, padding='longest', truncation=True, max_length=128, return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Export the model
torch.onnx.export(model,               # model being run
                  (input_ids, attention_mask), # model input (or a tuple for multiple inputs)
                  "combined_model.onnx", # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=17,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input_ids', 'attention_mask'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input_ids': {0 : 'batch_size', 1: 'seq_length'},    # variable length axes
                                'attention_mask': {0 : 'batch_size', 1: 'seq_length'},
                                'output' : {0 : 'batch_size'}})

onnx.checker.check_model("combined_model.onnx")
combined_model = onnx.load("combined_model.onnx")
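Optionally, the loaded graph can be inspected to verify that the exported input and output names match what was passed to torch.onnx.export; a small sketch using the onnx API:

### optional: verify the exported input/output names
print([inp.name for inp in combined_model.graph.input])   # expected: ['input_ids', 'attention_mask']
print([out.name for out in combined_model.graph.output])  # expected: ['output']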

Compare the outputs of the original model and the ONNX model:

import torch
import numpy as np
import onnxruntime as ort
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/clip-ViT-B-32-multilingual-v1')

# Prepare the input
text = "This is an example sentence."
inputs = tokenizer(text, padding='longest', truncation=True, max_length=128, return_tensors='pt')

# Run the PyTorch model
pytorch_output =  model.encode(text, convert_to_tensor=True, device='cpu')

# Convert the inputs to numpy arrays for the ONNX model
inputs_onnx = {name: tensor.numpy() for name, tensor in inputs.items()}

# Run the ONNX model
sess = ort.InferenceSession("combined_model.onnx")
onnx_output = sess.run(None, inputs_onnx)

# Compare the outputs
print("Are the outputs close?", np.allclose(pytorch_output.detach().numpy(), onnx_output[0], atol=1e-6))

# Calculate the differences between the outputs
differences = pytorch_output.detach().numpy() - onnx_output[0]

# Print the standard deviation of the differences
print("Standard deviation of the differences:", np.std(differences))

print("pytorch_output size:", pytorch_output.size())
print("onnx_output size:", onnx_output[0].shape)

Output:

Are the outputs close? True
Standard deviation of the differences: 1.6167593e-07
pytorch_output size: torch.Size([512])
onnx_output size: (512,)
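Since these embeddings are usually compared by cosine similarity, an extra (optional) check can be appended to the comparison script above; a small sketch reusing the pytorch_output and onnx_output variables from that script:

# cosine similarity between the PyTorch and ONNX embeddings (should be very close to 1.0)
a = pytorch_output.detach().numpy()
b = onnx_output[0]
print("Cosine similarity:", np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))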

I would really like to contribute the ONNX model so that novices like me can use the ONNX version easily. I did not find a CONTRIBUTING guide, but I can contribute the model with your directions.

yaman commented 10 months ago

Anyone in the void?

NirantK commented 10 months ago

Hey @yaman ! Sorry, I was away from the project.

Would love to have this! This is quite a neat workaround!

Can you push the ONNX Model weights to Huggingface Hub and raise a PR with that? That way, you always retain the attribution for doing the ONNX export.
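For reference, the upload can be done with the huggingface_hub client; a minimal sketch, where the repo id and the file name inside the repo are placeholders to replace with your own:

from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/clip-ViT-B-32-multilingual-v1-ONNX"  # placeholder repo id
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_file(
    path_or_fileobj="combined_model.onnx",  # the file exported above
    path_in_repo="model.onnx",              # placeholder file name inside the repo
    repo_id=repo_id,
)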

I can help you get started with both. Here is a calendar link if that's easier? https://cal.com/nirant-kasliwal-qdrant/30min

yaman commented 10 months ago

Hi @NirantK,

Sorry for the late reply, I caught the flu and it knocked me out.

Let me give the latest updates:

After following up on the issue with folks from the hf-optimum team, my workaround is not necessary anymore; the team fixed the problem via https://github.com/huggingface/optimum/issues/1519 on their main branch (though it might not be released yet).
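Once that fix is in a released optimum version, the manual CombinedModel should not be needed at all; a hedged sketch of what the direct export might look like (exact arguments may vary by optimum version):

### requires an optimum version that includes the fix referenced above
from optimum.exporters.onnx import main_export

main_export(
    "sentence-transformers/clip-ViT-B-32-multilingual-v1",
    output="clip_text_onnx/",
)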

I have already created an HF repo (https://huggingface.co/canavar/clip-ViT-B-32-multilingual-v1-ONNX), but I was waiting for a response from the model owners about pushing to the original model repository (if possible), with no luck so far. I will upload the ONNX version of the model to my HF repo and let you know.

thanks

yaman commented 10 months ago

Hi @NirantK again,

I pushed the model to https://huggingface.co/canavar/clip-ViT-B-32-multilingual-v1-ONNX. Do you want me to raise a PR to the fastembed repo?

NirantK commented 10 months ago

I'd love it if you could PR it! That'll go much faster!

joein commented 4 months ago

We've added image embedding support (including CLIP) in v0.3.0, though not a multilingual version yet.
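For anyone landing here, a minimal sketch of the image embedding API added in v0.3.0; the model name below is an assumption, check list_supported_models() for what is actually available:

from fastembed import ImageEmbedding

# the model name is an assumption; inspect the supported models first
print(ImageEmbedding.list_supported_models())
model = ImageEmbedding(model_name="Qdrant/clip-ViT-B-32-vision")
embeddings = list(model.embed(["path/to/image.jpg"]))
print(embeddings[0].shape)  # expected: (512,) for CLIP ViT-B/32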