Investigate: compute embeddings via CoreML model

roblg commented 5 months ago

(Splitting out this discussion from #17; putting it here to document what I tried in case someone else wants to follow up)

I attempted to convert the gte-small model from HuggingFace from PyTorch to CoreML and integrate it into rem.

Attempt #1: just use the CoreML model that somebody uploaded to the HF repo a few weeks ago

Result: I was able to easily get a tokenizer imported via swift-transformers, and import the CoreML model, but the actual model prediction resulted in NaNs.

Attempt #2: convert the model myself using the huggingface exporters project

Result: conversion fails in the validation phase, because it outputs NaNs... (see a pattern here? :) )

Attempt #3: manual conversion by following the coremltools documentation

Result: kind of a few different things, but mostly: NaNs.
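
For anyone following up, a check along these lines may help pin down where the NaNs first appear. It is a minimal sketch, not what exporters or coremltools run in their validation phase: the function name `compare_outputs`, the tolerance, and the toy values are all hypothetical, and the outputs are assumed to be already flattened into plain lists of floats.

```python
import math


def compare_outputs(reference, converted, atol=1e-3):
    """Compare a converted model's output against a reference output.

    Both arguments are flat lists of floats (hypothetical stand-ins for
    the PyTorch and CoreML embeddings). Returns (ok, reason).
    """
    if len(reference) != len(converted):
        raise ValueError("outputs differ in length")

    # Check for NaN/inf first, since any comparison against them is meaningless.
    bad = [i for i, v in enumerate(converted) if not math.isfinite(v)]
    if bad:
        return False, f"non-finite values at indices {bad}"

    worst = max((abs(r - c) for r, c in zip(reference, converted)), default=0.0)
    if worst > atol:
        return False, f"max abs diff {worst:.4g} exceeds {atol}"
    return True, "ok"


# Toy values, not real model outputs.
print(compare_outputs([0.12, -0.40, 0.88], [0.12, float("nan"), 0.88]))
```

Running the same check on intermediate outputs (if they can be exposed) would narrow down which layer goes non-finite first.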

I'm unclear whether conversion of a pytorch model for embeddings specifically is something that's supported/intended by coremltools. They have a lot of models included that seem much more complicated than a BERT embedding model should be but :shrug:.

After a lot of poking and tweaking of inputs, I was able to get the PyTorch model loaded into CoreML in fp16 format (it was defaulting to fp32 for some reason -- I think that's why the model uploaded to HF was so big to begin with). At that point I started getting fp32 <--> fp16 compatibility errors from coremltools, which is a definite improvement, but... still not functional.

Error:

... snip ...
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py", line 190, in __init__
    self._validate_and_set_inputs(input_kv)
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py", line 503, in _validate_and_set_inputs
    self.input_spec.validate_inputs(self.name, self.op_type, input_kvs)
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/input_type.py", line 137, in validate_inputs
    raise ValueError(msg)
ValueError: In op, of type layer_norm, named input.5, the named input `epsilon` must have the same data type as the named input `gamma`. However, epsilon has dtype fp32 whereas gamma has dtype fp16.
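
A side note on why `epsilon` is a delicate value in fp16. This is only a sketch of one thing worth checking, and nothing in this thread confirms it as the cause of the NaNs. It assumes the model uses a BERT-style layer norm epsilon of 1e-12 (an assumption; check `layer_norm_eps` in the model's config), and `to_fp16` is a hypothetical helper built on the standard library's half-precision `struct` format.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumed BERT-style epsilon; verify against the actual model config.
assumed_eps = 1e-12

# 1e-12 is far below the smallest positive fp16 value (about 6e-8),
# so it rounds to zero when stored in half precision.
print(to_fp16(assumed_eps))  # 0.0

# A larger epsilon survives the round trip as a small positive number.
print(to_fp16(1e-5) > 0.0)  # True
```

If epsilon does collapse to zero, a layer norm over a constant input would divide by zero, which is one way to end up with NaNs or infs. Keeping layer norm in fp32 or using a larger epsilon would be things to try, under that assumption.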

Summary

So... I'm going to table this for now, given that there's already a more flexible/probably less finicky alternative (the rust lib + bindings). It was fun while it lasted, but there are only so many hours in the day. 😅

(Feel free to close this, I just didn't want to clutter up #17 given that there are ~3 discussions happening there right now.)

jasonjmcghee commented 5 months ago

Super appreciate the investigation!

Crazy that it's so difficult. FWIW, I found a bug in my thinking: I was paying too much attention to the C bindings and not enough to the embedding logic itself, and forgot to take the mean 😅 It's now fixed in https://github.com/jasonjmcghee/rust_embedding_lib

Still haven't taken the time to do the final step to make it a framework.
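
For context, "taking the mean" for a BERT-style embedding model usually means mean pooling: averaging the per-token vectors into one sentence vector while skipping padding positions. The sketch below illustrates that idea in plain Python. It is not the code from rust_embedding_lib, and `mean_pool` and the toy inputs are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0.

    token_embeddings: list of equal-length float lists, one per token.
    attention_mask:   list of 0/1 ints, one per token (0 = padding).
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask differ in length")

    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")

    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens and one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Skipping this step and using raw per-token output (or only the first token) gives vectors that do not match what the model's published similarity scores assume.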

jasonjmcghee commented 5 months ago

I met the creator of https://github.com/unum-cloud/usearch today and they have a swift offering.

Could be another option instead of @sqlite-vss.

jasonjmcghee commented 5 months ago

Here's a model they built that supports images and text. https://huggingface.co/unum-cloud/uform-vl-english

jasonjmcghee commented 1 month ago

@roblg if you're interested in taking another shot at this, https://github.com/ashvardanian/SwiftSemanticSearch looks super promising!
