smpanaro / more-ane-transformers

Run transformers (incl. LLMs) on the Apple Neural Engine.

How straightforward is it to run embedding models in Swift, like the BGE or GTE series models? #2

Closed michaeljelly closed 8 months ago

michaeljelly commented 9 months ago

Local embedding models are super helpful for app developers, since it means data and embeddings can be computed and stored locally!

Would they already work? Or would they work with a bit more Swift/Python conversion code?

Awesome work so far, king shit!

smpanaro commented 9 months ago

Thanks 🙌 So far everything in this repo is only for text generation (i.e. decoder-only). But I think you can follow the same patterns, and I'm optimistic that you'd see good results with encoder embeddings.

Basically you want to:

  1. Find a PyTorch implementation of your embedding model.
  2. Convert it using coremltools, profile it in Xcode, and look for slow parts.
  3. If there are slow parts, edit the PyTorch model and go back one step.
  4. Use it in your app.

In a bit more detail:

It looks like BGE uses the BERT architecture, so I would copy/paste a PyTorch implementation: this nano-BERT looks easy to work with, though you will need to add a from_pretrained method a la nanoGPT. huggingface's BERT implementation would work too. I would probably avoid huggingface/exporters, since if you need to make any changes to the model (and you probably will, to get the best performance), it will be difficult.

When you convert to CoreML, you'll get an .mlpackage file. Import that into your Xcode project (or use xcrun coremlcompiler generate) to get Swift code. You can also call it from Python with coremltools (like generate.py in this repo).
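
For reference, a minimal sketch of the Python route (my addition, not from this repo; the file name and input names are placeholders, so check your converted model's spec for the real ones):

import numpy as np
import coremltools as ct

# Load the converted package and ask Core ML to schedule it on the CPU + Neural Engine.
mlmodel = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_NE)

# Input names come from the traced PyTorch graph; print the spec if you're unsure.
print(mlmodel.get_spec().description.input)

outputs = mlmodel.predict({
    "input_ids": np.zeros((1, 512), dtype=np.int32),      # placeholder token ids
    "attention_mask": np.ones((1, 512), dtype=np.int32),  # placeholder mask
})
print({name: value.shape for name, value in outputs.items()})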

A few things to look out for if you want the best performance:

Two last things if you're interested in doing this from Swift:

If you give it a shot, I'd love to hear how it goes!

antmikinka commented 9 months ago

Here is another Swift app: https://github.com/guinmoon/LLMFarm This guy has it set up to utilize / convert from the .GGUF format. It would be great if the models were utilizing the ANE. I haven't run tests to see if they are, but my guess is they aren't, because they're too big to run on the ANE.

smpanaro commented 9 months ago

That's a cool project. Looks like it's based on llama.cpp so it's going to be running on CPU+GPU. The only way to run models on the ANE is via CoreML. There's no way to go straight from GGUF → CoreML to my knowledge, so you would have to start from PyTorch similar to how I mentioned above.

Even if you did that, I suspect you are correct that most of the models are on the edge of too large. Your best bet would be taking one of the "smaller" models (<= 3B or so) and quantizing it using CoreML palettization, then running on iOS 17/macOS Sonoma (there's no runtime benefit to quantizing on earlier OSes).
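
For example, a rough sketch of post-training palettization with coremltools 7+ (the file names here are placeholders):

import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("model.mlpackage")

# 4-bit k-means palettization of the weights. The runtime speed/memory win requires
# iOS 17 / macOS Sonoma; on earlier OSes it only shrinks the file on disk.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=4)
config = cto.OptimizationConfig(global_config=op_config)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("model-4bit.mlpackage")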

michaeljelly commented 9 months ago

I'm already using SimilaritySearchKit 😊

I was trying to find easy ways to implement embedding models in Swift. Managed to use PyTorch and coremltools to convert to CoreML :)

smpanaro commented 9 months ago

Nice! I took a look at your PR, looks like your model is effectively ~100% ANE already which is pretty sweet. Would you mind sharing the PyTorch model? Kind of curious to tinker with it.

antmikinka commented 9 months ago

@smpanaro you have amazing knowledge of the ANE & CoreML. Do you by chance follow Apple's Machine Learning Research papers?

I have found a couple of notable papers that may help larger models run on the ANE through better computation and reduced size.

I posted these papers in this issue https://github.com/guinmoon/LLMFarm/issues/18

I'm currently looking into how more ReLU activations may help models, and how the dynamic embeddings correlate to Llama's RoPE embeddings.

Here is another repo for a Swift vector database (embeddings): https://github.com/Dripfarm/SVDB

Not sure if it's better than SimilaritySearchKit.

michaeljelly commented 9 months ago

> Nice! I took a look at your PR, looks like your model is effectively ~100% ANE already which is pretty sweet. Would you mind sharing the PyTorch model? Kind of curious to tinker with it.

For sure! It's actually just the exact model from huggingface. Here's the code I used to grab the model and convert it.

import torch
from transformers import AutoTokenizer, AutoModel
import coremltools as ct

# Load the BGE embedding model and its tokenizer from huggingface.
# return_dict=False makes the model return a plain tuple, which torch.jit.trace handles cleanly.
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', model_max_length=512)
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', return_dict=False)

# Set the model in evaluation mode.
model.eval()

# Make sure we have a long enough input to fill out the max input (512 tokens).
# If you want to convert the model such that you can use a batch size greater than 1,
# you can pass in an array of sentences instead.
sentence = ''.join([" a"] * 510)

example_input = tokenizer(sentence, padding=True, return_tensors='pt')
input_ids = example_input['input_ids']
attention_mask = example_input['attention_mask']
print(input_ids.shape)

# Trace the model with the example inputs (token_type_ids isn't needed for this call).
example_input_tuple = (input_ids, attention_mask)
traced_model = torch.jit.trace(model, example_input_tuple, strict=False)

# Sanity-check the traced model.
out = traced_model(input_ids, attention_mask)
print(out)

# Convert to a Core ML program using the Unified Conversion API.
mlmodel = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=input_ids.shape), ct.TensorType(shape=attention_mask.shape)],
)

# Save the converted model.
mlmodel.save("model2.mlpackage")
michaeljelly commented 9 months ago

Hope that helps you/someone, @smpanaro! Good luck with your project.

smpanaro commented 9 months ago

Thanks @michaeljelly! I didn't realize you'd get such a clean model straight from huggingface, that's pretty sweet.

smpanaro commented 9 months ago

@antmikinka I have read a couple of them. Unfortunately most of the ones I've read or skimmed require training a model from scratch, which is a bit out of my wheelhouse currently. If you see any pre-trained models (e.g. on huggingface), you could try to convert them (the script above is a good place to start).

Otherwise, I think the best bet is to wait and see which ones Apple incorporates into coremltools/iOS next year (e.g. I believe this year's training-time quantization is based in part on this paper from last year).

antmikinka commented 8 months ago

@smpanaro Apple dropped some new repos: apple/ml-vision-transformers-ane and another one from their ML team, ml-explore. Figured I would update you in case you're not aware. Would love to see what you may come up with. I'm still currently trying to understand the repos.

Any way you could enable a discussions area on this repo?

smpanaro commented 8 months ago

@antmikinka I have seen MLX — seems like more of a PyTorch competitor, but will be watching to see if they add anything special for CoreML. I had not seen ml-vision-transformers-ane! I turned on discussions and will drop some thoughts about it there :)

Also going to close this issue since the original purpose seems resolved.