Closed michaeljelly closed 8 months ago
Thanks 🙌 So far everything in this repo is only for text generation (i.e. decoder-only models). But I think you can follow the same patterns, and I'm optimistic that you'd see good results with encoder embeddings.
Basically you want to:
In a bit more detail:
It looks like BGE uses the BERT architecture, so I would copy/paste a PyTorch implementation: this nano-BERT looks easy to work with, though you will need to add a from_pretrained method à la nanoGPT. Hugging Face's BERT implementation would work too. I would probably avoid huggingface/exporters, since if you need to make any changes to the model (and you probably will to get the best perf), it will be difficult.
When you convert to CoreML, you'll get an .mlpackage file. Import that into your Xcode project (or use xcrun coremlcompiler generate) to get Swift code. You can also call it from Python with coremltools (like generate.py in this repo).
A few things to look out for if you want the best performance:
Two last things if you're interested in doing this from Swift:
If you give it a shot, I'd love to hear how it goes!
Here is another Swift app. https://github.com/guinmoon/LLMFarm This guy has it set up to use / convert from the .GGUF format. It would be great if the models used the ANE. I haven't run tests to see if they do, but my guess is they don't because they are too big to run on the ANE.
That's a cool project. Looks like it's based on llama.cpp so it's going to be running on CPU+GPU. The only way to run models on the ANE is via CoreML. There's no way to go straight from GGUF → CoreML to my knowledge, so you would have to start from PyTorch similar to how I mentioned above.
Even if you did that, I suspect you are correct that most of the models are on the edge of too large. Your best bet would be taking one of the "smaller" models (<= 3B or so), quantizing it using CoreML palettization, and then running it on iOS 17/macOS Sonoma (there's no runtime benefit to quantizing on earlier OSes).
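To put rough numbers on the "edge of too large" point, here is a back-of-the-envelope sketch of weight storage at different precisions. The function name and the parameter counts are illustrative assumptions, and it counts weights only (it ignores palettization lookup tables, activations, and any KV cache), so treat the output as an estimate rather than a measurement.

```python
def weight_size_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model sizes, chosen only for illustration.
for name, params in [("3B", 3_000_000_000), ("7B", 7_000_000_000)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_size_gb(params, bits):.2f} GB")
```

Under these assumptions, a 3B-parameter model drops from roughly 6 GB of weights at 16 bits to roughly 1.5 GB at 4 bits, which is why the smaller models are the more realistic candidates.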
I'm already using SimilaritySearchKit 😊
I was trying to find easy ways to implement embedding models in Swift. Managed to use PyTorch and coremltools to convert to CoreML :)
Nice! I took a look at your PR, looks like your model is effectively ~100% ANE already which is pretty sweet. Would you mind sharing the PyTorch model? Kind of curious to tinker with it.
@smpanaro you have amazing knowledge of ANE & CoreML. Do you by chance view Apple's Machine Learning Research papers?
I have found a couple of notable papers that may assist in larger models running on ANE because of better computation and reducing size.
I posted these papers in this issue https://github.com/guinmoon/LLMFarm/issues/18
I'm currently looking into how more ReLU activations may help models and how the dynamic embeddings correlate to Llama RoPE embeddings.
Here is another repo for Swift Vector Database (embeddings) https://github.com/Dripfarm/SVDB
Not sure if it's better than the SimilaritySearchKit.
For sure! It's actually just the exact model from huggingface. Here's the code I used to grab the model and convert it.
import torch
from transformers import AutoTokenizer, AutoModel
import coremltools as ct

# Load the pre-trained embedding model and its tokenizer from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', model_max_length=512)
# return_dict=False makes the model return a plain tuple instead of a dict-like output.
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', return_dict=False)
# Set the model in evaluation mode.
model.eval()

sentences = [" a", " b"]
# Makes sure we have a long enough input to fill out the max input.
sentence = ''.join([sentences[0]] * 510)
# If you want to convert the model such that you can use a batch size greater than 1,
# you can pass in an array of sentences.
print(sentence)

# Tokenize the example sentence to get the inputs used for tracing.
example_input = tokenizer(sentence, padding=True, return_tensors='pt')
print(example_input)
input_ids = example_input['input_ids']
print(input_ids.shape)
attention_mask = example_input['attention_mask']

# Trace the model with a tuple of tensors (token_type_ids is left at its default).
example_input_tuple = (input_ids, attention_mask)
traced_model = torch.jit.trace(model, example_input_tuple, strict=False)
out = traced_model(input_ids, attention_mask)
print(out)

# Convert to a Core ML program using the Unified Conversion API.
mlmodel = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=input_ids.shape), ct.TensorType(shape=attention_mask.shape)]
)

# Save the converted model.
mlmodel.save("model2.mlpackage")
Hope that helps you/someone @smpanaro ! Good luck with your project
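For anyone wiring the converted model into a search feature, a sketch of the post-processing step may help. It assumes the model's first output is the last hidden state with shape (batch, tokens, hidden), and it follows the BGE model card's approach of taking the first ([CLS]) token's vector and L2-normalizing it. The helper names and the tiny vectors below are made up for illustration, and plain Python lists stand in for real model output.

```python
import math


def cls_embedding(last_hidden_state):
    """Take the first ([CLS]) token vector of one sequence and L2-normalize it."""
    cls = last_hidden_state[0]
    norm = math.sqrt(sum(x * x for x in cls))
    return [x / norm for x in cls]


def cosine_similarity(a, b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Made-up (tokens x hidden) outputs for two sentences; real BGE-small vectors have 384 dims.
doc = [[3.0, 4.0, 0.0], [0.1, 0.2, 0.3]]
query = [[4.0, 3.0, 0.0], [0.5, 0.5, 0.5]]

score = cosine_similarity(cls_embedding(doc), cls_embedding(query))
print(round(score, 2))
```

Normalizing once at embedding time means a similarity lookup is just a dot product, which keeps the search side simple whether it runs in Python or Swift.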
Thanks @michaeljelly! I didn't realize you'd get such a clean model straight from huggingface, that's pretty sweet.
@antmikinka I have read a couple of them. Unfortunately most of the ones I've read or skimmed require training a model from scratch, which is a bit out of my wheelhouse currently. If you see any pre-trained models (e.g. on huggingface), you could try to convert them (the script above is a good place to start).
Otherwise, I think the best bet is to wait and see which ones Apple incorporates into coremltools/iOS next year (e.g. I believe this year's training-time quantization is based in part on this paper from last year).
@smpanaro Apple dropped some new repos: apple/ml-vision-transformers-ane and another one from their ML team, ml-explore. Figured I would update you in case you hadn't seen them. Would love to see what you come up with. I'm still trying to understand the repos.
Any way you could enable a discussions area on this repo?
@antmikinka I have seen MLX — seems like more of a PyTorch competitor, but will be watching to see if they add anything special for CoreML. I had not seen ml-vision-transformers-ane! I turned on discussions and will drop some thoughts about it there :)
Also going to close this issue since the original purpose seems resolved.
Local embedding models are super helpful for app developers, as it means data and embeddings can be computed and stored locally!
Would they already work? Or would they work with a bit more Swift/Python conversion code?
Awesome work so far, king shit!