nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Support for multilingual embeddings in Embed4All #1310

Open handshape opened 1 year ago

handshape commented 1 year ago

Feature request

At present, Embed4All in the Python bindings is pinned to use ggml-all-MiniLM-L6-v2-f16, and it works brilliantly. It would be tremendously helpful to have at least one (aligned) multilingual embedding available.

Motivation

I'm Canadian, and our country is bilingual. My applications frequently involve semantic searches across user-submitted content, and being able to return a French result for an English query (and vice-versa) would be a big boon.

I've hand-cranked my model using sentence-transformers' distiluse-base-multilingual-cased-v1, but a quantized (smaller) version with a proper Embed4All binding would shrink my app footprint considerably.
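
For reference, the hand-cranked approach can be sketched roughly as below. This is a minimal illustration, not the code from my application: the helper names (`cosine`, `search`, `sentence_transformers_embed`) are made up for this example, and it assumes the `sentence-transformers` package is installed and can download the model on first use. The `embed` parameter is left pluggable on the assumption that a multilingual Embed4All model, if one were added, could be dropped in the same way.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: str,
    docs: List[str],
    embed: Callable[[List[str]], List[Vector]],
) -> List[Tuple[float, str]]:
    """Rank docs by similarity to the query, best match first.

    `embed` is any function mapping a list of texts to a list of vectors;
    with an aligned multilingual model, query and docs may differ in language.
    """
    vectors = embed([query] + docs)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def sentence_transformers_embed(texts: List[str]) -> List[Vector]:
    """Embed texts with the multilingual model mentioned above.

    Hypothetical helper: it reloads the model on every call for simplicity,
    so a real application would cache the SentenceTransformer instance.
    """
    # Imported lazily so the helpers above work without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
    return model.encode(texts).tolist()
```

Calling `search("Where is the cat sleeping?", ["Le chat dort sur le canapé.", "The invoice is due on Friday."], sentence_transformers_embed)` returns `(score, document)` pairs sorted by descending similarity, so an English query can be matched against French content.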

Your contribution

I'm more than willing to help test a branch and contribute tests, but don't have the expertise necessary to convert the tokenizers, nor to properly quantize the model.

zwilch commented 7 months ago

Is there a way to convert sentence-transformers' paraphrase-multilingual-MiniLM-L12-v2 https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2/tree/main to GGUF and load it? I need other languages for LocalDocs embeddings.