minthemiddle / llm-embed-jina

Embedding models from Jina AI
Apache License 2.0

Bug: Embeddings not available yet #1

Open minthemiddle opened 7 months ago

minthemiddle commented 7 months ago

The actual embeddings are not available yet. When you run the example in the repo directly:

from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
print(cos_sim(embeddings[0], embeddings[1]))

You get this error:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/jinaai/jina-embeddings-v2-base-de/resolve/main/config.json

I will watch the Files tab on Hugging Face (HF) to see when the model files are made available.
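Instead of checking the Files tab by hand, a small script could poll for config.json. This is only a sketch: the helper names config_url and model_files_available are made up for illustration, the URL pattern is copied from the 404 error above, and it assumes an anonymous request gets an HTTP error status while the files are missing or require a login.

```python
import urllib.error
import urllib.request


def config_url(repo_id: str, revision: str = "main") -> str:
    """Build the config.json URL, following the pattern in the 404 error above."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/config.json"


def model_files_available(repo_id: str, timeout: float = 10.0) -> bool:
    """Return True if config.json can be fetched anonymously (hypothetical helper)."""
    request = urllib.request.Request(config_url(repo_id), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Any HTTP error status ends up here, e.g. 404 (not published yet)
        # or 401/403 (login required). Network failures are not caught.
        return False


# Example call (needs network access):
# model_files_available("jinaai/jina-embeddings-v2-base-de")
```

A False result only means the file is not publicly readable; it cannot tell a missing file apart from one that needs a login.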

minthemiddle commented 7 months ago

Here are the changes I prepared so far:

diff --git a/llm_embed_jina.py b/llm_embed_jina.py
index 789b5b1..e2b0beb 100644
--- a/llm_embed_jina.py
+++ b/llm_embed_jina.py
@@ -10,6 +10,7 @@ def register_embedding_models(register):
         "jina-embeddings-v2-small-en",
         "jina-embeddings-v2-base-en",
         "jina-embeddings-v2-large-en",
+        "jina-embeddings-v2-base-de",
     ):
         register(JinaEmbeddingModel(model_id))

diff --git a/tests/test_embed_jina.py b/tests/test_embed_jina.py
index 9450ed0..9b4ff95 100644
--- a/tests/test_embed_jina.py
+++ b/tests/test_embed_jina.py
@@ -9,6 +9,12 @@ def test_jina_embed_small():
     assert len(floats) == 512
     assert all(isinstance(f, float) for f in floats)

+def test_jina_embed_german():
+    model = llm.get_embedding_model("jina-embeddings-v2-base-de")
+    floats = model.embed("hallo welt")
+    assert len(floats) == 768
+    assert all(isinstance(f, float) for f in floats)
+

 def test_jina_embed_long_string():
     model = llm.get_embedding_model("jina-embeddings-v2-small-en")

minthemiddle commented 7 months ago

Also some planned changes to README.md:

diff --git a/README.md b/README.md
index dabe72f..3d24b50 100644
--- a/README.md
+++ b/README.md
@@ -23,12 +23,14 @@ Install this plugin in the same environment as [LLM](https://llm.datasette.io/).

 ## Usage

-This plugin adds support for three new embedding models:
+This plugin adds support for four new embedding models:

 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
 - [`jina-embeddings-v2-large-en`](https://huggingface.co/jinaai/jina-embeddings-v2-large-en): 435 million parameters - not yet released, but it will work once it has been released.

+- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): XXX million parameters - not yet released, but it will work once it has been released.
+
 The models will be downloaded the first time you try to use them.

 See [the LLM documentation](https://llm.datasette.io/en/stable/embeddings/index.html) for everything you can do.

I could not find any reference to the model's parameter count, hence the XXX placeholder.

minthemiddle commented 7 months ago

About the open-source release date:

We will make this model available in the AWS Sagemaker marketplace for Amazon cloud users and for download on HuggingFace very soon.

Source: Release News

minthemiddle commented 7 months ago

There are now files, but they are only visible when you are logged in to Hugging Face (HF). To run the model locally, the machine has to be logged in to HF as well.

Then a script like the following works:

from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?', 'Wie heißt dieser Hund?'])
print(cos_sim(embeddings[0], embeddings[1]))
print(cos_sim(embeddings[0], embeddings[2]))

This returns the expected cosine similarities (high for the translated pair, low for the unrelated sentence):

0.9602111
0.06511246
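To read these two numbers: cosine similarity is the dot product of two vectors divided by the product of their lengths, so values near 1 mean the vectors point the same way and values near 0 mean they are unrelated. Below is a dependency-free sketch of the same formula as the cos_sim lambda, using made-up toy vectors rather than real embeddings.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, not real embeddings.
same_direction = cos_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cos_sim([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors, 0
```

The real embeddings have 768 dimensions instead of 2, but the calculation is the same.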

On my Apple M2 (2022, 24 GB RAM), the embedding script took just under 4 s of wall-clock time when run with time python3 script.py:

python3 jina-embeddings-de-en.py 3,13s user 3,03s system 162% cpu 3,799 total
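Since the files are only visible to logged-in users, the script needs a Hugging Face access token from somewhere. One way to pass it explicitly is sketched below. This is not necessarily the login method used here: the HF_TOKEN variable name and both helper functions are assumptions for illustration, and the token argument is the spelling used by recent transformers releases.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read an access token from the environment (the variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face access token first.")
    return token


def load_model(repo_id: str = "jinaai/jina-embeddings-v2-base-de"):
    """Load the model with an explicit token (hypothetical helper, not part of the plugin)."""
    # Imported lazily so the token helper works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,  # needed to use the encode method
        token=get_hf_token(),    # older transformers releases used use_auth_token
    )
```

Reading the token from the environment keeps it out of the script and out of version control.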

minthemiddle commented 7 months ago

I confirmed the dimensionality of the embedding (768):

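import numpy as np

# Continues the script above, so model is already loaded.
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?', 'Wie heißt dieser Hund?'])

# Convert the first embedding to an array and inspect its shape.
first_embedding = np.array(embeddings[0])

print(f'Dimensions: {first_embedding.shape}')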

minthemiddle commented 7 months ago

On a 2018 Mac Mini (i3, 8 GB RAM), the two-sentence version of the script is considerably slower:

print(cos_sim(embeddings[0], embeddings[1]))

python3 jina-embed-de-en.py 7,36s user 11,38s system 33% cpu 55,454 total

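Nearly 1 minute of wall-clock time for 2 embeddings. The CPU was only 33% busy, so most of that time was probably spent waiting (for example on loading the model) rather than computing.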
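The time totals cover the whole process: interpreter start-up, imports, model loading and the encode call. To see how much of the 55 s is the embedding step itself, the phases could be timed separately. A minimal sketch follows; timed is a hypothetical helper, and a stand-in workload is used so that it runs without the model installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without the model installed.
result, seconds = timed(sum, range(1000))
print(f"stand-in workload took {seconds:.6f}s")

# With the script above, loading and encoding could be timed separately:
# model, load_s = timed(AutoModel.from_pretrained,
#                       'jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)
# embeddings, encode_s = timed(model.encode,
#                              ['How is the weather today?', 'Wie ist das Wetter heute?'])
```

If loading dominates, a long-running process that keeps the model in memory would pay that cost only once.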