If you have an auto-embedding field in an empty collection and then import documents that already have the embedding field set, Typesense won't regenerate embeddings for those documents. Then, when you send a search query, only the query will use the built-in embedding model.
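For example, a minimal sketch of that flow with the Python client (collection name, API key, and host are placeholders):

```python
import typesense

client = typesense.Client({
    "api_key": "abc",  # placeholder credentials / host
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 10,
})

# Auto-embedding field: Typesense generates `embedding` from `content`,
# but only for documents that don't already carry an `embedding` value.
client.collections.create({
    "name": "docs",
    "fields": [
        {"name": "content", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["content"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
})

# Imported with the embedding already set, so it is not regenerated
client.collections["docs"].documents.create({
    "content": "hello world",
    "embedding": [0.01] * 384,  # placeholder; supply your real 384-dim vector
})

# At search time, only the query text is embedded by the built-in model
client.collections["docs"].documents.search({
    "q": "hello",
    "query_by": "embedding",
})
```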
Thanks a lot for the suggestion; we tried this and it seemed to work.
I wonder if there could be an alternative solution for generating the embeddings, though, one that doesn't require spawning a full Typesense server, but I am not sure how to do it exactly!
> an alternative solution for generating the embeddings that doesn't require spawning a full Typesense server
You could use a Python script like this to generate embeddings externally.
I have tried this route as well, but I wasn't able to figure out how to use the same Hugging Face model in the Python script. I tried using it like this:

```python
model = SentenceTransformer('all-MiniLM-L12-v2')
```

but it produces totally different embeddings. Any idea how to fix this?
Hello @jasonbosco,
I tried a few more things and I still can't get it to produce similar embeddings. Can you shed some light on how to replicate Typesense's auto-embedding generation without spawning a Typesense instance?
> but it produces totally different embeddings. Any idea how to fix this?
Could you elaborate on this? Any model will generate a set of floating point numbers for vectors, so it would be hard to tell which model was used to generate embeddings, just by looking at the vectors generated.
Could you share the exact code you're using and describe the problem you're running into, with one sample record?
I ran into the same problem: generating embeddings using SentenceTransformer in Python resulted in a vector that looks quite dissimilar from a vector generated using autoembeddings with the same model in Typesense. Here's my reproduction:
```python
from sentence_transformers import SentenceTransformer
import typesense

test_doc = {
    "id": "test",
    "title": "Fancy Document",
    "content": "hello world",
}

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
# encode() returns a numpy array; Typesense expects a plain list of floats
test_doc["embedding"] = model.encode(test_doc["content"]).tolist()

client = typesense.Client(
    {
        "api_key": "abc",
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "connection_timeout_seconds": 30,
    }
)

schema = {
    "name": "test",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "content", "type": "string"},
        # pre-generated embedding, supplied by the document
        {"name": "embedding", "type": "float[]", "num_dim": 384},
        # auto-generated embedding, computed by Typesense from `content`
        {
            "name": "autoembedding",
            "type": "float[]",
            "embed": {
                "from": ["content"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}

# collections.create() returns a dict, so address the collection by name
client.collections.create(schema)
client.collections["test"].documents.create(test_doc)

doc = client.collections["test"].documents["test"].retrieve()
doc["embedding"][0:10]
# [-0.07597321271896362, -0.005261972080916166, 0.011456318199634552, -0.06798461079597473, -0.0030688385013490915, -0.18362320959568024, 0.06599248945713043, 0.0294693261384964, -0.05323604866862297, 0.08215268701314926]
doc["autoembedding"][0:10]
# [-0.3613221049308777, -0.025025347247719765, 0.05448530614376068, -0.3233288526535034, -0.014595293439924717, -0.8732964992523193, 0.3138545751571655, 0.14015333354473114, -0.2531859576702118, 0.3907110095024109]
```
If I use autoembeddings for both indexing and querying, I have observed that I get search results similar to those I get when I compute embeddings with sentence-transformers for both the documents and the query, so the results seem "correct". But the embeddings do not appear to be interchangeable between Typesense-generated and sentence-transformers-generated vectors.
@xaptronic
If you used Python for generating embeddings but queried using Typesense, does that produce totally wrong results? Or just different results (similar but not exact)?
If I create an autoembedding field in Typesense but provide the embedding vectors (from my Python code) when I import the documents, and then rely on the Typesense autoembedding field to embed the queries, then, based on some testing, the results are not completely wrong: they bring up distantly related document snippets. But they are definitely not the same results as when autoembedding is used to embed both the documents and the query.
I found that the discrepancy between the embedding values Typesense autoembedding creates and those produced by sentence-transformers in Python is that Typesense does not perform any normalization after mean pooling. Not all sentence-transformers models have this, but the model I was using does have a normalization layer.
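For reference, that post-processing amounts to mean pooling followed by L2 normalization. Here is a minimal numpy sketch of it (array names are illustrative, not from any particular library):

```python
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, dim) raw transformer outputs
    # attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    mask = attention_mask[:, None].astype(np.float32)
    pooled = (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)
    # The final L2 normalization is the step the model's pipeline includes
    # and that autoembedding was not applying
    return pooled / np.linalg.norm(pooled)
```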
@xaptronic
Good catch. Although, technically, we do convert data vectors to unit vectors (i.e. normalize them) before indexing into the HNSW vector index (likewise for the query vector), so even though the stored values are different, the indexed values will be normalized and the searches should be identical.
We will look into this.
Right, I did see that normalization routine in index.cpp. If the sentence-transformers embeddings for a given model are already normalized, what would end up getting indexed when using pre-generated embeddings?
We have a step that converts both pre-generated and auto-generated embeddings into unit vectors before indexing into HNSW. While this should technically absorb the difference, it evidently doesn't. We will investigate this in more detail in a few days, once we wrap up some ongoing tasks.
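If a missing final normalization were the only difference, converting to unit vectors at index time would erase it, since normalizing an already-normalized vector is a no-op. A quick numpy check of that reasoning:

```python
import numpy as np

def to_unit(v):
    # the unit-vector conversion applied before indexing into HNSW
    return v / np.linalg.norm(v)

raw = np.random.rand(384).astype(np.float32)  # unnormalized, Typesense-style vector
pre = to_unit(raw)                            # pre-normalized, sentence-transformers-style

# Both index to the same unit vector, so searches should match
assert np.allclose(to_unit(raw), to_unit(pre), atol=1e-6)
```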
Here is a fully reproducible demo of how the same embedding model produces different search results for the same query. It uses a dataset I created from the Hugging Face, Typesense, and DeepLearning.AI documentation. I chunked the text content into chunks of at most 128 tokens, then used both sentence-transformers and Typesense to create embeddings (with the same model), and exported the Typesense embeddings back into the dataset so that it imports quickly for testing.
The dataset has two embedding columns, `embedding` and `ts_embedding`: `embedding` holds the embeddings created with sentence-transformers, and `ts_embedding` came from Typesense. I built a faiss / pandas DataFrame pair as the challenger mechanism.
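Roughly, the challenger setup looks like this (the dataset file name, the column holding the chunk text, and the query string are placeholders from my environment):

```python
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_parquet("docs_chunks.parquet")  # placeholder dataset path

def build_index(column):
    vecs = np.vstack(df[column].to_numpy()).astype(np.float32)
    faiss.normalize_L2(vecs)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

st_index = build_index("embedding")       # sentence-transformers vectors
ts_index = build_index("ts_embedding")    # Typesense autoembedding vectors

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
query = model.encode(["how do I create a collection?"]).astype(np.float32)
faiss.normalize_L2(query)

_, st_hits = st_index.search(query, 5)
_, ts_hits = ts_index.search(query, 5)
print(df.iloc[st_hits[0]]["content"].tolist())  # "content" column is an assumption
print(df.iloc[ts_hits[0]]["content"].tolist())
```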
In this example, different search terms sometimes produce similar results, and other times less similar. Some examples I tried:
My assumption is that, since these use the same model, the results should be identical. There could be a difference in how the similarity is computed in Typesense vs faiss; however, in various cases the results seem entirely wrong.
Hopefully this is helpful.
I used this C++ code, compiled with the Typesense Bazel setup, to verify that the embedding vectors generated by the ONNX model match the (unnormalized) embedding vectors produced by sentence_transformers.
@xaptronic Hi, we found a problem with our mean pooling function that causes miscalculation of the means of embeddings in some cases, which also leads to the difference between embeddings generated by sentence transformers and embeddings generated by our ONNX models. The fix (#1437) will be included in the next RC build and the upcoming 0.25.2 release.
@xaptronic
Can you please check against this build: 0.25.2.rc14
Description
We were trying auto-embedding on a new collection, but it used far too much of the server's resources and eventually caused the server to crash.
I was thinking of using the same model to embed the data before indexing, so that Typesense would only need to generate embeddings for the query at runtime.
I tried multiple approaches, yet there doesn't seem to be a way to have the schema use already-generated embeddings for the documents while still generating embeddings for the query internally.
Steps to reproduce
- clone https://github.com/typesense/typesense-instantsearch-semantic-search-demo
- update the schema's auto-embedding field so it can accept pre-generated vectors, e.g. by just removing the `from` field (see the sketch after these steps)
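For concreteness, a sketch of the schema variant that step produces (collection and field names are placeholders); this is what gets rejected:

```python
# Dropping `from` so documents can carry pre-generated vectors while the
# model would still be used to embed queries. Typesense 0.25.1 does not
# accept this schema.
schema = {
    "name": "docs",
    "fields": [
        {"name": "content", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                # "from": ["content"],  # removed, per the step above
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}
```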
Expected Behavior
You are able to query the collection normally, with the embedding for the query generated automatically.
Actual Behavior
The schema is not accepted, and Typesense still needs to generate the embeddings automatically on the server.
Metadata
Typesense Version: 0.25.1
OS: Ubuntu
Hi, can you please share the fix? Specifically, did you change any of the call signatures in the index.js file?
@elihoole Unfortunately, I couldn't find a straightforward way to generate the embeddings before indexing into Typesense; everything I tried produced a different vector array.
What I ended up doing was indexing the data on a server other than the production server, exporting the data together with its embeddings, and then indexing that into the production server. That way the production server no longer needs to generate embeddings, since they are already part of the indexed data.
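Roughly, with the Python client the flow looks like this (hostnames, keys, and the collection name are placeholders):

```python
import typesense

staging = typesense.Client({
    "api_key": "staging-key",
    "nodes": [{"host": "staging.example.com", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 60,
})
prod = typesense.Client({
    "api_key": "prod-key",
    "nodes": [{"host": "prod.example.com", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 60,
})

# JSONL export of all documents, stored embedding values included
jsonl = staging.collections["docs"].documents.export()

# Import into production; the documents carry their embedding field, so
# Typesense keeps the provided vectors instead of regenerating them.
prod.collections["docs"].documents.import_(jsonl, {"action": "upsert"})
```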