If you have an auto-embedding field in an empty collection and then import documents that already have the embedding field set, Typesense won't regenerate embeddings for those documents. Then, when you send a search query, only the query will use the built-in embedding model.
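For example, a minimal sketch of that flow with the Python client (collection name, API key, and host are placeholders):

```python
import typesense

client = typesense.Client({
    "api_key": "abc",  # placeholder credentials / host
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 10,
})

# Auto-embedding field: Typesense generates `embedding` from `content`,
# but only for documents that don't already carry an `embedding` value.
client.collections.create({
    "name": "docs",
    "fields": [
        {"name": "content", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["content"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
})

# Imported with the embedding already set, so it is not regenerated
client.collections["docs"].documents.create({
    "content": "hello world",
    "embedding": [0.01] * 384,  # placeholder; supply your real 384-dim vector
})

# At search time, only the query text is embedded by the built-in model
client.collections["docs"].documents.search({
    "q": "hello",
    "query_by": "embedding",
})
```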
Thanks a lot for the suggestion; we tried this and it seemed to work.
I wonder if there could be an alternative solution for generating the embeddings, though, one that doesn't require spawning a full Typesense server, but I am not sure how to do it exactly!
> an alternative solution for generating the embeddings that doesn't require spawning a full Typesense server
You could use a Python script like this to generate embeddings externally.
I have tried this route as well, but I wasn't able to figure out how to use the same Hugging Face model in the Python script. I tried using it like this:

```python
model = SentenceTransformer('all-MiniLM-L12-v2')
```

but it produces totally different embeddings. Any idea how to fix this?
Hello @jasonbosco,
I tried a few more things and I still can't get it to produce similar embeddings. Can you shed some light on how to replicate Typesense's auto-embedding generation without spawning a Typesense instance?
> but it produces totally different embeddings. Any idea how to fix this?
Could you elaborate on this? Any model will generate a set of floating point numbers for vectors, so it would be hard to tell which model was used to generate embeddings, just by looking at the vectors generated.
Could you share the exact code you're using and describe the problem you're running into, with one sample record?
I ran into the same problem: generating embeddings using SentenceTransformer in Python resulted in a vector that looks quite dissimilar from a vector generated using autoembeddings with the same model in Typesense. Here's my reproduction:
```python
from sentence_transformers import SentenceTransformer
import typesense

test_doc = {
    "id": "test",
    "title": "Fancy Document",
    "content": "hello world",
}

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
# encode() returns a numpy array; Typesense expects a plain list of floats
test_doc["embedding"] = model.encode(test_doc["content"]).tolist()

client = typesense.Client(
    {
        "api_key": "abc",
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "connection_timeout_seconds": 30,
    }
)

schema = {
    "name": "test",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "content", "type": "string"},
        # pre-generated embedding, supplied by the document
        {"name": "embedding", "type": "float[]", "num_dim": 384},
        # auto-generated embedding, computed by Typesense from `content`
        {
            "name": "autoembedding",
            "type": "float[]",
            "embed": {
                "from": ["content"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}

# collections.create() returns a dict, so address the collection by name
client.collections.create(schema)
client.collections["test"].documents.create(test_doc)

doc = client.collections["test"].documents["test"].retrieve()
doc["embedding"][0:10]
# [-0.07597321271896362, -0.005261972080916166, 0.011456318199634552, -0.06798461079597473, -0.0030688385013490915, -0.18362320959568024, 0.06599248945713043, 0.0294693261384964, -0.05323604866862297, 0.08215268701314926]
doc["autoembedding"][0:10]
# [-0.3613221049308777, -0.025025347247719765, 0.05448530614376068, -0.3233288526535034, -0.014595293439924717, -0.8732964992523193, 0.3138545751571655, 0.14015333354473114, -0.2531859576702118, 0.3907110095024109]
```
If I use autoembeddings for both indexing and querying, I have observed that I get search results similar to those I get when I compute embeddings with sentence-transformers for both the documents and the query, so the results seem "correct". But the embeddings do not appear to be interchangeable between Typesense-generated and sentence-transformers-generated vectors.
@xaptronic
If you used Python for generating embeddings but queried using Typesense, does that produce totally wrong results? Or just different results (similar but not exact)?
If I create an autoembedding field in Typesense but provide the embedding vectors (from my Python code) when I import the documents, and then rely on the Typesense autoembedding field to embed the queries, then, based on some testing, the results are not completely wrong: they bring up distantly related document snippets. But they are definitely not the same results as when autoembedding is used to embed both the documents and the query.
I found that the discrepancy between the embedding values Typesense autoembedding creates and those produced by sentence-transformers in Python is that Typesense does not perform any normalization after mean pooling. Not all sentence-transformers models have this, but the model I was using does have a normalization layer.
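For reference, that post-processing amounts to mean pooling followed by L2 normalization. Here is a minimal numpy sketch of it (array names are illustrative, not from any particular library):

```python
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, dim) raw transformer outputs
    # attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    mask = attention_mask[:, None].astype(np.float32)
    pooled = (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)
    # The final L2 normalization is the step the model's pipeline includes
    # and that autoembedding was not applying
    return pooled / np.linalg.norm(pooled)
```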
@xaptronic
Good catch. Although, technically, we do convert data vectors to unit vectors (i.e. normalize them) before indexing into the HNSW vector index (likewise for the query vector), so even though the stored values are different, the indexed values will be normalized and the searches should be identical.
We will look into this.
Right, I did see that normalization routine in index.cpp. If the sentence-transformers embeddings for a given model are already normalized, what would end up getting indexed when using pre-generated embeddings?
We have a step that converts both pre-generated and auto-generated embeddings into unit vectors before indexing into HNSW. While this should technically absorb the difference, it evidently doesn't. We will investigate this in more detail in a few days, once we wrap up some ongoing tasks.
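If a missing final normalization were the only difference, converting to unit vectors at index time would erase it, since normalizing an already-normalized vector is a no-op. A quick numpy check of that reasoning:

```python
import numpy as np

def to_unit(v):
    # the unit-vector conversion applied before indexing into HNSW
    return v / np.linalg.norm(v)

raw = np.random.rand(384).astype(np.float32)  # unnormalized, Typesense-style vector
pre = to_unit(raw)                            # pre-normalized, sentence-transformers-style

# Both index to the same unit vector, so searches should match
assert np.allclose(to_unit(raw), to_unit(pre), atol=1e-6)
```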
Here is a fully reproducible demo of how the same embedding model produces different search results for the same query. It uses a dataset I created from the Hugging Face, Typesense, and DeepLearning.AI documentation. I chunked the text content into chunks of at most 128 tokens, then used both sentence-transformers and Typesense to create embeddings (with the same model), and exported the Typesense embeddings back into the dataset so that it imports quickly for testing.
The dataset has two embedding columns, `embedding` and `ts_embedding`: `embedding` holds the embeddings created with sentence-transformers, and `ts_embedding` came from Typesense. I built a faiss / pandas DataFrame pair as the challenger mechanism.
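Roughly, the challenger setup looks like this (the dataset file name, the column holding the chunk text, and the query string are placeholders from my environment):

```python
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_parquet("docs_chunks.parquet")  # placeholder dataset path

def build_index(column):
    vecs = np.vstack(df[column].to_numpy()).astype(np.float32)
    faiss.normalize_L2(vecs)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

st_index = build_index("embedding")       # sentence-transformers vectors
ts_index = build_index("ts_embedding")    # Typesense autoembedding vectors

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
query = model.encode(["how do I create a collection?"]).astype(np.float32)
faiss.normalize_L2(query)

_, st_hits = st_index.search(query, 5)
_, ts_hits = ts_index.search(query, 5)
print(df.iloc[st_hits[0]]["content"].tolist())  # "content" column is an assumption
print(df.iloc[ts_hits[0]]["content"].tolist())
```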
In this example, different search terms sometimes produce similar results, and other times less similar. Some examples I tried:
My assumption is that, since these use the same model, the results should be identical. There could be a difference in how the similarity is computed in Typesense vs faiss; however, in various cases the results seem entirely wrong.
Hopefully this is helpful.
I used this C++ code, compiled with the Typesense Bazel setup, to verify that the embedding vectors generated by the ONNX model match the (unnormalized) embedding vectors produced by sentence_transformers.
@xaptronic Hi, we found a problem with our mean pooling function that causes miscalculation of the means of embeddings in some cases, which also leads to the difference between embeddings generated by sentence transformers and embeddings generated by our ONNX models. The fix (#1437) will be included in the next RC build and the upcoming 0.25.2 release.
@xaptronic
Can you please check against this build: 0.25.2.rc14
Description
We were trying auto-embedding on a new collection, but it used far too much of the server's resources and eventually caused the server to crash.
I was thinking of using the same model to embed the data before indexing, so that Typesense would only need to generate embeddings for the query at runtime.
I tried multiple approaches, yet there doesn't seem to be a way to have the schema use already-generated embeddings for the documents while still generating embeddings for the query internally.
Steps to reproduce
- clone https://github.com/typesense/typesense-instantsearch-semantic-search-demo
- update the schema's auto-embedding field so it can accept pre-generated vectors, e.g. by just removing the `from` field (see the sketch after these steps)
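For concreteness, a sketch of the schema variant that step produces (collection and field names are placeholders); this is what gets rejected:

```python
# Dropping `from` so documents can carry pre-generated vectors while the
# model would still be used to embed queries. Typesense 0.25.1 does not
# accept this schema.
schema = {
    "name": "docs",
    "fields": [
        {"name": "content", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                # "from": ["content"],  # removed, per the step above
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}
```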
Expected Behavior
You are able to query the collection normally, with the embedding for the query generated automatically.
Actual Behavior
The schema is not accepted, and Typesense still needs to generate the embeddings automatically on the server.
Metadata
Typesense Version: 0.25.1
OS: Ubuntu
Hi, can you please share the fix? Specifically, did you change any of the call signatures in the index.js file?
@elihoole Unfortunately, I couldn't find a straightforward way to generate the embeddings before indexing into Typesense; everything I tried produced a different vector array.
What I ended up doing was indexing the data on a server other than the production server, exporting the data together with its embeddings, and then indexing that into the production server. That way the production server no longer needs to generate embeddings, since they are already part of the indexed data.
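Roughly, with the Python client the flow looks like this (hostnames, keys, and the collection name are placeholders):

```python
import typesense

staging = typesense.Client({
    "api_key": "staging-key",
    "nodes": [{"host": "staging.example.com", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 60,
})
prod = typesense.Client({
    "api_key": "prod-key",
    "nodes": [{"host": "prod.example.com", "port": "8108", "protocol": "http"}],
    "connection_timeout_seconds": 60,
})

# JSONL export of all documents, stored embedding values included
jsonl = staging.collections["docs"].documents.export()

# Import into production; the documents carry their embedding field, so
# Typesense keeps the provided vectors instead of regenerating them.
prod.collections["docs"].documents.import_(jsonl, {"action": "upsert"})
```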