marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

Can't find any configuration for OpenAI Embeddings #875

Open HasnainKhanNiazi opened 1 week ago

HasnainKhanNiazi commented 1 week ago

Hey, I have been playing around with Marqo and, after several experiments, I have a few questions.

  1. I can use any model listed here to generate embeddings and create an index: https://docs.marqo.ai/2.8/Guides/Models-Reference/list_of_models/. But how can I use OpenAI text-embedding-3-large, or any other model that is not available on Hugging Face?
  2. I used hf/multilingual-e5-large (https://huggingface.co/intfloat/multilingual-e5-large) to generate embeddings and create an index, and now search isn't working as expected. For example, if I search for "bike", the first 3-4 retrieved documents are not even related to bikes.
  3. I have also tried using filter_string, but with filter_string the result is an empty list.

This is how I am using the Marqo index:

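import marqo

# `data` is assumed to be a pandas DataFrame with the columns
# "title", "markdown" and "attributes".
docs = []

for index, row in data.iterrows():
    local_dict = {}
    local_dict["title"] = row["title"]
    local_dict["description"] = row["markdown"]
    local_dict["attributes"] = row["attributes"]
    docs.append(local_dict)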

mq = marqo.Client(url='http://localhost:8882')
results = mq.index("my-first-index").delete()
mq.create_index("my-first-index", model='hf/multilingual-e5-large')

mq.index("my-first-index").add_documents(docs,
    tensor_fields=["title", "description", "attributes"], client_batch_size=64
)

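# Range filters use an uppercase TO; note that `price` is not a field of the documents indexed above.
results_with_filters = mq.index("my-first-index").search(
    q="Bike", filter_string="price:[0 TO 1000]"
)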

results_without_filters = mq.index("my-first-index").search(
    q="Bike"
)

Neither of the queries above works as expected. Any help or guidance would be appreciated. Thanks!
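One thing worth checking in the snippet above: the indexed documents carry only `title`, `description` and `attributes`, so a filter on `price` has no field to match against. The following is a minimal pre-flight sketch in plain Python that makes no Marqo calls. The helper names `range_filter` and `docs_missing_field` and the sample documents are made up for illustration, and the uppercase `TO` reflects the range syntax shown in Marqo's filter documentation.

```python
def range_filter(field, low, high):
    """Build a range filter string such as 'price:[0 TO 1000]'."""
    return f"{field}:[{low} TO {high}]"


def docs_missing_field(docs, field):
    """Return the positions of documents that do not contain `field`."""
    return [i for i, doc in enumerate(docs) if field not in doc]


# Hypothetical sample documents, shaped like the ones built in the loop above.
docs = [
    {"title": "City bike", "description": "A light commuter bike",
     "attributes": "color: red"},
    {"title": "Mountain bike", "description": "Full suspension",
     "attributes": "color: black", "price": 950},
]

filter_string = range_filter("price", 0, 1000)
missing = docs_missing_field(docs, "price")

print(filter_string)
print(f"{len(missing)} of {len(docs)} documents have no 'price' field: {missing}")
```

If every document shows up in `missing`, an empty result for the filtered query is what one would expect, independent of the embedding model.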

tomhamer commented 1 week ago

Hey Hasnain, thanks for reaching out! Have you tried the regular e5 large embeddings? These perform significantly better in English. If you need multilingual embeddings, OpenAI doesn't support those at the moment. In any case, we don't currently support OpenAI embeddings in Marqo.

Another option to get better performance would be to sign up for Marqtune, so you can fine-tune your embeddings for your use case.

wanliAlex commented 1 week ago

Another option is to generate your embeddings outside Marqo and use the custom embeddings feature when indexing documents and searching. The Marqo documentation describes how to index documents with custom vectors and how to search with your custom vectors.
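A rough sketch of what that could look like on the client side, in plain Python. It only builds the payloads; `fake_embed` is a stand-in for whatever external embedding call is used (for example an OpenAI request), and the helper names and the field name `description_vector` are hypothetical. The payload shapes (a field holding `content` and `vector`, a `custom_vector` mapping, and a `context` with `tensor` entries) follow my reading of the Marqo custom-vector docs and should be checked against the version in use.

```python
def build_custom_vector_docs(rows, embed, field="description_vector"):
    """Attach an externally computed vector to each document."""
    docs = []
    for row in rows:
        docs.append({
            "title": row["title"],
            # Assumed shape of a custom vector field: the text plus its vector.
            field: {"content": row["description"],
                    "vector": embed(row["description"])},
        })
    return docs


def build_query_context(vector, weight=1.0):
    """Wrap a query vector in the structure passed as `context` to search."""
    return {"tensor": [{"vector": vector, "weight": weight}]}


def fake_embed(text):
    """Placeholder embedding; replace with a real embedding call."""
    return [float(len(text)), float(text.count(" ")), 1.0]


rows = [{"title": "City bike", "description": "red bike"}]
docs = build_custom_vector_docs(rows, fake_embed)
context = build_query_context(fake_embed("bike"))

# With a running Marqo instance the payloads would be used roughly like this
# (unverified; assumes an index whose vector dimension matches the embeddings):
#   mq.index("my-first-index").add_documents(
#       docs,
#       tensor_fields=["description_vector"],
#       mappings={"description_vector": {"type": "custom_vector"}},
#   )
#   mq.index("my-first-index").search(q=None, context=context)
print(docs[0]["description_vector"]["vector"])
print(context)
```

The same embedding function has to be used for documents and queries, otherwise the vectors are not comparable.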

HasnainKhanNiazi commented 1 week ago

Thanks @tomhamer @wanliAlex for the suggestions. My documents are multilingual (German, Italian, English). I will check out the custom embeddings section as well.

One follow-up question about filter_string: as far as I understand, the value to filter on has to be added as a separate field.

For example, if I add price like this, filtering works fine:

mq.index("my-first-index").add_documents([
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing Polo's travels"
    }, 
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection, "
                       "mobility, life support, and communications for astronauts;  'price': '100'",
        "_id": "article_591",
        'price': '100',
    }],
    tensor_fields=["Description"]
)

But if the price is only written somewhere in the description, filter_string doesn't work.

The main problem in my case is that if I keep adding a new field for each attribute, I will end up with around 2000 fields, which is far too many. That's why I am looking for a way to do matching or fuzzy matching against the description.
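One possible middle ground, sketched below, is to promote only the few attributes that need filtering (here just the price) into their own fields before indexing and leave everything else in the description. This is plain Python with no Marqo calls. The helper `promote_price` and the regular expression are hypothetical and assume the price appears in the text in the form `'price': '100'`, as in the example document above. Converting to a number is based on the assumption that range filters expect numeric values.

```python
import re

# Matches e.g.  'price': '100'   price: 100   "price"="99.5"
PRICE_PATTERN = re.compile(
    r"""['"]?price['"]?\s*[:=]\s*['"]?(\d+(?:\.\d+)?)['"]?""",
    re.IGNORECASE,
)


def promote_price(doc, source_field="Description"):
    """Return a copy of `doc` with a numeric 'price' field when one can be found."""
    enriched = dict(doc)
    if "price" in enriched:
        # An explicit price wins; convert strings such as '100' to a number.
        enriched["price"] = float(enriched["price"])
        return enriched
    match = PRICE_PATTERN.search(enriched.get(source_field, ""))
    if match:
        enriched["price"] = float(match.group(1))
    return enriched


docs = [
    {"Title": "The Travels of Marco Polo",
     "Description": "A 13th-century travelogue describing Polo's travels"},
    {"Title": "Extravehicular Mobility Unit (EMU)",
     "Description": "The EMU is a spacesuit for astronauts;  'price': '100'"},
]

enriched_docs = [promote_price(doc) for doc in docs]
for doc in enriched_docs:
    print(doc["Title"], "->", doc.get("price"))
```

Documents without a recognizable price are left unchanged, so they simply won't match a price filter.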

HasnainKhanNiazi commented 1 week ago

@tomhamer

If you need multilingual embeddings, OpenAI doesn't support those at the moment.

What do you mean by this line? OpenAI text-embedding-3-large is multilingual, and for simple vector search it gives me better results than any open-source model I have compared it with. However, I also want to do keyword-style search, such as with filter_string.