marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0
4.61k stars 189 forks source link

[ENHANCEMENT] Support for prefixing and suffixing of text #585

Open OwenPendrighElliott opened 1 year ago

OwenPendrighElliott commented 1 year ago

Is your feature request related to a problem? Please describe. Many new SOTA models for retrieval are trained with prefixes on the queries and documents, thus they expect these at inference as well. Failure to provide these degrades performance. Currently there is no way in Marqo to add a prefix or suffix to every chunk in a document as text pre-processing occurs internally.

I don't know of any models with a requirement for a specific suffix so the suffixing part can be removed if we don't want to pre-empt a use case that doesn't exist yet.

Describe the solution you'd like New optional parameters are added to search and add_documents to enable prefixes like so:

import marqo

mq = marqo.Client()

mq.index("my-index").add_documents(
    [{"_id": 1, "text": "lorum ipsum"}],
    text_chunk_prefix="passage: ",
    text_chunk_suffix="",
    tensor_fields=["text"]
)

mq.index("my-index").search(
    "lorum ipsum",
    query_prefix="query: ",
    query_suffix=""
)

mq.index("my-index").search(
    {"lorum": 1.0, "ipsum": 0.2},
    query_prefix="query: ",
    query_suffix=""
)

When the text is chunked internally, each chunk should have the prefix added to it so that it can be tokenised and passed to the model to match the prefixing applied in training.

For queries the prefix is added to the front of every query that isn't a URL. It should be added to the front of every query in a weighted query.

Describe alternatives you've considered These could be part of the index defaults however if a model appears in the future with different prefixes for different sub tasks (or something similar) then this won't be a good approach because it locks you in. One nice thing though is that this approach means users won't forget to add them and the correct prefix can me included in the default models.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False, 
        "text_preprocessing": {
            "split_method": "sentence", 
            "split_length": 3, 
            "split_overlap": 1,
        }, 
        "model_properties": {
            "name": "intfloat/e5-base-v2", 
            "type": "hf", 
            "dimensions": 768,
            "query_prefix": "query: ",
            "query_suffix": "",
            "text_chunk_prefix": "passage: ",
            "text_chunk_suffix": ""
        }, 
        "model": "hf-e5-base-v2", 
        "normalize_embeddings": True, 
    },
}

Additional context

pandu-k commented 1 year ago

@OwenPendrighElliott thanks for this!

Is it possible to add this to the model registry, so that it is hidden from the main add_docs/search workflows? Then, we can override these values by user specified values if needed?

OwenPendrighElliott commented 1 year ago

Yep, we could combine the two solutions I proposed. Have it as part of index defaults and have an override in add_documents and search. This would mean that models from the registry could have them provided out of the box, users could set them for the index with custom models, and you could override them in the code if you wanted.

How did you envisage the override? I think adding parameters to the add_documents and search is sensible. Perhaps we start with just the prefix and drop the suffix for the moment?

pandu-k commented 1 year ago

are there any models that expect a suffix? if not then we can leave it out of scope until a need arises.

Yep I think the overrides as add_documents or search parameters, as you suggested, is fine

OwenPendrighElliott commented 1 year ago

Not that I am aware of, lets leave it out.

pandu-k commented 1 year ago

Proposal

For models in the model registry

The default text_query_prefix and text_chunk_prefix are defined in the model_registry entry, and so are not immediately visible to non-power users:

'SOME/MODEL':  {
    "name": "SOME/MODEL",
    "dimensions": 512,
    "type": "clip",
    "text_query_prefix": "query:",
    "text_chunk_prefix": "passage:"
 }

Because this is added to the registry, index settings will not change::

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "SOME/MODEL",
        "normalize_embeddings": true
        "text_preprocessing": {

        }
    },

}

By default, you do not need to specify the prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "query: What is the best outfit to wear on the moon?"
}

But you can override the default prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"],
    text_chunk_prefix="this is a passage: ",
)

# search:
search_res = my_index.search(
    q="interesting books",
    text_query_prefix="This is a question: "
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "this is a passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "This is a question: What is the best outfit to wear on the moon?"
}

For generic models

If you really want to change the defaults prefixes on a registry model, you should load it using the generic model method.

Index settings:

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": true,
        "model": "my-model-alias",
        "model_properties": {
            "name": "SOME/MODEL",
            "dimensions": 512,
            "type": "clip",
            "text_query_prefix": "BEGIN QUERY:",
            "text_chunk_prefix": "BEGIN PASSAGE:"
        },
        "normalize_embeddings": True,
    }
}

Other behaviour is the same as registry-defined models:

pandu-k commented 1 year ago

Update to response: remove prefix from search responses. Instead a dedicated field is used to communicate this.

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,
      "_text_chunk_prefix": "passage: ",
      "_highlights": {"Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,
   "textQueryPrefix": "query: ",
   "query": "What is the best outfit to wear on the moon?"
}

Edge cases:

pandu-k commented 1 year ago

Decision:

vicilliar commented 12 months ago

@pandu-k clarifying your last comment: We will no longer be allowing overriding of prefixes at index creation time? Or at add docs/search time? Or both?

vicilliar commented 12 months ago

Another issue: it feels unintuitive that in the response, the field name for a doc chunk prefix is _text_chunk_prefix, but the name for search prefix is textQueryPrefix, especially since they're worded in the same way and users will always use snake case to define their prefixes.

vicilliar commented 12 months ago

Under the edge case:

If these prefixes are defined as empty strings, they will not be returned in the response

This means defined in the model registry right? In this case, if the model registry explicitly defines the prefix as empty string "", I feel it should be returned in the response as "".

vicilliar commented 12 months ago

Clarifications: