OwenPendrighElliott commented 1 year ago

Is your feature request related to a problem? Please describe. Many new SOTA models for retrieval are trained with prefixes on the queries and documents, thus they expect these at inference as well. Failure to provide these degrades performance. Currently there is no way in Marqo to add a prefix or suffix to every chunk in a document as text pre-processing occurs internally.

I don't know of any models with a requirement for a specific suffix so the suffixing part can be removed if we don't want to pre-empt a use case that doesn't exist yet.

Describe the solution you'd like New optional parameters are added to search and add_documents to enable prefixes like so:

import marqo

mq = marqo.Client()

mq.index("my-index").add_documents(
    [{"_id": 1, "text": "lorum ipsum"}],
    text_chunk_prefix="passage: ",
    text_chunk_suffix="",
    tensor_fields=["text"]
)

mq.index("my-index").search(
    "lorum ipsum",
    query_prefix="query: ",
    query_suffix=""
)

mq.index("my-index").search(
    {"lorum": 1.0, "ipsum": 0.2},
    query_prefix="query: ",
    query_suffix=""
)

When the text is chunked internally, each chunk should have the prefix added to it so that it can be tokenised and passed to the model to match the prefixing applied in training.

For queries the prefix is added to the front of every query that isn't a URL. It should be added to the front of every query in a weighted query.

Describe alternatives you've considered These could be part of the index defaults however if a model appears in the future with different prefixes for different sub tasks (or something similar) then this won't be a good approach because it locks you in. One nice thing though is that this approach means users won't forget to add them and the correct prefix can me included in the default models.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False, 
        "text_preprocessing": {
            "split_method": "sentence", 
            "split_length": 3, 
            "split_overlap": 1,
        }, 
        "model_properties": {
            "name": "intfloat/e5-base-v2", 
            "type": "hf", 
            "dimensions": 768,
            "query_prefix": "query: ",
            "query_suffix": "",
            "text_chunk_prefix": "passage: ",
            "text_chunk_suffix": ""
        }, 
        "model": "hf-e5-base-v2", 
        "normalize_embeddings": True, 
    },
}

Additional context

BAAI BGE Expects Represent this sentence for searching relevant passages: as a prefix https://github.com/FlagOpen/FlagEmbedding
E5 expects query: in front of queries and passage: in front of passages https://huggingface.co/intfloat/e5-large
E5 V2 expects query: in front of queries and passage: in front of passages https://huggingface.co/intfloat/e5-large-v2

pandu-k commented 1 year ago

@OwenPendrighElliott thanks for this!

Is it possible to add this to the model registry, so that it is hidden from the main add_docs/search workflows? Then, we can override these values by user specified values if needed?

OwenPendrighElliott commented 1 year ago

Yep, we could combine the two solutions I proposed. Have it as part of index defaults and have an override in add_documents and search. This would mean that models from the registry could have them provided out of the box, users could set them for the index with custom models, and you could override them in the code if you wanted.

How did you envisage the override? I think adding parameters to the add_documents and search is sensible. Perhaps we start with just the prefix and drop the suffix for the moment?

pandu-k commented 1 year ago

are there any models that expect a suffix? if not then we can leave it out of scope until a need arises.

Yep I think the overrides as add_documents or search parameters, as you suggested, is fine

OwenPendrighElliott commented 1 year ago

Not that I am aware of, lets leave it out.

pandu-k commented 1 year ago

Proposal

For models in the model registry

The default text_query_prefix and text_chunk_prefix are defined in the model_registry entry, and so are not immediately visible to non-power users:

'SOME/MODEL':  {
    "name": "SOME/MODEL",
    "dimensions": 512,
    "type": "clip",
    "text_query_prefix": "query:",
    "text_chunk_prefix": "passage:"
 }

Because this is added to the registry, index settings will not change::

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "SOME/MODEL",
        "normalize_embeddings": true
        "text_preprocessing": {

        }
    },

}

By default, you do not need to specify the prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "query: What is the best outfit to wear on the moon?"
}

But you can override the default prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"],
    text_chunk_prefix="this is a passage: ",
)

# search:
search_res = my_index.search(
    q="interesting books",
    text_query_prefix="This is a question: "
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "this is a passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "This is a question: What is the best outfit to wear on the moon?"
}

For generic models

If you really want to change the defaults prefixes on a registry model, you should load it using the generic model method.

Index settings:

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": true,
        "model": "my-model-alias",
        "model_properties": {
            "name": "SOME/MODEL",
            "dimensions": 512,
            "type": "clip",
            "text_query_prefix": "BEGIN QUERY:",
            "text_chunk_prefix": "BEGIN PASSAGE:"
        },
        "normalize_embeddings": True,
    }
}

Other behaviour is the same as registry-defined models:

You can override these prefixes at add documents and search time using text_query_prefix and text_chunk_prefix
These prefixes are visible in highlights and the search response object

pandu-k commented 1 year ago

Update to response: remove prefix from search responses. Instead a dedicated field is used to communicate this.

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,
      "_text_chunk_prefix": "passage: ",
      "_highlights": {"Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,
   "textQueryPrefix": "query: ",
   "query": "What is the best outfit to wear on the moon?"
}

Edge cases:

If the model is specified with a non-empty string prefix, then the customer overrides it with an empty string, the prefix key will still end up in the results as a
If these prefixes are defined as empty strings, they will not be returned in the response
how does Marqo know what the prefix was for a document, if it is added at add_docs time? We can't really parse it out

pandu-k commented 1 year ago

Decision:

No overriding of prefixes

vicilliar commented 12 months ago

@pandu-k clarifying your last comment: We will no longer be allowing overriding of prefixes at index creation time? Or at add docs/search time? Or both?

vicilliar commented 12 months ago

Another issue: it feels unintuitive that in the response, the field name for a doc chunk prefix is _text_chunk_prefix, but the name for search prefix is textQueryPrefix, especially since they're worded in the same way and users will always use snake case to define their prefixes.

vicilliar commented 12 months ago

Under the edge case:

If these prefixes are defined as empty strings, they will not be returned in the response

This means defined in the model registry right? In this case, if the model registry explicitly defines the prefix as empty string "", I feel it should be returned in the response as "".

vicilliar commented 12 months ago

Clarifications:

One argument for adding prefix to index settings instead of model registry properties: could a model expect 2 different prefixes? for different situations?
Why don’t we put prefix in the text_preprocessing portion of the index settings object?
To override default prefix for a registry model , the proposal mentions to load it using the generic model method (with model properties). Does this mean you would always have to provide url or localpath? Feels unnecessary for a user to have to define all other mode properties just to change the default prefix.
If a model in registry does NOT have prefix set, it should be fine, and ignored. Should be the same if prefix is “”.
Possible idea: add a hidden field in each document __text_chunk_prefix so that it can be parsed out upon search (from the highlights)
Multimodal: Do we add prefix to text in multimodal fields?
We only add search prefix in LEXICAL search, right? Do we omit prefix from the search response object if lexical also?

marqo-ai / marqo

[ENHANCEMENT] Support for prefixing and suffixing of text #585

Proposal

For models in the model registry

For generic models