Open OwenPendrighElliott opened 1 year ago
@OwenPendrighElliott thanks for this!
Is it possible to add this to the model registry, so that it is hidden from the main add_docs/search workflows? Then, we can override these values by user specified values if needed?
Yep, we could combine the two solutions I proposed. Have it as part of index defaults and have an override in add_documents
and search
. This would mean that models from the registry could have them provided out of the box, users could set them for the index with custom models, and you could override them in the code if you wanted.
How did you envisage the override? I think adding parameters to the add_documents
and search
is sensible. Perhaps we start with just the prefix and drop the suffix for the moment?
are there any models that expect a suffix? if not then we can leave it out of scope until a need arises.
Yep I think the overrides as add_documents or search parameters, as you suggested, is fine
Not that I am aware of, lets leave it out.
The default text_query_prefix and text_chunk_prefix are defined in the model_registry entry, and so are not immediately visible to non-power users:
'SOME/MODEL': {
"name": "SOME/MODEL",
"dimensions": 512,
"type": "clip",
"text_query_prefix": "query:",
"text_chunk_prefix": "passage:"
}
Because this is added to the registry, index settings will not change::
{
"index_defaults": {
"treat_urls_and_pointers_as_images": True,
"model": "SOME/MODEL",
"normalize_embeddings": true
"text_preprocessing": {
}
},
}
By default, you do not need to specify the prefixes:
# add docs:
my_index.add_documents([
{"title": "Cool books" }
],
tensor_fields = ["title"]
)
# search:
search_res = my_index.search(
"interesting books"
)
pprint(search_res)
# search response:
{
"hits": [
{"title": "Cool book" },
"_id": "article_591",
"_score": 1.2387788,
# notice the text_chunk_prefix is present in highlight:
"_highlights": {"title": "passage: Cool book"}
}],
"limit": 10,
"offset": 0,
"processingTimeMs": 49,
# notice the text_query_prefix in the query:
"query": "query: What is the best outfit to wear on the moon?"
}
But you can override the default prefixes:
# add docs:
my_index.add_documents([
{"title": "Cool books" }
],
tensor_fields = ["title"],
text_chunk_prefix="this is a passage: ",
)
# search:
search_res = my_index.search(
q="interesting books",
text_query_prefix="This is a question: "
)
pprint(search_res)
# search response:
{
"hits": [
{"title": "Cool book" },
"_id": "article_591",
"_score": 1.2387788,
# notice the text_chunk_prefix is present in highlight:
"_highlights": {"title": "this is a passage: Cool book"}
}],
"limit": 10,
"offset": 0,
"processingTimeMs": 49,
# notice the text_query_prefix in the query:
"query": "This is a question: What is the best outfit to wear on the moon?"
}
If you really want to change the defaults prefixes on a registry model, you should load it using the generic model method.
Index settings:
{
"index_defaults": {
"treat_urls_and_pointers_as_images": true,
"model": "my-model-alias",
"model_properties": {
"name": "SOME/MODEL",
"dimensions": 512,
"type": "clip",
"text_query_prefix": "BEGIN QUERY:",
"text_chunk_prefix": "BEGIN PASSAGE:"
},
"normalize_embeddings": True,
}
}
Other behaviour is the same as registry-defined models:
Update to response: remove prefix from search responses. Instead a dedicated field is used to communicate this.
# add docs:
my_index.add_documents([
{"title": "Cool books" }
],
tensor_fields = ["title"]
)
# search:
search_res = my_index.search(
"interesting books"
)
pprint(search_res)
# search response:
{
"hits": [
{"title": "Cool book" },
"_id": "article_591",
"_score": 1.2387788,
"_text_chunk_prefix": "passage: ",
"_highlights": {"Cool book"}
}],
"limit": 10,
"offset": 0,
"processingTimeMs": 49,
"textQueryPrefix": "query: ",
"query": "What is the best outfit to wear on the moon?"
}
Edge cases:
Decision:
@pandu-k clarifying your last comment: We will no longer be allowing overriding of prefixes at index creation time? Or at add docs/search time? Or both?
Another issue: it feels unintuitive that in the response, the field name for a doc chunk prefix is _text_chunk_prefix
, but the name for search prefix is textQueryPrefix
, especially since they're worded in the same way and users will always use snake case to define their prefixes.
Under the edge case:
If these prefixes are defined as empty strings, they will not be returned in the response
This means defined in the model registry right? In this case, if the model registry explicitly defines the prefix as empty string ""
, I feel it should be returned in the response as ""
.
Clarifications:
text_preprocessing
portion of the index settings object?url
or localpath
? Feels unnecessary for a user to have to define all other mode properties just to change the default prefix.“”
.__text_chunk_prefix
so that it can be parsed out upon search (from the highlights)
Is your feature request related to a problem? Please describe. Many new SOTA models for retrieval are trained with prefixes on the queries and documents, thus they expect these at inference as well. Failure to provide these degrades performance. Currently there is no way in Marqo to add a prefix or suffix to every chunk in a document as text pre-processing occurs internally.
I don't know of any models with a requirement for a specific suffix so the suffixing part can be removed if we don't want to pre-empt a use case that doesn't exist yet.
Describe the solution you'd like New optional parameters are added to
search
andadd_documents
to enable prefixes like so:When the text is chunked internally, each chunk should have the prefix added to it so that it can be tokenised and passed to the model to match the prefixing applied in training.
For queries the prefix is added to the front of every query that isn't a URL. It should be added to the front of every query in a weighted query.
Describe alternatives you've considered These could be part of the index defaults however if a model appears in the future with different prefixes for different sub tasks (or something similar) then this won't be a good approach because it locks you in. One nice thing though is that this approach means users won't forget to add them and the correct prefix can me included in the default models.
Additional context
Represent this sentence for searching relevant passages:
as a prefix https://github.com/FlagOpen/FlagEmbeddingquery:
in front of queries andpassage:
in front of passages https://huggingface.co/intfloat/e5-largequery:
in front of queries andpassage:
in front of passages https://huggingface.co/intfloat/e5-large-v2