marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

[ENHANCEMENT] Non-vectorised Fields #157

Closed Jeadie closed 1 year ago

Jeadie commented 1 year ago

Overview

Currently, tensors are produced for all string and image fields. Because every stored tensor has a fixed size, this can significantly increase storage for text fields. There are many applications and use cases where certain fields only need keyword/lexical search.

This is a feature request to allow end users to mark text fields (or partial sub-strings) so that tensorisation is skipped for them.

Proposed Design

The proposed design is to allow users to specify, per document, whether a field should have a tensor created. By default, every field will have a tensor; a deny list, passed per add_documents call, specifies the fields for which inference and tensorisation should be skipped.

Marqo Client

For the client, this would look like:

mq.index("an-index").add_documents([
    {
        "Title": "Palamedes",
        "Description": "Palamedes is set in the days before King Arthur's reign, and describes the adventures of the fathers of Arthur, Tristan, Erec and other knights of Camelot.",
        "ReferenceLocation": "ISBN",
        "Author": "Rustichello da Pisa"
    }],
    non_tensor_fields=["Author", "ReferenceLocation"]
)

Marqo

The non_tensor_fields kwarg will be sent as a query parameter on the POST /indexes/{index_name}/documents request. Since there are no index-level defaults, Marqo instances will vectorise every field that is not listed in non_tensor_fields.
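
As a rough illustration of the deny-list behaviour described above, the selection could look something like the sketch below. The function name fields_to_vectorise and the decision to skip "_id" are assumptions made for this example; they are not part of Marqo.

```python
from typing import Any, Dict, Iterable, List


def fields_to_vectorise(
    doc: Dict[str, Any], non_tensor_fields: Iterable[str] = ()
) -> List[str]:
    """Return the names of the fields in `doc` that should still get a tensor.

    Hypothetical helper: every field is vectorised unless it is deny-listed.
    """
    denied = set(non_tensor_fields)
    # Assumption: "_id" is document metadata, so it is never vectorised.
    return [name for name in doc if name != "_id" and name not in denied]


doc = {
    "Title": "Palamedes",
    "Description": "Palamedes is set in the days before King Arthur's reign.",
    "ReferenceLocation": "ISBN",
    "Author": "Rustichello da Pisa",
}

# Only Title and Description remain candidates for inference.
selected = fields_to_vectorise(doc, ["Author", "ReferenceLocation"])
print(selected)
```

With an empty deny list the function returns every field, which matches the proposed default.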

Alternative - Field Level DSL

An alternative design draws inspiration from special tokens within NLP (e.g. stop/start tokens) to let users designate non-tensor text on a per-character (and therefore per-field) basis.

Marqo Client

When adding documents to an index, a user can specify that no tensor should be made for text, via tokens:

mq.index("an-index").add_documents([
    {
        "Title": "<marqo no_tensor>The Travels of Marco Polo </marqo no_tensor>",
        "Description": "A 13th-century travelogue describing Polo's travels",
    }
])

Or with py-marqo library support:

mq.index("an-index").add_documents([
    {
        "Title": marqo.without_tensor("The Travels of Marco Polo"),
        "Description": "A 13th-century travelogue describing Polo's travels",
    }
])

This also supports embedding non-tensor text within a larger tensor field:

mq.index("an-index").add_documents([
    {
        "Title": f"Start of the text is tensored. {marqo.without_tensor('The Travels of Marco Polo')} The end of the text is tensored.",
        "Description": "A 13th-century travelogue describing Polo's travels",
    }
])

This provides two forms of flexibility:

  1. A per-character designation of what should have a tensor.
  2. A token/syntax DSL for further text-based functionality (e.g. emphasisations, chunk control)
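
A minimal sketch of what the client-side helper could look like, assuming it simply wraps text in the tokens shown earlier. without_tensor does not exist in py-marqo today, and the constant names here are hypothetical.

```python
# Hypothetical token pair, copied from the proposal above.
NO_TENSOR_OPEN = "<marqo no_tensor>"
NO_TENSOR_CLOSE = "</marqo no_tensor>"


def without_tensor(text: str) -> str:
    """Wrap `text` in the proposed tokens so that no tensor is created for it."""
    return f"{NO_TENSOR_OPEN}{text}{NO_TENSOR_CLOSE}"


# Marking a sub-string inside a larger field that is otherwise tensorised.
title = (
    "Start of the text is tensored. "
    f"{without_tensor('The Travels of Marco Polo')} "
    "The end of the text is tensored."
)
print(title)
```

This sketch does not handle input that already contains the tokens; how to escape them would be an open design question.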

Marqo

Marqo is responsible for parsing and understanding the text DSL/tokens, chunking and constructing tensors for the appropriate text sections, and storing the tensors accordingly. Marqo must do two new things:

  1. Chunk text without the ignored text sections (and skip chunking & inference if the entire field is ignored)
  2. Save the full text for lexical search support.
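
The two steps above could be sketched roughly as follows, assuming the token syntax from the client examples. split_for_indexing is a hypothetical name, and the whitespace normalisation is an illustrative choice rather than specified behaviour.

```python
import re
from typing import Tuple

# Matches one marked span; the token syntax is taken from the proposal above.
NO_TENSOR_SPAN = re.compile(r"<marqo no_tensor>(.*?)</marqo no_tensor>", re.DOTALL)


def split_for_indexing(text: str) -> Tuple[str, str]:
    """Return (tensor_text, lexical_text) for one field value.

    tensor_text has the marked spans removed and is what would be chunked and
    embedded. lexical_text keeps the marked content but drops the tokens.
    """
    tensor_text = " ".join(NO_TENSOR_SPAN.sub(" ", text).split())
    lexical_text = NO_TENSOR_SPAN.sub(lambda match: match.group(1), text)
    return tensor_text, lexical_text


partial = (
    "Start of the text is tensored. "
    "<marqo no_tensor>The Travels of Marco Polo</marqo no_tensor> "
    "The end of the text is tensored."
)
tensor_text, lexical_text = split_for_indexing(partial)

# A fully marked field yields an empty tensor_text, which a caller could
# treat as the signal to skip chunking and inference altogether.
whole = "<marqo no_tensor>The Travels of Marco Polo </marqo no_tensor>"
whole_tensor_text, whole_lexical_text = split_for_indexing(whole)
```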

Storage

No changes to storage will be needed.

Problems

Alternative - Field Level Denylist

An alternative approach is to specify which text should not be tensorised at the field level (instead of on a per-document basis). This would allow users to specify, on both index creation and document creation, whether a field should not be tensorised.

Marqo Client

On a new index

mq.create_index("new-index", {
    "non_tensor_fields": ["Title", "another_text_field"],

    # Other settings, as before
    "model":"ViT-L/14"
})

# Only `Description` field will have an associated tensor.
mq.index("new-index").add_documents([
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing Polo's travels",
        "another_text_field": "ISBN"
    },
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection, "
                       "mobility, life support, and communications for astronauts",
        "_id": "article_591",
        "another_text_field": "online"
    }
])

Or on a new document

mq.index("new-index").add_documents([
    {
        "Title": "Palamedes",
        "Description": "Palamedes is set in the days before King Arthur's reign, and describes the adventures of the fathers of Arthur, Tristan, Erec and other knights of Camelot.",
        "another_text_field": "ISBN",
        "new_text_field": "Rustichello da Pisa"
    }],
    non_tensor_fields=["another_text_field", "new_text_field"] # This applies for all documents going forward
)

As the add_documents example above shows, both new and existing fields can be converted to non-tensor fields. Converting non-tensor fields back to tensor fields is outside the scope of this feature request, because it would cause an unexpected search experience: tensor search over such a field would miss the original documents that were never tensorised. non_tensor_fields therefore represents a one-way door on index fields.

Marqo

Non-tensor fields can be stored at the index level, specifically in the _meta.index_settings field returned by the _mappings call (see get_index_info). This information is available when adding documents, so Marqo can determine which fields to perform inference on. A stale cache of index information only carries the penalty of unneeded inference: when a field is marked as non-tensor, tensors can still be stored against it, which avoids issues with rolling updates of index information across Marqo instances.

Storage

As mentioned above, the fields that should not have tensors are recorded in a deny list. The list can be stored at the index level under the JSON key ._meta.index_settings. add_index already updates this attribute. If non_tensor_fields contains fields that are not yet recorded, add_documents will have to update the index info.
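
One way the add_documents update could behave is sketched below, using a plain dict as a stand-in for the contents of ._meta.index_settings. merge_non_tensor_fields is a hypothetical name. The merge only ever adds fields, in line with the one-way door described above.

```python
from typing import Any, Dict, Iterable


def merge_non_tensor_fields(
    index_settings: Dict[str, Any], requested: Iterable[str]
) -> bool:
    """Add newly requested non-tensor fields to the index settings in place.

    Returns True if the settings changed, i.e. if the stored index info
    would need to be written back. Fields are never removed.
    """
    current = list(index_settings.get("non_tensor_fields", []))
    changed = False
    for field in requested:
        if field not in current:
            current.append(field)
            changed = True
    index_settings["non_tensor_fields"] = current
    return changed


settings = {
    "model": "ViT-L/14",
    "non_tensor_fields": ["Title", "another_text_field"],
}

# "new_text_field" is new, so the index info would need updating.
first_call_changed = merge_non_tensor_fields(
    settings, ["another_text_field", "new_text_field"]
)
# Repeating the same request changes nothing, so no write is needed.
second_call_changed = merge_non_tensor_fields(
    settings, ["another_text_field", "new_text_field"]
)
```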

Open Issues

Jeadie commented 1 year ago

Required Changes

marqo-ai/marqo

marqo-ai/py-marqo

marqo-ai/marqodocs