Closed Jeadie closed 1 year ago
- Add parameter non_tensor_fields to add_documents_orchestrator.
- Add parameter non_tensor_fields to add_documents.
- Add parameter non_tensor_fields to add_documents_mp and to __init__ of IndexChunk.
- Within IndexChunk.process, call add_documents with the non_tensor_fields value.
- Add the query parameter and pass it through to add_documents_orchestrator.
- In add_documents, update line 40 (if isinstance(field_content, (str, Image.Image)):) to exclude fields listed in non_tensor_fields.
- Add non_tensor_fields to add_documents. Default to empty list.
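The field check in the tasks above could look roughly like the sketch below. It is an illustration under assumptions, not the Marqo code: the helper name fields_to_vectorise and the example document are hypothetical, and only str is checked (the real condition also accepts Image.Image) so the example needs nothing beyond the standard library.

```python
from typing import Any, Dict, List, Optional


def fields_to_vectorise(
    doc: Dict[str, Any],
    non_tensor_fields: Optional[List[str]] = None,
) -> List[str]:
    """Return the names of the fields in `doc` that should get tensors.

    Hypothetical helper: string fields are vectorised unless they are
    named in `non_tensor_fields`, which defaults to an empty deny list.
    """
    skip = set(non_tensor_fields or [])
    return [
        name
        for name, content in doc.items()
        # Assumption: only str is checked here; images are left out of the sketch.
        if isinstance(content, str) and name not in skip
    ]


doc = {"title": "Red shoes", "sku": "AB-1234", "price": 20}
print(fields_to_vectorise(doc, non_tensor_fields=["sku"]))
```

With an empty deny list, both string fields would be selected, which matches the stated default of producing tensors for every field.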
Overview
Currently tensors are produced for all string and image fields. The fixed size of the stored tensors can lead to a significant increase in storage for text fields. There are many applications and use cases where certain fields need only be keyword/lexically searched.
This is a feature request to allow end users to highlight text fields (or partial sub-strings) to skip tensorisation.
Proposed Design
The proposed design is to allow users to specify, per document, whether a field should have a tensor created. By default every field will have a tensor; a deny list, passed per add_documents call, will specify the fields for which inference and tensorisation should be skipped.
Marqo Client
For the client, this would look like:
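As a minimal sketch (not the py-marqo implementation), a client call with the new kwarg might reduce to a request like the one built below. The helper name build_add_documents_request, the example index and field names, and the encoding of the list as repeated query keys are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import urlencode


def build_add_documents_request(
    index_name: str,
    documents: List[Dict[str, Any]],
    non_tensor_fields: Optional[List[str]] = None,
) -> Tuple[str, str]:
    """Build the path and JSON body for the documents POST (hypothetical helper)."""
    path = f"/indexes/{index_name}/documents"
    # Assumption: each denied field becomes one repeated query key.
    params = [("non_tensor_fields", field) for field in (non_tensor_fields or [])]
    if params:
        path += "?" + urlencode(params)
    return path, json.dumps(documents)


path, body = build_add_documents_request(
    "my-index",
    [{"title": "Red shoes", "sku": "AB-1234"}],
    non_tensor_fields=["sku"],
)
print(path)
```

Omitting non_tensor_fields leaves the path without a query string, so existing callers would be unaffected.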
Marqo
The kwarg non_tensor_fields will be a query parameter on the /indexes/{index_name}/documents POST. Since there are no index-level defaults, marqo instances will default to vectorising all fields, f, that don't have non_tensor_fields=f.
Alternative - Field Level DSL
An alternate design draws inspiration from special tokens within NLP (e.g. stop/start tokens) to let users designate non-tensor text on a per-character (and therefore per-field) basis.
Marqo Client
When adding documents to an index, a user can specify that no tensor should be made for text, via tokens:
Or with py-marqo library support:
Which supports embedding non-tensor text within a larger, tensor field.
This provides two forms of flexibility:
Marqo
Marqo is responsible for parsing and understanding the text DSL/tokens, chunking and constructing tensors for the appropriate text sections, and storing the tensors accordingly. Marqo must do two new things:
Storage
No changes to storage will be needed.
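The parsing step that this alternative asks of Marqo can be sketched as follows. The <nt>...</nt> markers are hypothetical placeholders, since the actual token syntax is not given here, and split_segments is an illustrative name rather than an existing function. Only the segments flagged True would be chunked and embedded, which is consistent with storage staying unchanged.

```python
import re
from typing import List, Tuple

# Hypothetical marker pair; the real DSL tokens are not specified in the proposal.
NO_TENSOR = re.compile(r"<nt>(.*?)</nt>", re.DOTALL)


def split_segments(text: str) -> List[Tuple[str, bool]]:
    """Split `text` into (segment, should_tensorise) pairs."""
    segments: List[Tuple[str, bool]] = []
    pos = 0
    for match in NO_TENSOR.finditer(text):
        if match.start() > pos:
            # Text before the marker is ordinary tensor text.
            segments.append((text[pos:match.start()], True))
        # Text inside the markers is kept for lexical search only.
        segments.append((match.group(1), False))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], True))
    return segments


print(split_segments("Red shoes <nt>SKU AB-1234</nt> in size 9"))
```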
Problems
Alternatives - Field level Denylist
An alternative approach is to specify text that should not be tensorised at a field level (instead of on a per-document basis). This would allow users to specify, on both index and document creation, whether a field should not be tensorised.
Marqo Client
On a new index
Or on a new document
As per the above add_documents example, both new and existing fields can be converted to non-tensor fields. Converting from non-tensor fields to tensor fields is not within the scope of this feature request as it would cause an unexpected search experience (a field that now has tensors to search over will miss the original documents that were not tensorised). non_tensor_fields, then, represents a one-way door on index fields.
Marqo
Non-tensor fields can be stored at an index level, specifically in the _meta.index_settings field from the _mappings call (see get_index_info). This information is available when adding documents. Marqo can then determine which fields to perform inference on. Cache staleness on index information only has the penalty of unneeded inference. When a field is marked as non-tensor, tensors can still be stored against it (i.e. this will avoid issues with rolling updates of index information on marqo instances).
Storage
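A minimal sketch of the index-level denylist, assuming the index settings are available as a plain dict with a hypothetical non_tensor_fields key. The helper name merge_non_tensor_fields is illustrative only. Fields are only ever added, reflecting the one-way nature of the conversion, and the return value indicates whether the index info would need to be written back.

```python
from typing import Any, Dict, Iterable


def merge_non_tensor_fields(
    index_settings: Dict[str, Any],
    new_fields: Iterable[str],
) -> bool:
    """Add `new_fields` to the denylist; return True if the settings changed."""
    # Assumption: the denylist lives under a "non_tensor_fields" key.
    current = set(index_settings.get("non_tensor_fields", []))
    merged = current | set(new_fields)  # union only: fields are never removed
    changed = merged != current
    index_settings["non_tensor_fields"] = sorted(merged)
    return changed


settings = {"non_tensor_fields": ["sku"]}
needs_update = merge_non_tensor_fields(settings, ["sku", "barcode"])
print(needs_update, settings["non_tensor_fields"])
```

Returning a flag keeps the common case cheap: when a call names no new fields, there is nothing to write back to the index.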
As mentioned above, whether a field should have tensors is denylisted. Further, it can be stored at an index level on the JSON structure key ._meta.index_settings. add_index already updates this attribute. If non_tensor_fields is non-empty with new fields, add_documents will have to update the index info.
Open Issues
Should treat_urls_and_pointers_as_images be configurable at a per-field level?