[ENHANCEMENT] Wildcard support

marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

https://www.marqo.ai/

Apache License 2.0

4.32k stars 182 forks

[ENHANCEMENT] Wildcard support #770

Open jess-lord opened 4 months ago

jess-lord commented 4 months ago

Is your feature request related to a problem? Please describe.
Marqo 1.4 supported wildcards in the query string, which we relied on to do metadata-only filters and queries.

Describe the solution you'd like
Please support wildcard queries again.

Describe alternatives you've considered
The only alternative for us is to stay on Marqo 1.x.

Additional context
This worked in Marqo 1.4, but 2.2 does not return the records. These are metadata records that have no content:

{"q": "*", "filter": "tag:_summary", "searchMethod": "LEXICAL"}

A workaround here is to set the query to _summary, but that doesn't work for the next example:

{"q": "*", "filter": "NOT topic:(Trolling) AND content:(Trolling)", "searchMethod": "LEXICAL", "searchableAttributes": ["content"]}

This used to work but now returns 0 results, and I can't set the query to Trolling because I need a literal match on that string (lexical is fuzzy and will return results for variants like Troll). The records do not have any value set for their topic attribute. content is a tensor field in a structured index that was also configured with the lexical and filter features at index creation.

farshidz commented 2 months ago

Hi @jess-lord. Thanks for raising this issue. Is your requirement to search based only on a filter, with no query? Or do you intend to use the wildcard as part of a string, e.g. q="somevalue*" for a prefix search?

In the meantime, I believe having q="Trolling" with a filter could in fact give you the desired outcome. Your query might match content=Troll due to linguistic processing (stemming), but the filter will eliminate those results.

Here's an example I just tried

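# ix is assumed to be an existing Marqo index handle,
# e.g. marqo.Client(url="http://localhost:8882").index("my-index")
ix.add_documents(
    documents=[
        {
            '_id': '1',
            'title': 'Trolling',
            'topic': 'Fun',
        },
        {
            '_id': '2',
            'title': 'Troll',
            'topic': 'Fun',
        }
    ],
    tensor_fields=[]
)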

response = ix.search(q='Trolling', limit=10, search_method="lexical", filter_string='NOT topic:(Trolling) AND title:(Trolling)')

response['hits']

Results:

[{'title': 'Trolling',
  'topic': 'Fun',
  '_id': '1',
  '_score': 0.1823215567939546,
  '_highlights': []}]

As you can see, this didn't return the document with title=Troll.

jess-lord commented 1 month ago

@farshidz Thanks for looking into this. I'm looking for exact token matches, so "troll" should match "the troll under the bridge" but not "the trolling of online forums". The use case is to search Marqo document content for important keywords that need an exact match, so the filter would target the "content" property of the documents. Maybe a more abstract example is easier:

ix.add_documents(
    documents=[
        {
            '_id': '1',
            'content': 'lorem ipsum abc1 lorem',
            'topic': '',
        },
        {
            '_id': '2',
            'content': 'lorem ipsum abc110 lorem',
            'topic': '',
        }
    ],
    tensor_fields=['content']
)

In this example my objective is to filter the index for docs with content of abc1, and tag all matching results with a topic of genreA, and tag docs containing abc110 with genreB. When filtering for "abc1" I don't want to get this second document.

farshidz commented 1 week ago

@jess-lord since Marqo 2.7, you can search with q="*" as you did with Marqo 1 (searching using only your filters). However, this doesn't immediately enable exact matching of a token within a string, because:

- Lexical (inverted) indexes (the lexical_search feature) store processed/stemmed tokens.
- Filter indexes (the filter feature) can only exact-match the full string, for efficiency reasons.
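
To make the gap concrete, here is a toy illustration in plain Python. It is not how Marqo matches internally and uses no Marqo calls. It only contrasts full-string equality (roughly what a filter on a text field does, per the point above) with the whole-token match asked for in this issue. The substring case is included purely for contrast.

```python
# Toy illustration only: plain Python, not Marqo's matching code.
docs = {
    "1": "lorem ipsum abc1 lorem",
    "2": "lorem ipsum abc110 lorem",
}
term = "abc1"

# Filter-style match: the term must equal the whole stored string.
full_string = [doc_id for doc_id, text in docs.items() if text == term]

# Naive substring match (for contrast): also hits "abc110".
substring = [doc_id for doc_id, text in docs.items() if term in text]

# Whole-token match: the behaviour requested in this issue.
whole_token = [doc_id for doc_id, text in docs.items() if term in text.split()]

print(full_string)  # []
print(substring)    # ['1', '2']
print(whole_token)  # ['1']
```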

The best workaround I can think of is to split your text (content in the example above) based on whitespace to create a list and store this as an array<string> field in Marqo (if using an unstructured index, just pass the list as a document field and the type will be inferred). Then searching with q="*" and filter_string="content:abc1" will achieve what you want.
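
A minimal sketch of that workaround, using the documents from the example earlier in this thread. Several things here are assumptions for illustration: the token list is stored in a separate, hypothetical field named content_tokens (so the original content field can remain a tensor field), ix is assumed to be an existing Marqo index handle on an unstructured index running Marqo 2.7 or later, and the helper names are made up.

```python
def with_token_field(doc, source="content", target="content_tokens"):
    """Return a copy of `doc` with an extra list field of whitespace-split tokens."""
    enriched = dict(doc)
    enriched[target] = doc[source].split()
    return enriched


docs = [
    {"_id": "1", "content": "lorem ipsum abc1 lorem"},
    {"_id": "2", "content": "lorem ipsum abc110 lorem"},
]
token_docs = [with_token_field(d) for d in docs]


def index_docs(ix):
    """Index the enriched documents; `ix` is assumed to be a marqo index handle."""
    return ix.add_documents(documents=token_docs, tensor_fields=["content"])


def search_exact_token(ix, token):
    """Filter-only search on the (hypothetical) token list field."""
    return ix.search(
        q="*",
        search_method="LEXICAL",
        filter_string=f"content_tokens:{token}",
    )
```

With an index handle available, calling index_docs(ix) and then search_exact_token(ix, "abc1") is the intended usage. Since the token list for document 2 contains abc110 but not abc1, the filter should match only document 1.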