[ENHANCEMENT] Search Operation Should Return Multiple Highlights.

marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

https://www.marqo.ai/

Apache License 2.0


[ENHANCEMENT] Search Operation Should Return Multiple Highlights. #66

Open aryanagarwal9 opened 2 years ago

aryanagarwal9 commented 2 years ago

Is your feature request related to a problem? Please describe.
Currently, a search operation only returns one highlight for each indexed document.

Describe the solution you'd like
Get the option to specify the number of highlights to be returned for each indexed document.

Describe alternatives you've considered
None

Additional context
I am creating podcast demo code in which I index two documents; each document has the name of the podcast, a short description, and the full transcript. Whenever I perform a search operation, it returns just one highlight over the whole transcript. I think it would be good to have an option to return multiple highlights.
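To make the request concrete, here is a minimal client-side sketch of the behaviour being asked for. It assumes per-chunk similarity scores are already available as (document id, chunk text, score) triples; `top_highlights`, `num_highlights`, and the input shape are hypothetical names for illustration, not part of Marqo's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input shape: (document_id, chunk_text, similarity_score).
ChunkHit = Tuple[str, str, float]


def top_highlights(
    hits: List[ChunkHit], num_highlights: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep the `num_highlights` best-scoring chunks for each document."""
    if num_highlights < 1:
        raise ValueError("num_highlights must be at least 1")

    # Group every scored chunk under its parent document.
    by_doc: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, text, score in hits:
        by_doc[doc_id].append((text, score))

    # Sort each document's chunks by score, highest first, and truncate.
    return {
        doc_id: sorted(chunks, key=lambda c: c[1], reverse=True)[:num_highlights]
        for doc_id, chunks in by_doc.items()
    }
```

With `num_highlights=1` this reduces to the current one-highlight-per-document behaviour described above.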

edmuthiah commented 1 year ago

@pandu-k multiple highlights from multiple documents would be awesome. For example:

Imagine a query like 'what is the contract number, who are the signatories and summarise the scope of work'.

In Document 1, the contract number might be on page 1 and the signatories might be on page 100. Then the scope of work might be in Document 2 on page 12.

This would also be really useful for the style of questions which are:

'compare these two documents'
'what are the similarities and differences between these sections of these documents'
'what is the difference between version 1 and version 1.5'

jess-lord commented 1 year ago

@aryanagarwal9 Maybe I misunderstand, but couldn't a smaller document size get the job done? Can also add overlap on your chunks if worried about missing context. For example:

    "index_defaults": {
        "treat_urls_and_pointers_as_images": False,
        "model": "hf/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": True,
        "text_preprocessing": {
            "split_method": "sentence",
            "split_length": 2,
            "split_overlap": 1,
        }
    }

See: https://docs.marqo.ai/0.0.18/API-Reference/indexes/#text-preprocessing-object

So if your podcast transcript is 100 "pages", this might become 100 Marqo "documents", and within each of these documents there will be n "chunks" (aka facets) where, using the above settings, each chunk would be 2 sentences, with a stride/overlap of 1 sentence between them. We would then get 1 highlight per "page", which may be insufficient. But couldn't you just split your pages into something even smaller, such as paragraphs, to achieve the desired result?
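To illustrate what a split length of 2 sentences with an overlap of 1 produces, here is a rough sketch of sentence chunking. It is only an approximation for intuition: the regex-based `split_sentences` helper and the `chunk_sentences` function are hypothetical, and Marqo's own splitter may tokenise sentences and handle edge cases differently.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(
    sentences: List[str], split_length: int = 2, split_overlap: int = 1
) -> List[str]:
    """Group sentences into windows of `split_length`, sharing `split_overlap`."""
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    stride = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once a window has reached the final sentence.
        if start + split_length >= len(sentences):
            break
    return chunks
```

For four sentences A, B, C, D these defaults yield the chunks "A B", "B C" and "C D", so every sentence boundary is covered by some chunk.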

edmuthiah commented 1 year ago

Hey @jess-lord I don't think the above solution scales. If you have 30 PDFs with 100 pages each, you now have 3,000 documents that will each return a highlight. You then need to find the answer you are looking for amongst these 3,000 highlights using some other method, which defeats the original purpose of finding the actual highlight. If you were using an LLM, your token count/cost to process 3,000 highlights per query would be high too (if not exceeding the limit).
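The arithmetic behind that concern can be sketched as a back-of-envelope estimate. The figure of 2 sentences per highlight comes from the chunk settings quoted earlier in the thread; the 20 tokens per sentence is purely an assumed average, and `estimate_highlight_tokens` is a hypothetical helper.

```python
from typing import Tuple


def estimate_highlight_tokens(
    num_files: int,
    pages_per_file: int,
    sentences_per_highlight: int = 2,  # from the split_length discussed above
    tokens_per_sentence: int = 20,     # assumed average, not a measured value
) -> Tuple[int, int]:
    """Return (number of highlights, rough token count) for one query."""
    # One document per page, one highlight per document.
    highlights = num_files * pages_per_file
    tokens = highlights * sentences_per_highlight * tokens_per_sentence
    return highlights, tokens


highlights, tokens = estimate_highlight_tokens(30, 100)
print(highlights, tokens)  # 3000 120000
```

Under these assumptions a single query would push roughly 120,000 tokens of highlights to the LLM, which is the scale being objected to.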

jess-lord commented 1 year ago

@edmuthiah I was responding to the podcast use case, which I still think this covers because the facets can be retrieved independently of their "parent document". But for your use case (which I too am now bumping into) I agree. The only alternative I can come up with for the moment is tags and weighted queries.