marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

Aggregations [ENHANCEMENT] #476

Open pandu-k opened 1 year ago

pandu-k commented 1 year ago

Is your feature request related to a problem? Please describe.
There are limited aggregation options in Marqo.

Describe the solution you'd like
Min, max, sum, and mean of a field, plus a count of the unique values a field takes.

For example: the sum of a field across all docs in the index (perhaps with filtering).
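To make the requested semantics concrete, here is a minimal client-side sketch. It is not a Marqo API: it assumes the documents have already been retrieved as plain Python dicts, and the `aggregate` helper and the `price` and `category` fields are hypothetical names chosen for illustration.

```python
from statistics import mean


def aggregate(docs, field, predicate=lambda doc: True):
    """Min, max, sum, mean and unique count of `field` over docs passing `predicate`.

    Hypothetical helper: `docs` is a list of dicts already fetched from the index.
    """
    values = [doc[field] for doc in docs if field in doc and predicate(doc)]
    if not values:
        # Nothing matched the filter, so most statistics are undefined.
        return {"count": 0, "min": None, "max": None, "sum": 0, "mean": None, "unique": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": mean(values),
        "unique": len(set(values)),
    }


# Made-up sample documents, for illustration only.
docs = [
    {"_id": "a", "price": 10, "category": "shirt"},
    {"_id": "b", "price": 30, "category": "shirt"},
    {"_id": "c", "price": 30, "category": "hat"},
]

all_stats = aggregate(docs, "price")
# The same aggregation with a filter applied first.
shirt_stats = aggregate(docs, "price", lambda d: d["category"] == "shirt")
```

A built-in version would presumably run inside the engine rather than on the client, so that only the aggregate crosses the network.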

Describe alternatives you've considered
Doing the analysis in a different database. The downside is that this increases application complexity.

jess-lord commented 1 year ago

I think a unique set of values from a field would be useful too. For example:

doc1 tags: [red, blue]
doc2 tags: [blue]
doc3 tags: [yellow, blue]

mq.index.docs.tags().unique() -> [red, yellow, blue]
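The call above is a proposed interface, not something that exists today. As a rough sketch of the intended behaviour, assuming the docs are available client-side as plain dicts (the `unique_values` helper is a hypothetical name):

```python
def unique_values(docs, field):
    """Distinct values of a list-valued field, in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for doc in docs:
        for value in doc.get(field, []):
            seen.setdefault(value, None)
    return list(seen)


docs = [
    {"_id": "doc1", "tags": ["red", "blue"]},
    {"_id": "doc2", "tags": ["blue"]},
    {"_id": "doc3", "tags": ["yellow", "blue"]},
]

unique_tags = unique_values(docs, "tags")
```

The order of the result depends on document order here; the proposal itself only asks for the set of values.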

jess-lord commented 1 year ago

I am bumping into this requirement again and think I will have to start putting a special metadata/aggregation record into each of the Marqo indexes as a workaround. Now that I think about it, I will probably need to introduce another persistence layer altogether.

It's a little more complex than the above example because I need to do a group-by, e.g.:

source_pdf1 -> docs -> tags: [red, blue]
source_pdf2 -> docs -> tags: [blue]
source_pdf3 -> docs -> tags: [yellow, blue]

The goal is to count the number of PDFs that have docs with certain tags. The PDFs don't exist anymore; they are just another piece of metadata on the docs, but I hope the use case is clear.
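A minimal sketch of that grouped count, done client-side over docs held as plain dicts. The field names `source_pdf` and `tags` and the helper `count_sources_per_tag` are assumptions for illustration, not part of Marqo:

```python
from collections import defaultdict


def count_sources_per_tag(docs, source_field="source_pdf", tag_field="tags"):
    """For each tag, count the distinct sources that have at least one doc with it."""
    sources_by_tag = defaultdict(set)
    for doc in docs:
        for tag in doc.get(tag_field, []):
            # A set ensures a PDF with many matching docs is counted once.
            sources_by_tag[tag].add(doc[source_field])
    return {tag: len(sources) for tag, sources in sources_by_tag.items()}


docs = [
    {"source_pdf": "source_pdf1", "tags": ["red", "blue"]},
    {"source_pdf": "source_pdf1", "tags": ["blue"]},
    {"source_pdf": "source_pdf2", "tags": ["blue"]},
    {"source_pdf": "source_pdf3", "tags": ["yellow", "blue"]},
]

pdfs_per_tag = count_sources_per_tag(docs)
```

Collecting sources into a set per tag is what distinguishes this from a plain tag count: two docs from the same PDF contribute only once.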

sky-2002 commented 10 months ago

@pandu-k Couldn't we integrate a separate package, for example pandas or polars (for larger data), to handle these aggregation calls? These tools are designed specifically for that, so we could send or stream the data from Marqo to them and perform the aggregations there. Is this feasible?
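As an illustration of the pandas route, here is a small sketch that assumes the docs have already been exported from Marqo as a list of dicts. How they are fetched or streamed is left open, and the field names are made up for the example.

```python
import pandas as pd

# Hypothetical records exported from an index.
records = [
    {"source_pdf": "source_pdf1", "tags": ["red", "blue"]},
    {"source_pdf": "source_pdf2", "tags": ["blue"]},
    {"source_pdf": "source_pdf3", "tags": ["yellow", "blue"]},
]

df = pd.DataFrame(records)

# One row per (doc, tag) pair, so list-valued fields can be grouped on.
exploded = df.explode("tags")

# Number of distinct PDFs that have at least one doc carrying each tag.
pdfs_per_tag = exploded.groupby("tags")["source_pdf"].nunique()

# Unique set of tags across all docs.
unique_tags = sorted(exploded["tags"].unique())
```

The trade-off is that every matching document has to leave the index before it can be aggregated, which may be costly for large indexes.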