Lookup aggregations hidden positives

konzz commented 4 years ago

With the lookup aggregations, we discovered a bug that hides positives.

In this example we will search for "Canada" in the lookup aggregations. Internally, the endpoint filters all the entities that have "Canada" in the "country" property and then gets the aggregations of "country". This is done in only one query.

The entities that have "Canada" may also have other countries, so we get a list of aggregations like this:

Ecuador 31
Canada 31
Peru 10
Argentina 7

We then filter for the ones matching the search term "Canada" and return them.
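
As a rough sketch of the flow described above (not the actual Uwazi code), the single query and the post-filter could look something like this. The field name `country`, the aggregation name `country_options`, and the helper `matching_options` are assumptions made only for illustration:

```python
# Hypothetical request body: one query that both filters and aggregates.
# "country" and "country_options" are illustrative names, not the real mapping.
search_term = "Canada"

request_body = {
    "size": 0,  # only the aggregation buckets are needed, not the hits
    "query": {"match": {"country": search_term}},
    "aggs": {
        "country_options": {
            "terms": {"field": "country", "size": 1000}
        }
    },
}


def matching_options(buckets, term):
    """Keep only the buckets whose key contains the search term (case-insensitive)."""
    return [b for b in buckets if term.lower() in b["key"].lower()]


# Buckets taken from the example listing above.
buckets = [
    {"key": "Ecuador", "doc_count": 31},
    {"key": "Canada", "doc_count": 31},
    {"key": "Peru", "doc_count": 10},
    {"key": "Argentina", "doc_count": 7},
]

print(matching_options(buckets, search_term))
```

The post-filter only sees the buckets that come back from the query, so anything cut off by the aggregation `size` never reaches it.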

This endpoint has an aggregation limit of 1000 possible options to avoid the bug, but let's say that with those same results you limit to 1. Then Elasticsearch will only return "Ecuador" and you won't find Canada.

Keep in mind that "Canada" will always be among the top results, since you are already filtering by entities that have "Canada". So with the 1000 limit, the bug can only happen if at least 1000 other options are also selected in every entity that has "Canada", making them tie for the top count like Ecuador does in the example: you would need another 1000 countries selected in all of those 31 results.
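
To make the truncation concrete, here is a small simulation in plain Python (no Elasticsearch involved). The `lookup` function is a hypothetical stand-in for the endpoint, and the tie order between equal counts is simply taken from the listing above, which is an assumption; the real tie-breaking may differ:

```python
def lookup(buckets, term, limit):
    """Simulate: keep the top `limit` buckets by count, then filter by the search term."""
    # sorted() is stable, so buckets with equal counts keep their input order
    # (assumed here to match the order shown in the example listing).
    top = sorted(buckets, key=lambda b: -b["doc_count"])[:limit]
    return [b["key"] for b in top if term.lower() in b["key"].lower()]


buckets = [
    {"key": "Ecuador", "doc_count": 31},
    {"key": "Canada", "doc_count": 31},
    {"key": "Peru", "doc_count": 10},
    {"key": "Argentina", "doc_count": 7},
]

print(lookup(buckets, "Canada", limit=1000))  # ['Canada']
print(lookup(buckets, "Canada", limit=1))     # [] : "Canada" was cut off before the filter ran
```

With the limit at 1, "Ecuador" takes the only slot and the match is lost. With the limit at 1000, the same loss needs 1000 options tied with "Canada" for the top count.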

RafaPolit commented 4 years ago

And how will this extrapolate to scenarios with more than 1000 aggregations? This is very likely if we have aggregations of Document Name found in Paragraphs, as is already the case in certain data models.

Does this mean the 1000 aggregation limit is also hiding positive results, even without a single search going on?

My main concern is, as I have voiced before, that I believe it is an error to assume everyone wants only the 'top hitting results'. What if I want to look into the documents with the least amount of paragraphs? For that, I either need to reach the results past the 1000 mark, or have a way to sort aggregations by different criteria.

txau commented 4 years ago

@RafaPolit I'm sure we can always think of "what if" scenarios not covered by our current feature set. What matters is how we can offer a good default and invest in that feature. If we have more time or we decide that it's a big pain point for users, we can look into the "what ifs".

RafaPolit commented 4 years ago

What? I am asking how we reach those aggregations past number 1000... for whatever reason that may be. This is not a "what if": Konz has pointed out that we only get 1000 aggregations. We already have data models that require more, and we took them away as a precaution to avoid performance issues. After the performance refactor, it would make sense to put them back.

txau commented 4 years ago

@RafaPolit I was referring to:

What if I want to look into the documents with the least amount of paragraphs?

While we can always improve what we have, we need to carefully choose where we spend our resources.

huridocs / uwazi

Lookup aggregations hidden positives #2873