hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0
31.17k stars 2.77k forks source link

RFC: allow fetching estimated count #5010

Open rikinsk opened 4 years ago

rikinsk commented 4 years ago

As PG count queries can be really expensive for large datasets, an option to return estimated count can be added.

API signature

Option 1:

We could add a new aggregate field called estimatedCount with the following signature. A flag called estimate can be returned in the output type mentioning if returned count is an estimate or not.

estimatedCount(
  columns: [author_select_column!]
  distinct: Boolean
): EstimatedCountOutput

EstimatedCountOutput(
  count: Int
  estimate: Boolean
)

Option 2:

There can be a new boolean argument that the count aggegate field accepts called estimate which would allow returning an estimate count if the actual count query is expensive. A flag called estimate can be returned in the output type mentioning if returned count is an estimate or not.

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
): CountOutput

CountOutput(
  count: Int
  estimate: Boolean
)

This would be a neater API but would add a breaking change as output type of the count field changes. Any suggested workarounds for this?

Note: An equivalent should be added to the RQL API as well

Implementation logic

The following logic could be used to return the estimated count:

Example: Let's say we need to fetch the count on an author table with the filters id _lt 1000 and id _gt 500 applied

See https://wiki.postgresql.org/wiki/Count_estimate for more info

tirumaraiselvan commented 4 years ago

I think adding this as an optional argument inside count field makes more sense than having a top-level field because it is a special case of count (nothing fundamentally different, just different execution).

Also, I think approximate may be better than estimate but that is a minor point.

tirumaraiselvan commented 4 years ago

Why do we need to change the type of count field. Int is fine, right?

rikinsk commented 4 years ago

Why do we need to change the type of count field. Int is fine, right?

Depending on whether the actual count query succeeds or not, the returned count can be the actual count or an estimated count. I would like that information to be passed in the API response as I might need it to make a UI display decision (e.g. display Count = 100 vs Count = ~ 100 depending if the count is accurate/estimated). Hence the need to change the return type to include an "is_estimate" flag along with the count value

tirumaraiselvan commented 4 years ago

Ah ok, actually I was thinking we should only do step 2 . In which case the returned count is always estimate. It's a more predictable API (no irony intended).

If people want to try exact count and upon timeout, get estimate count, can they not do it from the client? Maybe by aborting the first request.

rikinsk commented 4 years ago

Sure but thats a lot more overhead for me as a user. I would prefer just letting the API know that I am ok with getting an estimate and let the API take care of the rest.

Also, aborting the request from the client may or may not abort the request on Postgres. Reasoning about these things is something I as a user wouldnt want to worry about.

tirumaraiselvan commented 4 years ago

What if I want to customize the timeout to be 5 seconds?

rikinsk commented 4 years ago

From some basic testing, 2 seconds was enough to return counts for tables with a few million rows (depends on the size of rows of course). Hence I am not sure if making the timeout configureable would be very important.

If we do want it to be configurable, we could probably add another optional argument to the API which will be respected only if the estimate argument is true, and add an ENV flag, e.g. HASURA_GRAPHQL_COUNT_TIMEOUT to set the value globally.

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
  countTimeout: Int
): CountOutput
njaremko commented 4 years ago

I think the answer depends on how willing Hasura is to make a breaking change. I've heard Tanmai say in videos that you're very afraid of breaking peoples stuff, so if that's the case, it seems like the only options here are:

Option 1: Make a new aggregation

estimatedCount(
  columns: [author_select_column!]
  distinct: Boolean
  countTimeout: Int 
  # Should "seconds" be in the name? 
  # Should this be a input type of { unit, scalar }?
): EstimatedCountOutput

EstimatedCountOutput(
  count: Int
  isEstimate: Boolean
)

Pros:

Cons:

Option 2: Existing type signature with a new optional argument

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
): Int

Pros:

Cons:

Of those two options, I personally prefer the second.

lublak commented 1 year ago

any news about this? Option 2 (estimate: Boolean flag) would be awesome.