RFC: allow fetching estimated count

rikinsk commented 4 years ago

As PG count queries can be really expensive for large datasets, an option to return estimated count can be added.

API signature

Option 1:

We could add a new aggregate field called estimatedCount with the following signature. A flag called estimate can be returned in the output type mentioning if returned count is an estimate or not.

estimatedCount(
  columns: [author_select_column!]
  distinct: Boolean
): EstimatedCountOutput

EstimatedCountOutput(
  count: Int
  estimate: Boolean
)

Option 2:

There can be a new boolean argument that the count aggegate field accepts called estimate which would allow returning an estimate count if the actual count query is expensive. A flag called estimate can be returned in the output type mentioning if returned count is an estimate or not.

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
): CountOutput

CountOutput(
  count: Int
  estimate: Boolean
)

This would be a neater API but would add a breaking change as output type of the count field changes. Any suggested workarounds for this?

Note: An equivalent should be added to the RQL API as well

Implementation logic

The following logic could be used to return the estimated count:

Example: Let's say we need to fetch the count on an author table with the filters id _lt 1000 and id _gt 500 applied

Step 1: Attempt to get actual count
- Create a count query with the appropriate filters added to the where clause
```
SELECT count(*) FROM author WHERE id < 1000 AND id > 500
```
- Execute the count query after setting a transaction level statement timeout of 2sec.
```
SET LOCAL statement_timeout = 2000; 
SELECT count(*) FROM author WHERE id < 1000 AND id > 500
```
  If the query to fetch actual count request completes return the actual count with the estimated flag as false.
Step 2: Get estimated count
- Create a select query with the appropriate filters added to the where clause.
```
SELECT * FROM author WHERE id < 1000 AND id > 500
```
- Run the query with explain.
```
EXPLAIN SELECT * FROM author WHERE id < 1000 AND id > 500
```
- Extract the rows count from the response via the regex: / rows=([[:digit:]]+)/
```
QUERY PLAN
------------------------------------------------------
Seq Scan on author  (cost=0.00..35.00 rows=105 width=4)
Filter: ((id < 1000) AND (id > 500))
```
  will give estimated row count as 105
Return the estimated count with the estimated flag as true.

See https://wiki.postgresql.org/wiki/Count_estimate for more info

tirumaraiselvan commented 4 years ago

I think adding this as an optional argument inside count field makes more sense than having a top-level field because it is a special case of count (nothing fundamentally different, just different execution).

Also, I think approximate may be better than estimate but that is a minor point.

tirumaraiselvan commented 4 years ago

Why do we need to change the type of count field. Int is fine, right?

rikinsk commented 4 years ago

Why do we need to change the type of count field. Int is fine, right?

Depending on whether the actual count query succeeds or not, the returned count can be the actual count or an estimated count. I would like that information to be passed in the API response as I might need it to make a UI display decision (e.g. display Count = 100 vs Count = ~ 100 depending if the count is accurate/estimated). Hence the need to change the return type to include an "is_estimate" flag along with the count value

tirumaraiselvan commented 4 years ago

Ah ok, actually I was thinking we should only do step 2 . In which case the returned count is always estimate. It's a more predictable API (no irony intended).

If people want to try exact count and upon timeout, get estimate count, can they not do it from the client? Maybe by aborting the first request.

rikinsk commented 4 years ago

Sure but thats a lot more overhead for me as a user. I would prefer just letting the API know that I am ok with getting an estimate and let the API take care of the rest.

Also, aborting the request from the client may or may not abort the request on Postgres. Reasoning about these things is something I as a user wouldnt want to worry about.

tirumaraiselvan commented 4 years ago

What if I want to customize the timeout to be 5 seconds?

rikinsk commented 4 years ago

From some basic testing, 2 seconds was enough to return counts for tables with a few million rows (depends on the size of rows of course). Hence I am not sure if making the timeout configureable would be very important.

If we do want it to be configurable, we could probably add another optional argument to the API which will be respected only if the estimate argument is true, and add an ENV flag, e.g. HASURA_GRAPHQL_COUNT_TIMEOUT to set the value globally.

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
  countTimeout: Int
): CountOutput

njaremko commented 4 years ago

I think the answer depends on how willing Hasura is to make a breaking change. I've heard Tanmai say in videos that you're very afraid of breaking peoples stuff, so if that's the case, it seems like the only options here are:

Option 1: Make a new aggregation

estimatedCount(
  columns: [author_select_column!]
  distinct: Boolean
  countTimeout: Int 
  # Should "seconds" be in the name? 
  # Should this be a input type of { unit, scalar }?
): EstimatedCountOutput

EstimatedCountOutput(
  count: Int
  isEstimate: Boolean
)

Pros:

You can tell people whether or not the given count is actually an estimate
It's a bit more customizable (you can say "give me an estimate if this takes more than 3 seconds")

Cons:

You have to add another aggregation
More customization could make it less intuitive (i.e. do people that know they want an estimate if this takes more that 3 seconds actually expect it to take less than that? Sometimes. Do you want to support this use case?)

Option 2: Existing type signature with a new optional argument

count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
): Int

Pros:

I think it's fine to just return an Int if estimate: true is specified. If I'm explicitly adding it to my count aggregation, I'm adding it because I know that I have lots of rows, and that getting the exact number is slow, so I'm expecting an estimate.
We don't have to allow them to specify a timeout, since we can just assume they want an estimate. Presumably you'd have some heuristic of 1 to 2 seconds (configurable with HASURA_GRAPHQL_COUNT_TIMEOUT) that would determine when an actual estimate happens, but the user would just assume they always get an estimate in this case.
It feels a bit more "discoverable", since people already familiar with Hasura's schema will reach for the count() aggregation, and autocomplete will expose the new estimate option with documentation. Though you could easily argue that estimatedCount would also autocomplete when people type "count".
You don't have to introduce another aggregation into the schema

Cons:

You don't get to tell people whether or not the answer is actually an estimate, but I've argued above that this might not be important.
As a consequence of the above con, you don't get to be fancy and ask for an estimate "if this takes more than 3 seconds"

Of those two options, I personally prefer the second.

lublak commented 1 year ago

any news about this? Option 2 (estimate: Boolean flag) would be awesome.

hasura / graphql-engine