Open rikinsk opened 4 years ago
I think adding this as an optional argument inside the count field makes more sense than having a top-level field, because it is a special case of count (nothing fundamentally different, just different execution).
Also, I think approximate may be better than estimate, but that is a minor point.
Why do we need to change the type of the count field? Int is fine, right?
> Why do we need to change the type of count field. Int is fine, right?
Depending on whether the actual count query succeeds or not, the returned count can be the actual count or an estimated count. I would like that information to be passed in the API response, as I might need it to make a UI display decision (e.g. display Count = 100 vs Count = ~ 100 depending on whether the count is accurate or estimated). Hence the need to change the return type to include an "is_estimate" flag along with the count value.
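A minimal Python sketch of that display decision. The function name format_count and the response shape (a count plus an is_estimate flag) are assumptions for illustration, not an existing Hasura API.

```python
def format_count(count: int, is_estimate: bool) -> str:
    """Render a count for display, prefixing estimated values with '~'."""
    # Hypothetical helper: is_estimate mirrors the proposed flag in the response.
    prefix = "~ " if is_estimate else ""
    return f"Count = {prefix}{count}"


print(format_count(100, False))
print(format_count(100, True))
```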
Ah ok, actually I was thinking we should only do step 2, in which case the returned count is always an estimate. It's a more predictable API (no irony intended).
If people want to try exact count and upon timeout, get estimate count, can they not do it from the client? Maybe by aborting the first request.
Sure, but that's a lot more overhead for me as a user. I would prefer just letting the API know that I am OK with getting an estimate and letting the API take care of the rest.
Also, aborting the request from the client may or may not abort the request on Postgres. Reasoning about these things is something I as a user wouldn't want to worry about.
What if I want to customize the timeout to be 5 seconds?
From some basic testing, 2 seconds was enough to return counts for tables with a few million rows (this depends on the size of the rows, of course). Hence I am not sure that making the timeout configurable would be very important.
If we do want it to be configurable, we could probably add another optional argument to the API which will be respected only if the estimate argument is true, and add an ENV flag, e.g. HASURA_GRAPHQL_COUNT_TIMEOUT, to set the value globally.
count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
  countTimeout: Int
): CountOutput
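One way the precedence between the per-query argument and the global flag could work, sketched in Python. The function name resolve_count_timeout, the unit (seconds), and the 2-second default are assumptions drawn from this discussion, not existing Hasura behaviour.

```python
import os
from typing import Mapping, Optional

# Assumed default, based on the 2-second figure mentioned in this thread.
DEFAULT_COUNT_TIMEOUT_SECONDS = 2


def resolve_count_timeout(
    count_timeout: Optional[int],
    estimate: bool,
    env: Mapping[str, str] = os.environ,
) -> Optional[int]:
    """Return the timeout in seconds, or None when no estimate was requested."""
    if not estimate:
        # countTimeout is respected only if the estimate argument is true.
        return None
    if count_timeout is not None:
        # The per-query argument wins over the global setting.
        return count_timeout
    env_value = env.get("HASURA_GRAPHQL_COUNT_TIMEOUT")
    if env_value is not None:
        return int(env_value)
    return DEFAULT_COUNT_TIMEOUT_SECONDS
```

With this ordering, a query that passes countTimeout: 5 uses 5 seconds regardless of the environment, and a query that omits estimate never gets a timeout applied.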
I think the answer depends on how willing Hasura is to make a breaking change. I've heard Tanmai say in videos that you're very afraid of breaking people's stuff, so if that's the case, it seems like the only options here are:
estimatedCount(
  columns: [author_select_column!]
  distinct: Boolean
  countTimeout: Int
  # Should "seconds" be in the name?
  # Should this be an input type of { unit, scalar }?
): EstimatedCountOutput
EstimatedCountOutput(
  count: Int
  isEstimate: Boolean
)
Pros:
Cons:
count(
  columns: [author_select_column!]
  distinct: Boolean
  estimate: Boolean
): Int
Pros:
Int if estimate: true is specified. If I'm explicitly adding it to my count aggregation, I'm adding it because I know that I have lots of rows, and that getting the exact number is slow, so I'm expecting an estimate.
HASURA_GRAPHQL_COUNT_TIMEOUT) that would determine when an actual estimate happens, but the user would just assume they always get an estimate in this case.
count() aggregation, and autocomplete will expose the new estimate option with documentation. Though you could easily argue that estimatedCount would also autocomplete when people type "count".
Cons:
Any news about this? Option 2 (estimate: Boolean flag) would be awesome.
As PG count queries can be really expensive for large datasets, an option to return an estimated count can be added.
API signature
Option 1:
We could add a new aggregate field called estimatedCount with the following signature. A flag called estimate can be returned in the output type, indicating whether the returned count is an estimate or not.
Option 2:
There can be a new boolean argument called estimate that the count aggregate field accepts, which would allow returning an estimated count if the actual count query is expensive. A flag called estimate can be returned in the output type, indicating whether the returned count is an estimate or not.
This would be a neater API but would be a breaking change, as the output type of the count field changes. Any suggested workarounds for this?
Note: An equivalent should be added to the RQL API as well.
Implementation logic
The following logic could be used to return the estimated count:
Example: Let's say we need to fetch the count on an author table with the filters id _lt 1000 and id _gt 500 applied.
Step 1: Attempt to get actual count
count query with the appropriate filters added to the where clause
If the query to fetch the actual count completes, return the actual count with the estimated flag as false.
Step 2: Get estimated count
select query with the appropriate filters added to the where clause.
/ rows=([[:digit:]]+)/ will give estimated row count as 105
Return the estimated count with the estimated flag as true.
See https://wiki.postgresql.org/wiki/Count_estimate for more info