runt18 / google-bigquery

Automatically exported from code.google.com/p/google-bigquery
0 stars 0 forks source link

count(*) behaves unpredictably with repeated fields #488

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

select
count(*)
,count(0)
,count(payload.shas.message)

from [bigquery-public-data:samples.github_nested]

What is the expected output? What do you see instead?

I expect count(*) to return the number of records, and thus to always have the 
same value as count(0). In the above example, they differ. Moreover, if I 
comment out count(payload.shas.message), they become the same. Thus, the value 
of count(*) depends on the inclusion of other statements in the select-list, 
which is in itself undesirable. What seems to happen is that count(*) counts 
repeated values, but only if they are referenced at some point in the query. In 
practice, this makes count(*) unpredictable and thus to be avoided. This 
behaviour is completely undocumented.

Proposed solution:

count(*) should always return the same value as count(0). If people want to 
count some repeated field, they can do so in a controlled and predictable 
manner by applying flatten.

Note: this has been discussed on StackOverflow by me and others

http://stackoverflow.com/questions/32483766/how-exactly-is-the-value-of-count-de
termined-in-bigquery
http://stackoverflow.com/questions/35004336/implicit-flattening-in-bigquery?lq=1
http://stackoverflow.com/questions/33604080/bigquery-flattens-when-using-field-w
ith-same-name-as-repeated-field?lq=1

Original issue reported on code.google.com by oliver.u...@gmail.com on 5 Apr 2016 at 8:54

GoogleCodeExporter commented 8 years ago
This cannot be changed in the current query language, since too much depends on 
the existing behavior. As with most SQL issues, the new dialect (issue 448) has 
a more principled approach.

This behavior is documented somewhat obliquely in 
https://cloud.google.com/bigquery/query-reference under COUNT(*), where it says 
that it counts the number of *values* in the scope of the function. I'll file a 
bug internally to clarify our documentation.

Original comment by jcon...@google.com on 5 Apr 2016 at 4:25