sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.89k stars 208 forks source link

Soda core error in SUM aggregation: integer overflow in Bigquery #1860

Open louisnow opened 1 year ago

louisnow commented 1 year ago

When querying Bigquery

SELECT
    avg(`document_id`) as average
    , sum(`document_id`) as sum
    , variance(`document_id`) as variance
    , stddev(`document_id`) as standard_deviation
    , count(distinct(`document_id`)) as distinct_values
    , sum(case when `document_id` is null then 1 else 0 end) as missing_values
FROM certified_data.pipeline_bookings

We end up with a 400 Error in SUM aggregation: integer overflow error

jmarien commented 1 year ago

SODA-1718

louisnow commented 1 year ago

Hey @jmarien I really appreciate the quick response. I've updated to the latest version of soda core but the error hasn't gone away.

I did some more digging on the column. It's actually an int64 created via the farm_fingerprint function in bigquery

https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions

WITH example AS (
  SELECT 1 AS x, "foo" AS y, true AS z UNION ALL
  SELECT 2 AS x, "apple" AS y, false AS z UNION ALL
  SELECT 3 AS x, "" AS y, true AS z
)
SELECT
  *,
  FARM_FINGERPRINT(CONCAT(CAST(x AS STRING), y, CAST(z AS STRING)))
    AS row_fingerprint
FROM example;
/*---+-------+-------+----------------------*
 | x | y     | z     | row_fingerprint      |
 +---+-------+-------+----------------------+
 | 1 | foo   | true  | -1541654101129638711 |
 | 2 | apple | false | 2794438866806483259  |
 | 3 |       | true  | -4880158226897771312 |
 *---+-------+-------+----------------------*/
benjamin-pirotte commented 1 year ago

Hi! Do I understand you were able to find a workaround for it? Thanks!