open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Numeric data overflow issue in Redshift profiler ingestion. #16697

Closed: mvpkarthik1 closed this issue 2 months ago

mvpkarthik1 commented 3 months ago

The value in the after_etl_ts column is 1694590129000000, and the column type is int8 in Redshift.

During profiler ingestion, OpenMetadata runs the query below.

SELECT AVG(after_etl_ts) AS mean,
       COUNT(after_etl_ts) AS valuesCount,
       COUNT(DISTINCT after_etl_ts) AS distinctCount,
       MIN(after_etl_ts) AS min,
       MAX(after_etl_ts) AS max,
       SUM(CASE WHEN after_etl_ts IS NULL THEN 1 ELSE 0 END) AS nullCount,
       STDDEV_POP(after_etl_ts) AS stddev,
       SUM(CAST(after_etl_ts AS BIGINT)) AS sum
FROM bronze.logbased5_tpch1_customer_copy
LIMIT 1;

The column contains 1694590129000000 in all 277087 rows.

-------------------------------------------- PROBLEM --------------------------------------------

sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) Numeric data overflow (addition)

DETAIL:


error: Numeric data overflow (addition)

code: 1058

context:

query: 18299531

location: numeric_bound.cpp:183

process: query0_103_18299531 [pid=27173]


[SQL: /* {"app": "OpenMetadata", "version": "1.2.5.0"} */

The after_etl_ts column is of type BIGINT (INT8 in Redshift), and the issue is indeed that summing a large value over a large number of rows exceeds the BIGINT limit. Given that every row in the column holds 1694590129000000 and there are 277,087 rows, the sum would be:

Sum = 1694590129000000 × 277087 = 469548895074223000000

This value exceeds the maximum allowable value for BIGINT in Redshift, which is 9,223,372,036,854,775,807.
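For reference, this can be double-checked with a few lines of plain Python, using only the numbers reported above (the column value, the row count, and the signed 64-bit limit):

```python
# Back-of-the-envelope check that SUM(after_etl_ts) overflows a signed 64-bit BIGINT.
value = 1_694_590_129_000_000   # after_etl_ts value present in every row
rows = 277_087                  # row count of bronze.logbased5_tpch1_customer_copy
bigint_max = 2**63 - 1          # 9,223,372,036,854,775,807

total = value * rows
print(f"sum        = {total:,}")        # 469,548,895,074,223,000,000
print(f"BIGINT max = {bigint_max:,}")
print("overflows BIGINT:", total > bigint_max)   # True
```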

-------------------------------------------- SOLUTION --------------------------------------------

To avoid the Numeric data overflow (addition) error when dealing with such large sums, the large values need to be handled by casting to the DECIMAL type for the summation and the other aggregations. Here is the revised query:

SELECT AVG(CAST(after_etl_ts AS DECIMAL(38, 0))) AS mean,
       COUNT(after_etl_ts) AS valuesCount,
       COUNT(DISTINCT after_etl_ts) AS distinctCount,
       MIN(after_etl_ts) AS min,
       MAX(after_etl_ts) AS max,
       SUM(CASE WHEN after_etl_ts IS NULL THEN 1 ELSE 0 END) AS nullCount,
       STDDEV_POP(CAST(after_etl_ts AS DECIMAL(38, 0))) AS stddev,
       SUM(CAST(after_etl_ts AS DECIMAL(38, 0))) AS sum
FROM bronze.logbased5_tpch1_customer_copy
LIMIT 1;

The above query runs fine when executed directly against the database.

Casting to DECIMAL(38, 0) is what lets AVG, STDDEV_POP, and SUM handle large values without overflowing: DECIMAL(38, 0) supports numbers with up to 38 digits of precision, so even the very large total produced by SUM fits comfortably.
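On the OpenMetadata side, the same cast could be applied where the profiler builds its metric expressions with SQLAlchemy. The snippet below is only a sketch of the general pattern, not the actual profiler code (the real metric classes in the ingestion package are organized differently); it just shows how casting to NUMERIC(38, 0), which Redshift treats the same as DECIMAL(38, 0), keeps the aggregation from overflowing:

```python
# Sketch (SQLAlchemy 1.4+ style), not the actual OpenMetadata profiler code:
# cast the BIGINT column to NUMERIC(38, 0) inside the aggregates so Redshift
# does not overflow on addition.
from sqlalchemy import BigInteger, Column, MetaData, Numeric, Table, cast, func, select

metadata = MetaData()
customer_copy = Table(
    "logbased5_tpch1_customer_copy",
    metadata,
    Column("after_etl_ts", BigInteger),
    schema="bronze",
)

col = customer_copy.c.after_etl_ts
profile_query = select(
    func.avg(cast(col, Numeric(38, 0))).label("mean"),
    func.count(col).label("valuesCount"),
    func.sum(cast(col, Numeric(38, 0))).label("sum"),
).limit(1)

# Compiles to AVG(CAST(after_etl_ts AS NUMERIC(38, 0))), COUNT(after_etl_ts),
# SUM(CAST(after_etl_ts AS NUMERIC(38, 0))) FROM bronze.logbased5_tpch1_customer_copy LIMIT 1
print(profile_query)
```

Casting only inside the aggregate leaves the column itself as BIGINT, so metrics such as min, max, and the counts are unaffected.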

ayush-shah commented 3 months ago

@mvpkarthik1 I don't think STDDEV_POP requires any casting. I tried it with the biggest integers possible in my data, and STDDEV_POP did not fail.
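For anyone who wants to repeat that check against their own cluster, a small psycopg2 script along these lines works (host, database, and credentials below are placeholders):

```python
# Reproduce the STDDEV_POP check against Redshift: aggregate over max-BIGINT
# values without any cast and see whether the query raises.
import psycopg2

conn = psycopg2.connect(
    host="<redshift-host>", port=5439,
    dbname="<database>", user="<user>", password="<password>",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT STDDEV_POP(v) AS stddev
        FROM (SELECT 9223372036854775807::BIGINT AS v
              UNION ALL
              SELECT 9223372036854775806::BIGINT) t
    """)
    print(cur.fetchone())
```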