sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.93k stars 211 forks source link

Duplicate row samples fails with dotted field name for BQ nested column (struct) #1910

Open patricksurry opened 1 year ago

patricksurry commented 1 year ago

Given a BigQuery table t with a nested (struct) column offers.offer_id we can write checks normally using dotted field names like duplicate_count(offers.offer_id) = 0.

The check itself runs successfully but if there are duplicates the query to collect sample records fails. The query (below) is generated by sql_get_duplicates(). The problem is that the frequencies CTE ends up with a column called offer_id rather than offers.offer_id so the join fails. Editing the join condition to main.offers.offer_id = frequencies.offer_id fixes it.

Dealing with dotted field names in general seems difficult.
I'll work around the issue using a view with aliases the columns I'm interested in, but it would be cool if soda could do this automatically. For example alias the dotted column to a 'safe' version, like SELECT offers.offer_id as offers_offer_id ...?


WITH 
frequencies AS (
    SELECT offers.offer_id
    FROM t
    WHERE offers.offer_id IS NOT NULL
    GROUP BY offers.offer_id
    HAVING count(*) > 1
)
SELECT main.*
FROM t main
JOIN frequencies ON main.offers.offer_id = frequencies.offers.offer_id
jmarien commented 1 year ago

SODA-1787