ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.37k stars 602 forks source link

bug: BIGQUERY backend generates invalid query when calling .distinct on subset on table with array column #10553

Open greg-offerfit opened 6 hours ago

greg-offerfit commented 6 hours ago

What happened?

Given a GBQ table with the following schema:

id: int
int_ids: array<!int64>  (int64 REPEATED)

Calling table.distinct(on=["id"]).execute() fails with the following error:

google.api_core.exceptions.BadRequest: 400 The argument to ARRAY_AGG must not be an array type but was ARRAY at [7:5]; reason: invalidQuery, location: query, message: The argument to ARRAY_AGG must not be an array type but was ARRAY at [7:5]

I expected this to generate a valid query and return a DataFrame.

See below comment for minimal reproduction.

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

google.api_core.exceptions.BadRequest: 400 The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [7:5]; reason: invalidQuery, location: query, message: The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [7:5]

Code of Conduct

greg-offerfit commented 6 hours ago

TO REPRODUCE:

Download the below json file and save it somewhere accessible by Python/Ibis prompt. The json file is just a single row that generates a GBQ table with the appropriate schema and values

Run the below code (or equivalent to you):

import ibis

con = ibis.bigquery.connect()  # INSERT ANY PROJECT/DATASET ARGS YOU NEED HERE
table = con.read_json(PATH_TO_DOWNLOADED_JSON_FILE)
dedupe = table.distinct(on=["id"]).execute()  # ERRORS HERE

The generated query is:

SELECT `t1`.`int_ids`, `t1`.`id` FROM (SELECT `t0`.`id`, ARRAY_AGG(`t0`.`int_ids` IGNORE NULLS LIMIT 1)[safe_offset(0)] AS `int_ids` FROM `TABLE` AS `t0` GROUP BY 1) AS `t1`

Which GBQ marks as invalid with the following error: The argument to ARRAY_AGG must not be an array type but was ARRAY<INT64> at [1:58]

array_sample.json