ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.3k stars 537 forks source link

bug(bigquery): different name of column in sql code and table.columns #9112

Open jitingxu1 opened 2 weeks ago

jitingxu1 commented 2 weeks ago

What happened?

Here is the code could be used to reproduce the error:


import ibis
ibis.options.interactive = True

con = ibis.bigquery.connect(project_id="xxx", dataset_id="xxx")

t = ibis.memtable({
    "x": ["a", "b"],
    "y": ["0", "1"]
})
con.create_table(
    "t", t.to_pyarrow(), overwrite=True
)

t = con.table("t")
t = t.mutate(
    x1=ibis.literal("c"),
    y=_.y.cast(dt.int32),
)

x = t.drop("y")
y = t.y.cast(dt.int64)

when I run

y.as_table().columns

I got ['Cast(y, int64)']

But the name in sql code is different in ibis.to_sql(y)- name in sql code is Cast_y_ int64

SELECT
  CAST(CAST(`t0`.`y` AS INT64) AS INT64) AS `Cast_y_ int64`
FROM `voltrondata-demo`.`nycflights13`.`t` AS `t0`

This will produce errors in ibisml, becuase ibisml use the table's column name to extract target - https://github.com/ibis-project/ibis-ml/blob/main/ibisml/core.py#L135

import ibisml as ml

step = ml.TargetEncode(["x"])
table, targets, index = ml.core.normalize_table(x, y)
print(targets)
metadata = ml.core.Metadata(targets=targets)
step.fit_table(table,metadata)
step.transform_table(x)

Errror

Invalid field name "Cast(y, int64)_f82688".

***Updated the error message**

400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/voltrondata-demo/jobs?uploadType=multipart: Invalid field name "Cast(y, int64)_ef9a46". Fields must contain the allowed characters, and be at most 300 characters long. For allowed characters, please refer to https://cloud.google.com/bigquery/docs/schemas#column_names

Not very sure, we should solve this in ibis or ibisml.

What version of ibis are you using?

9

What backend(s) are you using, if any?

bigquery

Relevant log output

No response

Code of Conduct

jitingxu1 commented 2 weeks ago

Rewrite the problem description with an easy example

jitingxu1 commented 1 week ago

***Updated the error message**

400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/voltrondata-demo/jobs?uploadType=multipart: Invalid field name "Cast(y, int64)_ef9a46". Fields must contain the allowed characters, and be at most 300 characters long. For allowed characters, please refer to https://cloud.google.com/bigquery/docs/schemas#column_names

seems like Cast(y, int64)_ef9a46 is not a valid name for bigquery, because have letters after number.

gforsyth commented 1 week ago

are you running against main? That looks like the bigquery issues I fixed yesterday in #9141 and #9149

jitingxu1 commented 1 week ago

oh, no. Let me pull it from main

are you running against main? That looks like the bigquery issues I fixed yesterday in https://github.com/ibis-project/ibis/pull/9141 and https://github.com/ibis-project/ibis/pull/9149

cpcloud commented 1 week ago

We probably need to see the ibis expression. Column names that look like that should effectively never make it into bigquery.

Can you give a reproducible example that only uses Ibis? We definitely need to see what Ibis APIs you're using to produce this error.

jitingxu1 commented 1 week ago

we convert an intermediate table into a memtable in bis-ml, see link

The intermediate information in the expression tree got lost, unfortunately, the new name "Cast(y, float64)" is not a valid bigquery column name.

I think it is not an ibis issue, we could fix this in ibisml.

Here is the simplified code to reproduce the error:

import ibis
ibis.options.interactive = True

con = ibis.bigquery.connect(project_id="xx", dataset_id="xx")

t = ibis.memtable({
    "x": ["a", "b"],
    "y": ["0", "1"]
})
con.create_table(
    "t", t.to_pyarrow(), overwrite=True
)

t = con.table("t")

base_table = t.drop("y")
target_table = t.y.cast(dt.float64)

y_m = ibis.memtable(target_table.as_table().to_pyarrow())
y_m = y_m.mutate(x=ibis.literal("a"))

base_table.join(y_m, "x")

And the new name "Cast(y, float64)" is not a valid name in bigquery.

ibis.to_sql(y_m)
sql
SELECT
  "t0"."Cast(y, float64)",
  'a' AS "x"
FROM "ibis_pyarrow_memtable_cimh6vdzlvhyxlddji3g6jvugq" AS "t0"