ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug: Inconsistent I/O behavior in BigQuery backend regarding dataset specification #10547

Open everdark opened 4 days ago

everdark commented 4 days ago

What happened?

Ibis behaves inconsistently when reading tables from, and writing tables to, a BigQuery dataset.

When reading a table, ibis does not require the connection to specify dataset_id in ibis.bigquery.connect. We can specify the dataset either with a namespaced table name such as dataset_name.table_name or with the database argument of the read call. Ibis raises if we specify both. It also raises when no dataset_id is set on the connection and neither a namespace nor a database argument is given when reading the table.

For example, the following code will raise IbisInputError: Cannot specify database both in the table name and as an argument:

import ibis

conn = ibis.bigquery.connect("project", location="region")
conn.table("test1.test", database="test2")

while the following calls are fine, and both read the table test from dataset test1:

conn = ibis.bigquery.connect("project", location="region")
conn.table("test1.test")
conn.table("test", database="test1")

So far so good. However, things are very different when writing a table.

When writing a table, a namespaced table name alone does NOT work, which means a dataset must be specified either as a connection argument (dataset_id) or as an argument to the write call (database). The latter overrides the former. A surprising behavior is that, when BOTH a namespaced table name and a database argument are specified, the dataset in the namespace overrides the argument.

For example, the following will raise ValueError: Unable to determine BigQuery dataset.:

conn = ibis.bigquery.connect("project", location="region")
conn.create_table("test1.test", data)

and this will (surprisingly) save the table test to dataset test1:

conn = ibis.bigquery.connect("project", location="region")
conn.create_table("test1.test", data, database="test2")

which is rather confusing.
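
To make the write-side rules reported above concrete, here is a toy model in plain Python. It is not ibis's implementation: `observed_write_dataset` and its parameters are hypothetical names, and it only encodes the behavior described in this report.

```python
def observed_write_dataset(name, database=None, dataset_id=None):
    """Toy model of where create_table writes, per this report (not ibis code)."""
    # The database argument takes precedence over the connection's dataset_id.
    dataset = database if database is not None else dataset_id
    if dataset is None:
        # A namespaced table name alone is not enough to pick a dataset.
        raise ValueError("Unable to determine BigQuery dataset.")
    if "." in name:
        # Surprise: the namespace in the table name silently wins.
        # Assumption: the same happens when only dataset_id is set;
        # the report does not cover that case.
        dataset, name = name.split(".", 1)
    return dataset, name
```

Under this model, `observed_write_dataset("test1.test", database="test2")` returns `("test1", "test")`, matching the surprising result above.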

The expected behavior (for consistency) is that we can save the table either with a namespaced table name and no database argument, or with a plain table name and a database argument, and that ibis raises when both are specified.
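
A minimal sketch of that rule in plain Python might look like the following. `resolve_dataset` is a hypothetical helper, not an ibis function, and `ValueError` merely stands in for whatever exception ibis would choose (reads currently raise IbisInputError for the conflicting case).

```python
def resolve_dataset(name, database=None, dataset_id=None):
    """Return (dataset, table) using one rule for both reads and writes."""
    # Split on the last dot; sep is "" when the name has no namespace.
    namespace, sep, table = name.rpartition(".")
    if sep and database is not None:
        # Both sources given: refuse instead of silently picking one.
        raise ValueError(
            "Cannot specify database both in the table name and as an argument"
        )
    if sep:
        return namespace, table
    # Fall back to the database argument, then the connection's dataset_id.
    dataset = database if database is not None else dataset_id
    if dataset is None:
        raise ValueError("Unable to determine BigQuery dataset.")
    return dataset, name
```

With this rule, `resolve_dataset("test1.test")` and `resolve_dataset("test", database="test1")` both return `("test1", "test")`, while `resolve_dataset("test1.test", database="test2")` raises.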

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

gforsyth commented 4 days ago

Thanks for reporting this, @everdark!

BigQuery is the only backend (I think) that supports dotted path locations as the table name, which dates back to when it was independently maintained as a third-party backend.

We should definitely resolve the idiosyncrasies here.