ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.27k stars 595 forks source link

bug: BigQuery Create table with partition_by argument is not working #10022

Open ruiyang2015 opened 2 months ago

ruiyang2015 commented 2 months ago

What happened?

I am trying to use the BigQuery backend to create a table with a partition by expression, ibis is raising exception. here is a simple code to reproduce the error:

import ibis
c = ibis.bigquery.connect(...)
schema = ibis.schema([ ("c1", "date"), ("c2", "int")])
c.create_table('test_tb', partition_by='DATE_TRUNC(c1, MONTH)') # -> this line is throwing exception 

What version of ibis are you using?

9.4.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/ibis/backends/bigquery/__init__.py", line 1007, in create_table
    raise com.IbisError("One of the `schema` or `obj` parameter is required")
ibis.common.exceptions.IbisError: One of the `schema` or `obj` parameter is required
>>> c.create_table('test_tb', partition_by='DATE_TRUNC(c1, MONTH)', schema=schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/ibis/backends/bigquery/__init__.py", line 1099, in create_table
    self.raw_sql(sql)
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/ibis/backends/bigquery/__init__.py", line 710, in raw_sql
    return self.client.query_and_wait(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 3601, in query_and_wait
    return _job_helpers.query_and_wait(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/cloud/bigquery/_job_helpers.py", line 414, in query_and_wait
    return _wait_or_cancel(
           ^^^^^^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/cloud/bigquery/_job_helpers.py", line 562, in _wait_or_cancel
    return job.result(
           ^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1676, in result
    while not is_job_done():
              ^^^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "/Users/ruiyang/Go/src/github.com/ascend-io/ascend-core/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1625, in is_job_done
    raise job_failed_exception
google.api_core.exceptions.BadRequest: 400 Unrecognized name: D at [1:111]; reason: invalidQuery, location: query, message: Unrecognized name: D at [1:111]


### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
cpcloud commented 1 month ago

It looks like we're not handling the single partition case here. Can you try putting the partition expression in a list, like ["DATE_TRUNC(c1, MONTH)"]?

Long term, we'll continue to support strings here, so the current implementation probably doesn't need to be adjusted except to handle the single-string input case.

We could perhaps discuss supporting deferreds here.

ruiyang2015 commented 1 month ago

It looks like we're not handling the single partition case here. Can you try putting the partition expression in a list, like ["DATE_TRUNC(c1, MONTH)"]?

Long term, we'll continue to support strings here, so the current implementation probably doesn't need to be adjusted except to handle the single-string input case.

We could perhaps discuss supporting deferreds here.

i tried to pass in as a list, also getting error, looks like this feature is not working.

gforsyth commented 1 month ago

Ok, yeah, this looks like something isn't getting formatted properly (I think) -- because this works to create an empty table:

[nav] In [6]: con.raw_sql(f"""CREATE TABLE
         ...:   {con.dataset}.newtable (transaction_id INT64, transaction_date DATE)
         ...: PARTITION BY
         ...:   DATE_TRUNC(transaction_date, MONTH)""")
Out[6]: <google.cloud.bigquery.table._EmptyRowIterator at 0x78220076a9b0>
gforsyth commented 1 month ago

Ok, this fails because we call sg.to_identifier on the inputs to partition_by which leaves us with this SQL:

CREATE TABLE `ibis-gbq`.`_92ff9aec474dffb9d08ce7a2d670d6e934152e1b`.`hi` (`c1` DATE, `c2` INT64) PARTITION BY (`DATE_TRUNC(c1, MONTH)`)

The above would work if we didn't have the function quoted, but unclear to me how we want to handle denoting which arguments to partition_by are identifiers and which aren't.

cpcloud commented 2 weeks ago

I haven't started working on this yet, but I should be able to get to it for the next release (10.0).

everdark commented 2 weeks ago

I realized pass in list works indeed. So for example:

conn.create_table("name", table, partition_by=["refresh_date"])

But may not work with function like DATE_TRUNC in the expression, as shown above.

cpcloud commented 2 weeks ago

Yep, the thing that doesn't work (reported in the OP) is passing a more complex expression for partitioning.