Closed cpcloud closed 2 years ago
42 files 42 suites 1h 25m 52s :stopwatch: 8 700 tests 6 318 :heavy_check_mark: 2 382 :zzz: 0 :x: 32 270 runs 24 081 :heavy_check_mark: 8 189 :zzz: 0 :x:
Results for commit 94d15f88.
:recycle: This comment has been updated with latest results.
cc @tswast
It's been a while since I worked on this part of the codebase, but IIRC one of the main purposes of of materialize
was to put off the resolution of overlapping column names outside of the join predicates. Is it necessary to disambiguate the column names even for keys that are included in equijoin predicates? For a $table0.$COL = $table1.$COL
predicate, wouldn't it be more intuitive for only $COL
to appear in the result rather than $COL_x
and $COL_y
?
@wesm Yeah, that came up #1223.
There's only a single case where it's possible to do this and that's inner joins where the join comparison operation is equality.
I walk through an example in that issue laying out why disambiguation is necessary to prevent information loss.
Given the number of issues and confusion that have come up because of materialize it seems like a reasonable tradeoff for users to have to handle the disambiguation.
It could be a useful follow up to investigate how much work it is to handle the single case where it is possible.
The main challenge is coming up with a reliable and maintainable function that takes a join predicate as input and returns a boolean indicating whether the involved keys are potentially ambiguous.
I think this requires full traversal of the expression, since any expression and subexpressions that aren't of the form
t1.$COL_t1 = t2.$COL_t2 AND ... AND tn1.$COL_tn1 = tn.COL_tn
would return False
and the only way to deduce that would be to traverse the entire predicate expression.
I've definitely been confused by materialize()
before, so +1 on this change. Tested this PR locally against the ibis-bigquery
package ~and didn't see any failures that would have been caused by this.~ That said, I'm not sure if we have test cases for the overlapping equijoin predicates Wes identified.
There do appear to be some ~whitespace~ changes since the last time I ran the test suite, ~but that's not such a big deal~:
E SELECT count(`foo`) AS `count`
E FROM (
E SELECT `string_col`, sum(`float_col`) AS `foo`
E FROM (
E SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
E FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`
E - WHERE `timestamp_col` < @my_param
E ) t1
E + WHERE `timestamp_col` < @my_param
E GROUP BY 1
E ) t0
Actually, that failure isn't just whitespace, it moved WHERE timestamp_col < @my_param
to a different subquery.
Edit: I've also confirmed that this failure does not happen on master
Interesting. I'll need to add a test case.
Regarding overlapping predicates, we have a good number of tests for that style of usage, it's one of the more common things for pandas users to try.
@tswast Can you post the ibis expression for that query?
Sure thing. Here's the full test output, including the ibis expression:
(ibis-bigquery-dev) ➜ ibis-bigquery git:(d0e43ef) pytest tests/system/test_client.py::test_subquery_scalar_params -vv
============================= test session starts ==============================
platform darwin -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /usr/local/Caskroom/miniconda/base/envs/ibis-bigquery-dev/bin/python
cachedir: .pytest_cache
rootdir: /Users/swast/src/github.com/ibis-project/ibis-bigquery
plugins: cov-2.12.1, mock-3.6.1
collected 1 item
tests/system/test_client.py::test_subquery_scalar_params FAILED [100%]
=================================== FAILURES ===================================
_________________________ test_subquery_scalar_params __________________________
alltypes = BigQueryTable[r0, name=swast-scratch.ibis_gbq_testing.functional_alltypes]
index int64
Unnamed_0 i..._string_col string
string_col string
timestamp_col timestamp
year int64
month int64
project_id = 'swast-scratch', dataset_id = 'ibis_gbq_testing'
def test_subquery_scalar_params(alltypes, project_id, dataset_id):
t = alltypes
param = ibis.param("timestamp").name("my_param")
expr = (
t[["float_col", "timestamp_col", "int_col", "string_col"]][
lambda t: t.timestamp_col < param
]
.groupby("string_col")
.aggregate(foo=lambda t: t.float_col.sum())
.foo.count()
)
result = expr.compile(params={param: "20140101"})
expected = """\
SELECT count(`foo`) AS `count`
FROM (
SELECT `string_col`, sum(`float_col`) AS `foo`
FROM (
SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
FROM `{}.{}.functional_alltypes`
WHERE `timestamp_col` < @my_param
) t1
GROUP BY 1
) t0""".format(
project_id, dataset_id
)
> assert result == expected
E AssertionError: assert ('SELECT count(`foo`) AS `count`\n'\n 'FROM (\n'\n ' SELECT `string_col`, sum(`float_col`) AS `foo`\n'\n ' FROM (\n'\n ' SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`\n'\n ' FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`\n'\n ' ) t1\n'\n ' WHERE `timestamp_col` < @my_param\n'\n ' GROUP BY 1\n'\n ') t0') == ('SELECT count(`foo`) AS `count`\n'\n 'FROM (\n'\n ' SELECT `string_col`, sum(`float_col`) AS `foo`\n'\n ' FROM (\n'\n ' SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`\n'\n ' FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`\n'\n ' WHERE `timestamp_col` < @my_param\n'\n ' ) t1\n'\n ' GROUP BY 1\n'\n ') t0')
E SELECT count(`foo`) AS `count`
E FROM (
E SELECT `string_col`, sum(`float_col`) AS `foo`
E FROM (
E SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
E FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`
E - WHERE `timestamp_col` < @my_param
E ) t1
E + WHERE `timestamp_col` < @my_param
E GROUP BY 1
E ) t0
tests/system/test_client.py:230: AssertionError
=========================== short test summary info ============================
FAILED tests/system/test_client.py::test_subquery_scalar_params - AssertionEr...
============================== 1 failed in 42.37s =============================
Is it really strange that this happens, given that nearly all the code I touched was related to joins. Having a look now.
@tswast It's this line that causes the change.
@tswast The query is still correct though isn't it?
Unfortunately, to remove materialize I had to remove some of ibis's overly aggressive predicate pushdown.
The generated code produced is probably not a good thing to explicitly test, as we will need to make changes from time to time to the to compilers that result in different generated code to fix bugs or add features.
@tswast The query is still correct though isn't it?
Yes. Appears to be. I'm able to execute it, which I wasn't sure I would be able to do because of the subquery location.
Unfortunately, to remove materialize I had to remove some of ibis's overly aggressive predicate pushdown.
Makes sense. BigQuery has its own query optimizer, so I'm not all too worried about that.
The generated code produced is probably not a good thing to explicitly test, as we will need to make changes from time to time to the to compilers that result in different generated code to fix bugs or add features.
We can update the tests to actually execute the SQL. It might increase test time by quite a bit, though.
@tswast On the other hand, this was a nice catch, so I can add this as a test to the generic SQL test suite and we can take on the burden of finding this first at least!
Added a test for the subquery case. Thanks @tswast!
Thinking about this some more, I think it's probably best to keep a simpler suffixes
API here. It's familiar elsewhere (pandas, dplyr) and more general renaming can be done with TableExpr.relabel
Hello @cpcloud! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
This PR removes the
.materialize()
method fromTableExpr
.This method causes a lot of confusion, and created a situation where doing simple things like
raised an exception.
It turns out that
materialize
isn't necessary, but a few changes were necessary:Selection
. This was necessary to get the pandas and dask backends to work without materialize. This was working before because the onus to project columns out was placed on the user.This is a breaking change for any code that uses
materialize
.~However, existing code that does not use
materialize
should continue to work as is.~There are some breaking changes introduced here in the case of overlapping column names.
If there are any overlapping column names, a suffix will be attached to both the left and right tables. So, in the case of
s.asof_join(t, "time")
the resulting schema will have both atime_x
and atime_y
column.See my explanation here https://github.com/ibis-project/ibis/issues/1223#issuecomment-345476096 for why this is the case.
~#1045~ the initial problem is addressed, but another bug was exposed Closes #1387 Closes #2427 Closes #2997 Closes #3090 Closes #3295