ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.42k stars 544 forks source link

refactor(api): remove the materialize method from TableExprs #3339

Closed cpcloud closed 2 years ago

cpcloud commented 2 years ago

This PR removes the .materialize() method from TableExpr.

This method causes a lot of confusion, and created a situation where doing simple things like

t.join(s, t.foo == s.foo).select(["unambiguous_column"])

raised an exception.

It turns out that materialize isn't necessary, but a few changes were necessary:

  1. Add a join suffixes API (#2997) to resolve overlapping column names. This is a new API.
  2. Disable predicate pushdown in the case when there's a predicate on a Selection. This was necessary to get the pandas and dask backends to work without materialize. This was working before because the onus to project columns out was placed on the user.

This is a breaking change for any code that uses materialize.

~However, existing code that does not use materialize should continue to work as is.~

There are some breaking changes introduced here in the case of overlapping column names.

If there are any overlapping column names, a suffix will be attached to both the left and right tables. So, in the case of s.asof_join(t, "time") the resulting schema will have both a time_x and a time_y column.

See my explanation here https://github.com/ibis-project/ibis/issues/1223#issuecomment-345476096 for why this is the case.

~#1045~ the initial problem is addressed, but another bug was exposed Closes #1387 Closes #2427 Closes #2997 Closes #3090 Closes #3295

github-actions[bot] commented 2 years ago

Unit Test Results

       42 files         42 suites   1h 25m 52s :stopwatch:   8 700 tests   6 318 :heavy_check_mark: 2 382 :zzz: 0 :x: 32 270 runs  24 081 :heavy_check_mark: 8 189 :zzz: 0 :x:

Results for commit 94d15f88.

:recycle: This comment has been updated with latest results.

wesm commented 2 years ago

cc @tswast

wesm commented 2 years ago

It's been a while since I worked on this part of the codebase, but IIRC one of the main purposes of of materialize was to put off the resolution of overlapping column names outside of the join predicates. Is it necessary to disambiguate the column names even for keys that are included in equijoin predicates? For a $table0.$COL = $table1.$COL predicate, wouldn't it be more intuitive for only $COL to appear in the result rather than $COL_x and $COL_y?

cpcloud commented 2 years ago

@wesm Yeah, that came up #1223.

There's only a single case where it's possible to do this and that's inner joins where the join comparison operation is equality.

I walk through an example in that issue laying out why disambiguation is necessary to prevent information loss.

Given the number of issues and confusion that have come up because of materialize it seems like a reasonable tradeoff for users to have to handle the disambiguation.

It could be a useful follow up to investigate how much work it is to handle the single case where it is possible.

cpcloud commented 2 years ago

The main challenge is coming up with a reliable and maintainable function that takes a join predicate as input and returns a boolean indicating whether the involved keys are potentially ambiguous.

I think this requires full traversal of the expression, since any expression and subexpressions that aren't of the form

t1.$COL_t1 = t2.$COL_t2 AND ... AND tn1.$COL_tn1 = tn.COL_tn

would return False and the only way to deduce that would be to traverse the entire predicate expression.

tswast commented 2 years ago

I've definitely been confused by materialize() before, so +1 on this change. Tested this PR locally against the ibis-bigquery package ~and didn't see any failures that would have been caused by this.~ That said, I'm not sure if we have test cases for the overlapping equijoin predicates Wes identified.

There do appear to be some ~whitespace~ changes since the last time I ran the test suite, ~but that's not such a big deal~:

E           SELECT count(`foo`) AS `count`
E           FROM (
E             SELECT `string_col`, sum(`float_col`) AS `foo`
E             FROM (
E               SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
E               FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`
E         -     WHERE `timestamp_col` < @my_param
E             ) t1
E         +   WHERE `timestamp_col` < @my_param
E             GROUP BY 1
E           ) t0
tswast commented 2 years ago

Actually, that failure isn't just whitespace, it moved WHERE timestamp_col < @my_param to a different subquery.

Edit: I've also confirmed that this failure does not happen on master

cpcloud commented 2 years ago

Interesting. I'll need to add a test case.

Regarding overlapping predicates, we have a good number of tests for that style of usage, it's one of the more common things for pandas users to try.

cpcloud commented 2 years ago

@tswast Can you post the ibis expression for that query?

tswast commented 2 years ago

Sure thing. Here's the full test output, including the ibis expression:

(ibis-bigquery-dev) ➜  ibis-bigquery git:(d0e43ef) pytest tests/system/test_client.py::test_subquery_scalar_params -vv
============================= test session starts ==============================
platform darwin -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /usr/local/Caskroom/miniconda/base/envs/ibis-bigquery-dev/bin/python
cachedir: .pytest_cache
rootdir: /Users/swast/src/github.com/ibis-project/ibis-bigquery
plugins: cov-2.12.1, mock-3.6.1
collected 1 item

tests/system/test_client.py::test_subquery_scalar_params FAILED          [100%]

=================================== FAILURES ===================================
_________________________ test_subquery_scalar_params __________________________

alltypes = BigQueryTable[r0, name=swast-scratch.ibis_gbq_testing.functional_alltypes]
  index           int64
  Unnamed_0       i..._string_col string
  string_col      string
  timestamp_col   timestamp
  year            int64
  month           int64
project_id = 'swast-scratch', dataset_id = 'ibis_gbq_testing'

    def test_subquery_scalar_params(alltypes, project_id, dataset_id):
        t = alltypes
        param = ibis.param("timestamp").name("my_param")
        expr = (
            t[["float_col", "timestamp_col", "int_col", "string_col"]][
                lambda t: t.timestamp_col < param
            ]
            .groupby("string_col")
            .aggregate(foo=lambda t: t.float_col.sum())
            .foo.count()
        )
        result = expr.compile(params={param: "20140101"})
        expected = """\
    SELECT count(`foo`) AS `count`
    FROM (
      SELECT `string_col`, sum(`float_col`) AS `foo`
      FROM (
        SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
        FROM `{}.{}.functional_alltypes`
        WHERE `timestamp_col` < @my_param
      ) t1
      GROUP BY 1
    ) t0""".format(
            project_id, dataset_id
        )
>       assert result == expected
E       AssertionError: assert ('SELECT count(`foo`) AS `count`\n'\n 'FROM (\n'\n '  SELECT `string_col`, sum(`float_col`) AS `foo`\n'\n '  FROM (\n'\n '    SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`\n'\n '    FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`\n'\n '  ) t1\n'\n '  WHERE `timestamp_col` < @my_param\n'\n '  GROUP BY 1\n'\n ') t0') == ('SELECT count(`foo`) AS `count`\n'\n 'FROM (\n'\n '  SELECT `string_col`, sum(`float_col`) AS `foo`\n'\n '  FROM (\n'\n '    SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`\n'\n '    FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`\n'\n '    WHERE `timestamp_col` < @my_param\n'\n '  ) t1\n'\n '  GROUP BY 1\n'\n ') t0')
E           SELECT count(`foo`) AS `count`
E           FROM (
E             SELECT `string_col`, sum(`float_col`) AS `foo`
E             FROM (
E               SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
E               FROM `swast-scratch.ibis_gbq_testing.functional_alltypes`
E         -     WHERE `timestamp_col` < @my_param
E             ) t1
E         +   WHERE `timestamp_col` < @my_param
E             GROUP BY 1
E           ) t0

tests/system/test_client.py:230: AssertionError
=========================== short test summary info ============================
FAILED tests/system/test_client.py::test_subquery_scalar_params - AssertionEr...
============================== 1 failed in 42.37s =============================
cpcloud commented 2 years ago

Is it really strange that this happens, given that nearly all the code I touched was related to joins. Having a look now.

cpcloud commented 2 years ago

@tswast It's this line that causes the change.

https://github.com/ibis-project/ibis/pull/3339/files#diff-9bfeef14b16e62a849fc8c72b04bfab4fc8cee65fa236f1ad54bd440fded7bcfR769

cpcloud commented 2 years ago

@tswast The query is still correct though isn't it?

cpcloud commented 2 years ago

Unfortunately, to remove materialize I had to remove some of ibis's overly aggressive predicate pushdown.

The generated code produced is probably not a good thing to explicitly test, as we will need to make changes from time to time to the to compilers that result in different generated code to fix bugs or add features.

tswast commented 2 years ago

@tswast The query is still correct though isn't it?

Yes. Appears to be. I'm able to execute it, which I wasn't sure I would be able to do because of the subquery location.

Unfortunately, to remove materialize I had to remove some of ibis's overly aggressive predicate pushdown.

Makes sense. BigQuery has its own query optimizer, so I'm not all too worried about that.

The generated code produced is probably not a good thing to explicitly test, as we will need to make changes from time to time to the to compilers that result in different generated code to fix bugs or add features.

We can update the tests to actually execute the SQL. It might increase test time by quite a bit, though.

cpcloud commented 2 years ago

@tswast On the other hand, this was a nice catch, so I can add this as a test to the generic SQL test suite and we can take on the burden of finding this first at least!

cpcloud commented 2 years ago

Added a test for the subquery case. Thanks @tswast!

cpcloud commented 2 years ago

Thinking about this some more, I think it's probably best to keep a simpler suffixes API here. It's familiar elsewhere (pandas, dplyr) and more general renaming can be done with TableExpr.relabel

pep8speaks commented 2 years ago

Hello @cpcloud! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-01-25 23:27:29 UTC