rapidsai / cuxfilter

GPU accelerated cross filtering with cuDF.
https://docs.rapids.ai/api/cuxfilter/stable/
Apache License 2.0
275 stars 67 forks source link

Avoid importing from `dask_cudf.core` #593

Closed rjzamora closed 5 months ago

rjzamora commented 5 months ago

This PR is intended to resolve test failures related to the recent dask-expr migration in dask.dataframe. I noticed that cuxfilter was importing from dask_cudf.core (no longer allowed when query-planning is enabled). There may be other issues as well.

AjayThorve commented 5 months ago

@rjzamora adding the following env variable to both ./ci/test_python.sh and ci/test_wheel.sh should resolve the tests for now:

  1. ci/test_python.sh
    
    DASK_DATAFRAME__QUERY_PLANNING=False pytest \
2. `ci/test_wheel.sh`
```bash
DASK_DATAFRAME__QUERY_PLANNING=False python -m pytest -n 8 ./python/cuxfilter/tests
rjzamora commented 5 months ago

Thanks @AjayThorve !

Just to clarify: My primary goal here is to find/fix the problems with DASK_DATAFRAME__QUERY_PLANNING=True. Therefore, It's fine to switch of query-planning to unblock CI, but in this particular PR, I'd like to make sure query-planning is enabled.

AjayThorve commented 5 months ago

Thanks @AjayThorve !

Just to clarify: My primary goal here is to find/fix the problems with DASK_DATAFRAME__QUERY_PLANNING=True. Therefore, It's fine to switch of query-planning to unblock CI, but in this particular PR, I'd like to make sure query-planning is enabled.

okay, that makes sense! Thanks for the PR!

rjzamora commented 5 months ago

I am only seeing a failure in one test now (cuxfilter/tests/charts/datashader/test_graph_assets.py::TestGraphAssets::test_calc_connected_edges).

As far as I can tell, we are comparing the following output DataFrame:

     x    y  color
0  1.0  1.0   11.0
1  1.0  1.0   11.0
2  NaN  NaN    NaN
3  1.0  1.0   10.0
4  0.0  0.0   10.0
5  NaN  NaN    NaN

...to an "expected" DataFrame:

     x    y  color
0  1.0  1.0   10.0
1  0.0  0.0   10.0
2  NaN  NaN    NaN
3  1.0  1.0   11.0
4  1.0  1.0   11.0
5  NaN  NaN    NaN

Before I dig in too far here, is it possible that both results are correct? That is, does the order of the rows matter here, or just the relationship between "x", "y" and "color"? The new dask_cudf API now uses more upstream algorithms than the legacy API did. So, the ordering of some results may be different.

AjayThorve commented 5 months ago

@rjzamora should I go ahead and merge this? Lgtm.

rjzamora commented 5 months ago

should I go ahead and merge this? Lgtm.

Seems fine to me as long as you think the test_graph_assets.py change makes sense :)

AjayThorve commented 5 months ago

/merge