rapidsai / dask-cudf

[ARCHIVED] Dask support for distributed GDF object --> Moved to cudf
https://github.com/rapidsai/cudf
Apache License 2.0

Joins Segfault #66

Closed mrocklin closed 5 years ago

mrocklin commented 5 years ago

Currently our joins segfault. This is probably most easily reproduced by running the current test suite.

I apply this patch to un-skip the tests:

diff --git a/dask_cudf/tests/test_join.py b/dask_cudf/tests/test_join.py
index 828f73e..63b601c 100644
--- a/dask_cudf/tests/test_join.py
+++ b/dask_cudf/tests/test_join.py
@@ -10,7 +10,6 @@ import dask_cudf as dgd
 param_nrows = [5, 10, 50, 100]

-@pytest.mark.skip(reason="Join implementation not updated")
 @pytest.mark.parametrize("left_nrows", param_nrows)
 @pytest.mark.parametrize("right_nrows", param_nrows)
 @pytest.mark.parametrize("left_nkeys", [4, 5])

Then I run the tests:

mrocklin@dgx16:~/dask-cudf$ py.test dask_cudf/tests/test_join.py --verbose
========================================== test session starts ==========================================
platform linux -- Python 3.6.7, pytest-4.0.1, py-1.7.0, pluggy-0.8.0 -- /home/nfs/mrocklin/miniconda/bin/python
cachedir: .pytest_cache
rootdir: /home/nfs/mrocklin/dask-cudf, inifile:
collected 260 items

dask_cudf/tests/test_join.py::test_join_inner[4-4-5-5] Segmentation fault

mrocklin commented 5 years ago

Actually, that was on an old cudf build. Here is the exception I get instead, before the tests eventually fail with "Aborted":

left_nrows = 5, right_nrows = 5, left_nkeys = 4, right_nkeys = 4

    @pytest.mark.parametrize("left_nrows", param_nrows)
    @pytest.mark.parametrize("right_nrows", param_nrows)
    @pytest.mark.parametrize("left_nkeys", [4, 5])
    @pytest.mark.parametrize("right_nkeys", [4, 5])
    def test_join_inner(left_nrows, right_nrows, left_nkeys, right_nkeys):
        chunksize = 50

        np.random.seed(0)

        # cuDF
        left = gd.DataFrame(
            {
                "x": np.random.randint(0, left_nkeys, size=left_nrows),
                "a": np.arange(left_nrows),
            }.items()
        )
        right = gd.DataFrame(
            {
                "x": np.random.randint(0, right_nkeys, size=right_nrows),
                "a": 1000 * np.arange(right_nrows),
            }.items()
        )

        expect = left.set_index("x").join(
            right.set_index("x"), how="inner", sort=True, lsuffix="l", rsuffix="r"
        )
        expect = expect.to_pandas()

        # dask_cudf
        left = dgd.from_cudf(left, chunksize=chunksize)
        right = dgd.from_cudf(right, chunksize=chunksize)

        joined = left.set_index("x").join(
>           right.set_index("x"), how="inner", lsuffix="l", rsuffix="r"
        )

dask_cudf/tests/test_join.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask_cudf/core.py:339: in join
    meta = self._meta.join(other._meta, how=how, lsuffix=lsuffix, rsuffix=rsuffix)
../cudf/python/cudf/dataframe/dataframe.py:1234: in join
    rsuffix=rsuffix, method=method)
../cudf/python/cudf/dataframe/dataframe.py:1052: in merge
    method=method)
cudf/bindings/join.pyx:26: in cudf.bindings.join.join
    ???
cudf/bindings/join.pyx:122: in cudf.bindings.join.join
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   cudf.bindings.GDFError.GDFError: CUDA ERROR. b'cudaErrorInvalidDevicePointer': b'invalid device pointer'

kkraus14 commented 5 years ago

cc @dantegd looks like this could be at the Cython layer

mrocklin commented 5 years ago

It's an issue with merging on empty dataframes:

import cudf
df = cudf.DataFrame({'x': []})
df.merge(df, on=['x'])
ERROR: CUDA Runtime call cudaPeekAtLastError() in line 566 of file /home/nfs/mrocklin/cudf/cpp/src/join/joining.cu failed with invalid device pointer (17).
---------------------------------------------------------------------------
GDFError                                  Traceback (most recent call last)
<ipython-input-3-a5f3db0f3305> in <module>
----> 1 df.merge(df, on=['x'])

~/cudf/python/cudf/dataframe/dataframe.py in merge(self, other, on, how, lsuffix, rsuffix, type, method)
   1050
   1051         cols, valids = cpp_join.join(lhs._cols, rhs._cols, on, how,
-> 1052                                      method=method)
   1053
   1054         df = DataFrame()

~/cudf/python/cudf/bindings/join.pyx in cudf.bindings.join.join()

~/cudf/python/cudf/bindings/join.pyx in cudf.bindings.join.join()

~/cudf/python/cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

GDFError: CUDA ERROR. b'cudaErrorInvalidDevicePointer': b'invalid device pointer'

mrocklin commented 5 years ago

Happy to move this to cudf if desired

mrocklin commented 5 years ago

(dask does this in order to get the dtypes and such for the output dataframe without doing any work)
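
A minimal sketch of that metadata step, using pandas as a stand-in for cudf so that it runs without a GPU. The frames and the `left_meta` / `right_meta` names are illustrative assumptions, not dask-cudf internals; the point is only that joining zero-row frames yields the output schema without touching any real data.

```python
import numpy as np
import pandas as pd  # stand-in for cudf (assumption: same join API shape)

# Small concrete frames, shaped like the ones in the failing test.
left = pd.DataFrame({"x": [0, 1, 1], "a": np.arange(3)})
right = pd.DataFrame({"x": [1, 1, 2], "a": 1000 * np.arange(3)})

# Zero-row slices keep the column names and dtypes but hold no data.
left_meta = left.iloc[:0]
right_meta = right.iloc[:0]

# Joining the empty frames is cheap and describes the real join's output.
meta = left_meta.set_index("x").join(
    right_meta.set_index("x"), how="inner", lsuffix="l", rsuffix="r"
)

print(list(meta.columns))  # ['al', 'ar']
print(len(meta))           # 0
print(meta.dtypes)
```

Because this join on empty inputs happens before any partition is processed, a library that cannot merge empty dataframes fails at graph-construction time, which matches the `self._meta.join(...)` frame in the traceback above.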

kkraus14 commented 5 years ago

Was about to ask if this happens in the actual merge or the meta calculation. Could you raise an issue in cuDF about handling empty dataframes in merges? Thanks!
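
For reference, a sketch of what pandas does with the same empty merge. Treating this as the behaviour cudf should match is an assumption on my part, not something decided in this thread.

```python
import pandas as pd  # reference behaviour only; cudf is not used here

# Same shape as the cudf reproduction: one column, zero rows.
df = pd.DataFrame({"x": []})

# pandas returns an empty frame with the key column instead of raising.
result = df.merge(df, on=["x"])

print(len(result))
print(list(result.columns))
```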

mrocklin commented 5 years ago

Will do. Sorry I didn't dive into this earlier.

kkraus14 commented 5 years ago

No worries, I think the bigger thing here is our error messages are cryptic at best 😅

jrhemstad commented 5 years ago

@mrocklin can you confirm that https://github.com/rapidsai/cudf/pull/691 fixes this issue?

mrocklin commented 5 years ago

Honestly, I haven't set up a nice build process on my machine yet, so I may be slow to test this (I'm also playing catch-up today). This is on my radar and is high priority for me, but I don't recommend blocking on my engagement here. If that PR fixes the cudf issue then I encourage you all to merge it.

mrocklin commented 5 years ago

I can raise more issues if the problem persists.

kkraus14 commented 5 years ago

Will merge the PR and close this issue once CI reports green. If there are subsequent issues, let's open new issues to track them. Thanks!

mrocklin commented 5 years ago

This seems to be resolved. There are other join issues coming up that I'll discuss in https://github.com/rapidsai/dask-cudf/pull/67