Closed VibhuJawa closed 4 years ago
Confirmed, I can reproduce. It seems we're losing index names in general when computing. A more general reproducer:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import cudf
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)
df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [5, 1, 2, 5]})
df = df.set_index("b")
dask_df = dask_cudf.from_cudf(df, npartitions=2)
computed_index = dask_df.index.compute()
assert computed_index.name == "b"
Good reproducer @kkraus14 - It seems that the problem shows up even without using dask_cudf (i.e. using LocalCluster).
This may have the same cause as #3420, but cudf does seem to preserve the index name in a round-trip serialization. After the repro above:
(h, f) = df[:0].serialize()
cudf.DataFrame.deserialize(h,f).index.name
Output:
'b'
Looks like the issue is that we're not hitting the DataFrame serialization/deserialization, but rather pickling:
In [10]: a
Out[10]:
   a
b
3  1
4  2
5  3
In [11]: pickle.loads(pickle.dumps(a))
Out[11]:
   a
3  1
4  2
5  3
I can submit a fix for the pickling but it looks like there's an orthogonal problem here.
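To illustrate the kind of pickling bug described here, below is a minimal sketch using a hypothetical `ToyIndex` class. It is not cudf's actual implementation; the class names and the `__reduce__` bodies are assumptions made purely for illustration. The idea is that if `__reduce__` rebuilds the object from its values alone, the name never makes it through the round trip, and passing the name along in the reconstruction arguments is one way to fix that.

```python
import pickle


class ToyIndex:
    """Hypothetical stand-in for an index type; not cudf's implementation."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name


class LossyIndex(ToyIndex):
    def __reduce__(self):
        # Bug: only the values are handed back to the constructor,
        # so the name is silently dropped when unpickling.
        return (LossyIndex, (self.values,))


class FixedIndex(ToyIndex):
    def __reduce__(self):
        # Fix: include the name in the reconstruction arguments.
        return (FixedIndex, (self.values, self.name))


lossy = pickle.loads(pickle.dumps(LossyIndex([3, 4, 5], name="b")))
fixed = pickle.loads(pickle.dumps(FixedIndex([3, 4, 5], name="b")))

print(lossy.name)  # None
print(fixed.name)  # b
```

A regression test for the real fix would presumably follow the same shape: pickle an object with a named index, unpickle it, and assert the name survives.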
..it looks like there's an orthogonal problem here.
Indeed - I'm trying to figure out why we are pickling everything.
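One generic way to track down where pickling is being triggered is to make the object report the call stack whenever pickle touches it. The sketch below uses only the standard library; `PickleProbe`, `PICKLE_CALLS`, and `send` are hypothetical names, and `send` is just a stand-in for whatever transport layer falls back to pickle, not a dask or distributed API.

```python
import pickle
import traceback

# Stack traces captured each time the probe is pickled.
PICKLE_CALLS = []


class PickleProbe:
    """Hypothetical helper: records the call stack whenever it is pickled."""

    def __init__(self, payload):
        self.payload = payload

    def __reduce_ex__(self, protocol):
        # Record who asked for pickling, then defer to the default behaviour.
        PICKLE_CALLS.append(traceback.format_stack(limit=8))
        return super().__reduce_ex__(protocol)


def send(obj):
    # Stand-in for a transport layer that falls back to pickle.
    return pickle.dumps(obj)


restored = pickle.loads(send(PickleProbe({"index_name": "b"})))

# Print the captured stack to see which caller reached pickle.
print("".join(PICKLE_CALLS[0]))
```

The same trick could in principle be applied temporarily to the real class under investigation, so that the printed stack shows which code path chose pickle over the custom serialization.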
Group by aggregate fails in distributed settings
The code below fails when launched with LocalCUDACluster but works without it.
Stack Trace:
Additional context:
Below Fails
Below Works
Environment Info
This environment was built after https://github.com/rapidsai/cudf/pull/3741 was merged, and I can confirm that the issues in https://github.com/rapidsai/cudf/issues/3719 are resolved in my environment.
CC: @rjzamora @beckernick @shwina