eriknw closed this issue 2 years ago
This is a bug in dask, rather than dask_cudf. The `finalize` method of `dask.dataframe.Series` doesn't know anything about the keyword arguments passed to `concat`:
import pandas as pd
import dask.dataframe as dd
s = pd.Series([1,2,3], index=pd.RangeIndex(4, 7))
ds = dd.from_pandas(s, npartitions=1)
assert (s.index == ds.compute().index).all()
cs = pd.concat([s], ignore_index=True)
dcs = dd.concat([ds], ignore_index=True)
assert (cs.index == dcs.compute().index).all()  # => AssertionError
Dask doesn't purport to support `concat` with `ignore_index=True`. `dask_cudf` does. See: https://github.com/dask/dask/issues/8353
> `dask_cudf` does [claim to support `ignore_index=True`]
Where do you see this? The definition of `concat` in dask_cudf is just `concat = dask.dataframe.concat`.
[I'm not arguing that this shouldn't be supported, btw]
That is a great question. I... don't recall why I thought that. Maybe stepping through pdb twists the brain?
In other news, this now works for me (and is a sufficient workaround):
dask_cudf.concat([ds1.reset_index(drop=True), ds2.reset_index(drop=True)]).compute()
Previously, the index dtype would remain the same (such as int32) after `reset_index`. Now the index dtype changes to int64. Either that, or I was seeing things earlier.
I'm now content with my workaround; happy to close this issue.
> That is a great question. I... don't recall why I thought that. Maybe stepping through pdb twists the brain?

:). Will close, but watch the dask-side issue, which, if fixed, would fix this too.
Describe the bug
`dask_cudf.concat(args, ignore_index=True)` raises when the indices have different dtypes, so the indices are not in fact ignored.

Steps/Code to reproduce bug
Traceback
```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [9], line 1
----> 1 dask_cudf.concat([ds1, ds2], ignore_index=True).compute()

File ~/miniconda3/envs/cugraph_dev13/lib/python3.9/site-packages/dask/base.py:315, in DaskMethodsMixin.compute(self, **kwargs)
    291 def compute(self, **kwargs):
    292     """Compute this dask collection
    293
    294     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    313     dask.base.compute
    314     """
--> 315     (result,) = compute(self, traverse=False, **kwargs)
    316     return result

File ~/miniconda3/envs/cugraph_dev13/lib/python3.9/site-packages/dask/base.py:601, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    598     postcomputes.append(x.__dask_postcompute__())
    600 results = schedule(dsk, keys, **kwargs)
--> 601 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/miniconda3/envs/cugraph_dev13/lib/python3.9/site-packages/dask/base.py:601, in
```

Expected behavior
To not raise and give the same result as `cudf`.
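As a minimal CPU-only sketch of that expected behavior, the snippet below uses pandas as a stand-in, on the assumption that cudf follows the pandas semantics for `ignore_index=True`. The data and variable names are illustrative only. Two Series with indices of different dtypes are concatenated, and the input indices are discarded in favor of a fresh default index.

```python
import pandas as pd

# Illustrative inputs: one integer index, one string (object) index.
s1 = pd.Series([1, 2, 3], index=pd.Index([4, 5, 6]))
s2 = pd.Series([4, 5, 6], index=pd.Index(["a", "b", "c"]))

# With ignore_index=True the original indices play no role in the result;
# pandas assigns a new default index covering all rows.
expected = pd.concat([s1, s2], ignore_index=True)

print(expected.index)
print(expected.tolist())
```

Because the original indices are thrown away, their dtypes should not be able to cause an error, which is the behavior the report asks `dask_cudf.concat` to match.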
Additional context
This actually happens from within the `finalize` method of dask.dataframe.