rapidsai / cudf

[BUG] dask_cudf from_delayed throws exception when meta is included in the from_delayed #9975

Open mlahir1 opened 2 years ago

mlahir1 commented 2 years ago

Exception:

Columns: [_col10, _col5, _col0, _col12, visit_date]
Index: [], 'from_delayed')
kwargs:    {}
Exception: "ValueError('Metadata mismatch found in `from_delayed`.\\n\\nExpected partition of type `pandas.core.frame.DataFrame` but got `cudf.core.dataframe.DataFrame`')"

Reproducer:

import cudf
import numpy as np
from dask.dataframe import from_delayed
from dask.delayed import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

df = cudf.DataFrame()
for i in range(100):
    df[f'_col{i}'] = np.random.randint(20, size=100)

df.to_orc('test1.orc')
df.to_orc('test0.orc')

files = [(('a',), 'test0.orc'), (('a',), 'test1.orc')]
print(files)

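def rd(f, cols, meta=None):
    # Relies on the module-level `partitions` list defined below.
    if partitions:
        # Partition columns are filled in from f[0] below, so exclude them from the read.
        cols = list(set(cols) - {x[0] for x in partitions})
    if isinstance(f, str):
        # Normalize a bare path to the (partition_values, path) form.
        f = ((), f)
    df = cudf.read_orc(f[1], columns=cols, use_index=False)
    if partitions:
        # Add each partition value as a column and cast it to the requested dtype.
        for i, col in enumerate(partitions):
            df[col[0]] = f[0][i]
            df[col[0]] = df[col[0]].astype(col[1])
    if meta:
        # Select the columns in the order given by meta.
        return df[list(meta.keys())]
    else:
        return df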

partitions = [('visit_date', 'str')]
cols = ['_col0', '_col12', '_col10', '_col5']
meta = dict(rd(files[0], cols, meta=None).dtypes)
for i in partitions:
    meta[i[0]] = i[1]

print(meta)

def main():
    c = Client(LocalCUDACluster())
    dfs = [delayed(rd)(f, cols=cols, meta=meta) for f in files[:2]]
    xx = dask_cudf.from_delayed(dfs, meta=meta)
    print(xx.compute())

if __name__ == "__main__":
    main()

This issue is very similar to the one reported here: https://github.com/dask/dask/issues/8528

The difference is that from_delayed with meta throws another exception, "The columns in the computed data do not match the columns in the provided metadata", even though both DataFrames have the same columns and dtypes. Since we provide the meta here, we run into a different issue.
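
Side note on the reproducer: `list(set(cols) - set(...))` in `rd` does not preserve the order of `cols`, which is why the traceback lists the columns as `_col10, _col5, _col0, _col12` rather than in the order they were requested. Whether this plays any part in the column-mismatch message is not confirmed here. A minimal order-preserving sketch of that step follows; the helper name `drop_partition_columns` is hypothetical and not part of cudf or dask, and `visit_date` is added to `cols` only to show the filtering.

```python
def drop_partition_columns(cols, partitions):
    """Return cols without the partition column names, keeping the original order."""
    # partitions is a list of (column_name, dtype) pairs, as in the reproducer.
    partition_names = {name for name, _dtype in partitions}
    return [c for c in cols if c not in partition_names]


partitions = [('visit_date', 'str')]
# 'visit_date' is included here only to show that it gets filtered out.
cols = ['_col0', '_col12', '_col10', '_col5', 'visit_date']

ordered = drop_partition_columns(cols, partitions)
print(ordered)  # ['_col0', '_col12', '_col10', '_col5']
```

With this in place of the set difference, the columns read from each file come back in the same order on every worker, independent of string hashing.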

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.