[FEA] dask_cudf cross-partition type coercions

lmeyerov commented 3 years ago

Is your feature request related to a problem? Please describe.

It's been frustrating adapting cudf -> dask_cudf kernels in two basic areas around cross-partition type mismatches:

ingest: loading json, csv, etc. whose column types vary across partitions: missing columns, nans, int vs float, etc. When the code writer isn't the user (a library, a piece of software, a UI), this is common, and you can't just work around it by specifying dtypes ahead of time
compute: when doing data cleaning (ex: date inference) or some algorithms, it's unclear what meta should be ahead of time; you only know after you actually do the calc. dask will sample the first df... which is often wrong
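To make the two cases above concrete, here is a minimal sketch of the mismatch. It uses pandas as a CPU stand-in for cudf (an assumption made so the snippet runs without a GPU) and two in-memory CSV strings in place of real partition files; `meta_from_first` is a hypothetical name for "metadata guessed from the first partition only".

```python
import io
import pandas as pd

# Two CSV "partitions" of one logical table, each parsed on its own,
# as a partitioned reader would do for separate files.
part0 = pd.read_csv(io.StringIO("id,score\n1,10\n2,20\n"))
part1 = pd.read_csv(io.StringIO("id,score,note\n3,,\n4,7.5,late\n"))

# Metadata guessed from the first partition only.
meta_from_first = part0.dtypes.to_dict()

# Columns of the second partition that contradict that guess:
# a dtype that differs, or a column the first partition never had.
mismatches = {
    col: (meta_from_first.get(col), dtype)
    for col, dtype in part1.dtypes.items()
    if col not in meta_from_first or meta_from_first[col] != dtype
}

for col, (expected, actual) in mismatches.items():
    print(f"{col}: expected {expected}, got {actual}")
```

The `score` column parses as integer in one partition and as float in the other because of a single missing value, and `note` exists in only one partition, so any schema inferred from the first partition alone is wrong for the second.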

Describe the solution you'd like

dask_cudf ingest operators: an auto-coercion flag ("when columns are in conflict across partitions, coerce to the closest common type, like float or str")

dask_cudf map, concat, etc: same thing
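As a rough sketch of what such a flag could do under the hood: compute one common dtype per column across all partitions, then cast every partition to it. This is not an existing dask_cudf API. `common_schema`, `coerce_partition`, and `_widen` are hypothetical helper names, pandas and NumPy stand in for cudf so the snippet runs on a CPU, and the widening rule (numeric conflicts go to the NumPy result type, anything else falls back to string) is only one possible reading of "closest common type".

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal


def _is_np_numeric(dtype):
    # bool, signed int, unsigned int, or float NumPy dtype
    return isinstance(dtype, np.dtype) and dtype.kind in "biuf"


def _widen(a, b):
    """Hypothetical rule: closest common dtype for two conflicting dtypes."""
    if is_dtype_equal(a, b):
        return a
    if _is_np_numeric(a) and _is_np_numeric(b):
        return np.result_type(a, b)  # e.g. int64 + float64 -> float64
    return pd.StringDtype()  # assumption: fall back to string otherwise


def common_schema(partitions):
    """Pick a single dtype per column across all partitions."""
    schema = {}
    for part in partitions:
        for col, dtype in part.dtypes.items():
            schema[col] = _widen(schema[col], dtype) if col in schema else dtype
    # A column absent from some partition holds nulls there, so bool/int
    # columns are widened to float, which can represent missing values.
    for col, dtype in list(schema.items()):
        absent = any(col not in part.columns for part in partitions)
        if absent and _is_np_numeric(dtype) and dtype.kind in "biu":
            schema[col] = np.dtype("float64")
    return schema


def coerce_partition(part, schema):
    """Add missing columns as nulls, then cast to the shared schema."""
    return part.reindex(columns=list(schema)).astype(schema)


part0 = pd.DataFrame({"id": [1, 2], "score": [10, 20], "code": [7, 8]})
part1 = pd.DataFrame(
    {"id": [3, 4], "score": [None, 7.5], "code": ["x9", "y1"], "note": [None, "late"]}
)

schema = common_schema([part0, part1])
coerced = [coerce_partition(p, schema) for p in (part0, part1)]
combined = pd.concat(coerced, ignore_index=True)
print(combined.dtypes)
```

Note that this needs a pass over every partition's dtypes before any casting, which is the cost an explicit flag would make opt-in. In a dask setting the per-partition cast could presumably be applied with `map_partitions`, with `meta` built from the shared schema instead of from the first partition.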

Describe alternatives you've considered

It may also be possible to make each operator smarter via sampling or other tricks. dask core and some cudf io seem to be experimenting here.

I like explicit flags b/c of their predictability/reliability, and uniformity... but ultimately, whatever works :)

Additional context

By default, I'm guessing this issue will be ignored & deprioritized ;-)

Before doing that, it may be worth polling dask_cudf users -- not devs -- on how they feel about this ;-) my bet is people spend a surprising % of their time on a few issues around here, well before actual perf

argenisleon commented 3 years ago

Hi @lmeyerov,

We were struggling with these issues some weeks ago and did not find a decent solution. As we are using Dask-cuDF behind a low-code solution, we don't want users to struggle to find the right data types when they are just loading the data. As @lmeyerov notes, this may be out of scope for Dask-cuDF. To the RAPIDS team: any ideas on a possible approach to handling this?

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapidsai / cudf

[FEA] dask_cudf cross-partition type coercions #7742