Open lmeyerov opened 3 years ago
Hi @lmeyerov,
We were struggling with these issues a few weeks ago and did not find a decent solution. Because we use Dask-cuDF behind a low-code solution, we don't want users to struggle with finding the right data types when they are just loading data. As @lmeyerov comments, this may be out of scope for Dask-cuDF. Does the RAPIDS team have any idea of a possible approach to handling this?
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Is your feature request related to a problem? Please describe.
It's been frustrating adapting cudf -> dask_cudf kernels in two basic areas around cross-partition type mismatches:
ingest: loading json, csv, etc. whose column types vary across partitions: existence, nans, int vs float, etc. When the code writer isn't the user (a library, a piece of software, a UI), this is common, and you can't just work around it by specifying dtypes ahead of time
compute: when doing data cleaning (ex: date inference) or some algs, it's unclear what meta should be ahead of time; you only know after you actually do the calc. dask will sample the first df... which is often wrong
Describe the solution you'd like
dask_cudf ingest operators: an auto-coercion flag ("when columns are in conflict across partitions, coerce to the closest common type, like float or str")
dask_cudf map, concat, etc: same thing
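To make the "closest common type" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `PROMOTION_ORDER` list, the `common_type` and `unify_schemas` helpers, and the use of dtype names as strings are hypothetical, not an existing dask_cudf API. It only shows how per-partition schemas could be merged before any coercion happens.

```python
from typing import Dict, Iterable

# Hypothetical widening order: each entry is assumed able to represent
# every entry before it (int64 -> float64 can lose precision for large ints).
PROMOTION_ORDER = ["bool", "int64", "float64", "str"]


def common_type(a: str, b: str) -> str:
    """Return whichever of the two dtype names sits later in PROMOTION_ORDER."""
    return max(a, b, key=PROMOTION_ORDER.index)


def unify_schemas(partition_schemas: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-partition {column: dtype} mappings into one target schema.

    Columns that are missing from some partitions are kept; filling them
    with nulls would be up to the caller.
    """
    unified: Dict[str, str] = {}
    for schema in partition_schemas:
        for column, dtype in schema.items():
            if column in unified:
                unified[column] = common_type(unified[column], dtype)
            else:
                unified[column] = dtype
    return unified


if __name__ == "__main__":
    schemas = [
        {"a": "int64", "b": "float64"},
        {"a": "float64", "c": "str"},
        {"a": "int64", "b": "str"},
    ]
    print(unify_schemas(schemas))
```

An opt-in flag could then cast each partition to the unified schema, which keeps the behavior predictable: the result depends only on the set of dtypes seen, not on which partition happens to come first.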
Describe alternatives you've considered
It may also be possible to make each operator smarter via sampling or other tricks. dask core and some cudf io seem to be experimenting here.
I like explicit flags b/c of their predictability/reliability, and uniformity... but ultimately, whatever works :)
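For comparison, the sampling alternative might look roughly like the sketch below. It is plain Python over lists of row dicts rather than real cudf partitions, and `infer_schema`, `value_kind`, and `WIDENING` are hypothetical names, not anything dask or cudf provides. It shows why looking only at the first partition can give the wrong answer.

```python
import random
from typing import Any, Dict, List, Optional

# Hypothetical widening order, narrowest to widest.
WIDENING = ["bool", "int64", "float64", "str"]


def value_kind(value: Any) -> Optional[str]:
    """Map a Python value to a dtype name; None carries no type information."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    return "str"


def infer_schema(
    partitions: List[List[Dict[str, Any]]], n_samples: int, seed: int = 0
) -> Dict[str, str]:
    """Infer {column: dtype} from up to n_samples randomly chosen partitions."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(partitions)), min(n_samples, len(partitions)))
    schema: Dict[str, str] = {}
    for index in picked:
        for row in partitions[index]:
            for column, value in row.items():
                kind = value_kind(value)
                if kind is None:
                    continue
                previous = schema.get(column)
                schema[column] = (
                    kind if previous is None else max(previous, kind, key=WIDENING.index)
                )
    return schema


if __name__ == "__main__":
    parts = [
        [{"x": 1}, {"x": 2}],
        [{"x": 1.5, "y": "a"}],
        [{"x": None, "y": "b"}],
    ]
    print(infer_schema(parts[:1], 1))  # first partition only
    print(infer_schema(parts, 3))      # every partition sampled
```

Sampling fewer partitions than exist can still miss a conflicting one, which is the reliability gap that an explicit flag avoids.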
Additional context
By default, I'm guessing this issue will be ignored & deprioritized ;-)
Before doing that, it may be worth polling dask_cudf users -- not devs -- on how they feel about this ;-) My bet is people spend a surprising % of their time on a few issues around here, well before actual perf