Open Rridley7 opened 1 year ago
Thank you @Rridley7 for raising this. We should be able to process this shape of data in dask_cudf.
Hi, I found that for regular dask the recommended workflow is to use map_partitions and then groupby (e.g. https://saturncloud.io/docs/troubleshooting/package-support/dask/dask_groupby_aggregations/). Might this also be the case for dask_cudf?
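For reference, here is a rough sketch of that partition-wise pre-aggregation pattern applied to dask_cudf; the grouping column "key" and the file paths are placeholders rather than the real names:

```python
import dask_cudf

ddf = dask_cudf.read_csv("data/*.csv")

# Pre-aggregate within each partition first, then combine the partial sums
# globally. The shuffled intermediate is one row per key per partition
# instead of the raw rows, which keeps memory pressure much lower.
partial = ddf.map_partitions(lambda part: part.groupby("key").sum())
result = partial.reset_index().groupby("key").sum()

result.to_csv("grouped-*.csv")
```

Dask's built-in groupby aggregations also accept a split_out= argument (e.g. ddf.groupby("key").sum(split_out=8)), which spreads the aggregated result over several output partitions and may help if the grouped result itself is too large for a single partition.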
Describe the bug
I am loading a large dataframe (~60M rows x 300 columns) from CSV via dask_cudf, then doing a groupby and sum and saving the result back to CSV. I get an OOM error. I am using an A100 80GB GPU along with 200GB of system RAM.
All columns contain numerical values, apart from the groupby column, which is left as the index, so the error should be reproducible with a random dataframe. I noted a similar issue in #10426, but the error message there is different, so I was unsure whether it is the same problem. I also repeatedly get a high-CPU garbage-collection warning; I assume that is because of the size of the dataframe and the many reads/writes, but please correct me if that is not the case.

Steps/Code to reproduce bug
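A minimal hypothetical sketch of the pipeline described above (the file paths and the grouping-column name "key" are placeholders, not the actual names):

```python
import dask_cudf

# Read the CSVs (~60M rows x ~300 numeric columns), group on the key
# column, sum the remaining columns, and write the result back to CSV.
ddf = dask_cudf.read_csv("input/*.csv")
grouped = ddf.groupby("key").sum()
grouped.to_csv("grouped-*.csv")   # the OOM is hit while this is computed and written
```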
Output
I think the error message repeats after the nanny restarts, but I have included the entire message for thoroughness (attached as a file because of its size): dask_to_csv_error.txt
Expected behavior
The groupby and sum complete and the result is written back to CSV without running out of memory.
Environment details
The output of the cudf/print_env.sh script is provided in the collapsed "environment details" section of the original issue.