Closed: siddharthab closed this issue 10 months ago
Technically speaking, Dask-CUDA has no compatibility issues with pandas 2, but for that to be useful you'll also need cuDF to support it. There's ongoing work for that in https://github.com/rapidsai/cudf/pull/13535, which https://github.com/rapidsai/dask-cuda/pull/1213 is waiting on.
We are also happy to accept PRs in both Dask-CUDA and cuDF to expand support for libraries that our users need.
Thank you. I suppose, for this repo, just removing the version constraint in pyproject.toml will help a lot. Currently, that constraint stops us from using Pandas 2 in our dask job at all, even if we don't use cuDF.
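To see which installed packages actually declare a requirement on pandas in a given environment, something like the following sketch may help. It uses only the standard library's `importlib.metadata`; the helper names `requirements_on` and `installed_pins` are hypothetical and not part of Dask-CUDA or cuDF, and the name matching is a simplified approximation of requirement parsing rather than a full parser.

```python
import re
from importlib import metadata


def _normalize(name):
    # Compare names case-insensitively, treating '-', '_' and '.' as equivalent.
    return re.sub(r"[-_.]+", "-", name).lower()


def requirements_on(target, requirement_strings):
    """Return the requirement strings whose distribution name is `target`."""
    matches = []
    for req in requirement_strings:
        # The distribution name is the leading run of letters, digits, '.', '_' or '-'.
        name = re.match(r"\s*([A-Za-z0-9._-]+)", req)
        if name and _normalize(name.group(1)) == _normalize(target):
            matches.append(req.strip())
    return matches


def installed_pins(target):
    """Map each installed distribution to its declared requirements on `target`."""
    pins = {}
    for dist in metadata.distributions():
        found = requirements_on(target, dist.requires or [])
        if found:
            pins[dist.metadata["Name"] or "<unknown>"] = found
    return pins


if __name__ == "__main__":
    # Lists every installed package that constrains pandas, with its specifier.
    for dist_name, reqs in sorted(installed_pins("pandas").items()):
        print(dist_name, "->", reqs)
```

Running this in the affected environment should show whether the pandas upper bound comes from dask-cuda itself or from another package.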
I don't necessarily oppose this, but I do have mixed feelings about it. On the one hand I understand your ask, but on the other, Dask-CUDA is primarily meant to be used with GPU libraries, which in this particular case implies cuDF. Removing the pin would loosely communicate "we support pandas 2 already", which is not true because we can't test it yet.
@galipremsagar @shwina @rjzamora @quasiben do you have thoughts on this? Perhaps the current cuDF pin to pandas<2 would suffice and we could unblock users who are in the situation described above?
In any case, the most recent plan is to have pandas 2 support in 24.04, which is due early April.
Thank you for your reply and thank you for considering the request.
> which in this case in particular implies cuDF
I am not sure that is the characterization everyone uses for Dask-CUDA currently, especially if you consider that cuDF is not even a listed dependency of Dask-CUDA. For example, we use Dask-CUDA only for LocalCUDACluster in our ML batch prediction workflows (we don't have cuDF installed in our environment), without using distributed data frames. Our workflows use pandas to do some lightweight preprocessing before distributing the workload, but the version limit in this repo restricts the pandas version in our environment. I think any version limits in this repo should be about the usage of pandas in this repo.
I think it's fine to rely on cudf's upper bound for pandas. dask-cuda users who aren't using cudf should be free to use newer versions of pandas if it works for them.
And this is actually now blocking cudf's ability to test our pandas 2 support with dask, so I'm going to go ahead and open a PR to lift this constraint. Let's hope using the latest pandas doesn't break any of dask-cuda's own tests!
> Let's hope using the latest pandas doesn't break any of dask-cuda's own tests!
No worries - I'll be happy to investigate anything that breaks :)
Thanks @vyasr for taking care of this during my absence.
This was resolved by #1308, closing.
Thank you everyone for such a prompt resolution.
Pandas 2.0.0 was released in April 2023. We should spend some effort to make this project compatible with the 2.y versions. Pandas 3 also has a dev release out, so maybe we can try for that as well now.