rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

read_pickle support #16048

Open weidinger-c opened 3 months ago

weidinger-c commented 3 months ago

cuDF currently has no IO support for pickle. I need this function, but calling it currently fails with the error:

AttributeError: module 'cudf' has no attribute 'read_pickle'

lithomas1 commented 3 months ago

Just to clarify, are you looking for this feature in cudf itself, or did you need this feature using something like cudf.pandas? (also, since you're planning on reading pickles, do you also want support for to_pickle?)

If you're just looking to pickle cudf objects, you can do this manually using the pickle module, e.g.

import pickle

import cudf
from cudf.testing import assert_frame_equal

a = cudf.DataFrame({"a": [1, 2, 3]})

# Write to pickle (the context manager closes the file handle)
with open("cdf.pkl", "wb") as f:
    pickle.dump(a, f)

# Read from pickle
with open("cdf.pkl", "rb") as f:
    pickled_a = pickle.load(f)

# Confirm the round trip preserved the frame
assert_frame_equal(a, pickled_a)

weidinger-c commented 3 months ago

Thanks for the reply, I know that there is a dedicated pickle module. I just wanted to compare my code without any code changes as I thought that cuDF has feature parity with pandas df.

lithomas1 commented 3 months ago

> Thanks for the reply, I know that there is a dedicated pickle module. I just wanted to compare my code without any code changes as I thought that cuDF has feature parity with pandas df.

Thanks for clarifying.

You might want to try cudf.pandas if you'd like to use cudf with zero code change from pandas. (Although there is also an issue with read_pickle there https://github.com/rapidsai/cudf/issues/15459)
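For readers unfamiliar with the zero-code-change mode mentioned here, the following is a minimal sketch of turning it on from inside a script. The helper name `enable_cudf_pandas` is hypothetical and not part of cuDF; only `cudf.pandas.install()` is the library's own call. It assumes the accelerator must be installed before pandas is first imported, and it falls back to plain pandas when cuDF is not installed.

```python
import importlib.util


def enable_cudf_pandas() -> bool:
    """Hypothetical helper: install the cudf.pandas accelerator if cuDF is available.

    Returns True when the accelerator was installed, False when cuDF is not
    installed and the script will run on plain pandas instead.
    """
    # find_spec only checks whether the package can be located; it does not import it.
    if importlib.util.find_spec("cudf") is None:
        return False
    import cudf.pandas

    cudf.pandas.install()
    return True


# Call this before the first `import pandas` in the program.
accelerated = enable_cudf_pandas()
print("cudf.pandas enabled:", accelerated)

# The rest of the script then uses pandas unchanged, for example:
#   import pandas as pd
#   df = pd.DataFrame({"a": [1, 2, 3]})
```

The same mode can also be enabled without touching the script by running it as `python -m cudf.pandas script.py`. Per the linked issue, `read_pickle` may still misbehave in this mode.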

wence- commented 3 months ago

> I just wanted to compare my code without any code changes as I thought that cuDF has feature parity with pandas df.

Mostly, but not completely. Other than the API compatibility is there some aspect of pandas.read_pickle that is not supported by plain pickle.load?

weidinger-c commented 3 months ago

> > I just wanted to compare my code without any code changes as I thought that cuDF has feature parity with pandas df.
>
> Mostly, but not completely. Other than the API compatibility is there some aspect of pandas.read_pickle that is not supported by plain pickle.load?

No, at least nothing I am aware of. As I said, I just wanted to try out and test my lib with cudf with the least possible effort to see if it brings any performance gains.
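While the issue is open, the manual approach suggested earlier in the thread can be wrapped in two small helpers so that calling code keeps the familiar names. This is only a sketch: `read_pickle` and `to_pickle` below are hypothetical stand-ins, not cuDF functions. Unlike `pandas.read_pickle`, they do not handle compressed files or remote storage. The demonstration uses a plain dict so it runs without a GPU; per the thread, a `cudf.DataFrame` should round-trip the same way.

```python
import os
import pickle
import tempfile


def to_pickle(obj, path):
    """Hypothetical stand-in: serialize any picklable object to `path`."""
    with open(path, "wb") as f:
        # pandas' to_pickle also defaults to the highest available protocol.
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def read_pickle(path):
    """Hypothetical stand-in: load an object previously written with pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a plain dict in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
original = {"a": [1, 2, 3]}
to_pickle(original, demo_path)
restored = read_pickle(demo_path)
assert restored == original
```

As with any use of `pickle`, only load files from sources you trust, since unpickling can execute arbitrary code.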