rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.28k stars 884 forks source link

[ENH]: serialization schema cleanup #10799

Open wence- opened 2 years ago

wence- commented 2 years ago

Followup from #10784. Hyphens and underscores are used inconsistently when separating names in metadata keys in serialize; go through and standardise on one choice (hyphens seem more popular).

wence- commented 2 years ago

Going through and doing the minimum thing to add frame_count slots to all serializable objects, a further thought occurred which is that as well as lack of consistency in key names, there's also a lack of consistency in the metadata schema.

Some of the metadata slots (in particular type-serialized) are picked to align with what dask names things; as far as I can tell this is not the case for others, and doesn't really need to be, but cc @jakirkham whom git-annotate suggests might know.

Sometimes, properties of nested objects are copied in to the parent header, and sometimes not, I think it makes sense to clean up and have a model of:

{"type-serialized": my-type,
  "properties": {attribute: value, nested_object: nested_stuff},
  "frame-count": number_of_frames_consumed_in_deserialization}

Perhaps something like this was considered and rejected?

A much larger change would be to set up all of the serializable objects as dataclass-like things with constructors that just set attributes, then the schema for serialization is completely clear and lots of this code can be removed. A downside here is that serialization is not as extensible in an ad-hoc manner, and I am not sure that all cudf classes can get away with default simple constructors.

bdice commented 2 years ago

A small consideration that might tip the balance: using underscores makes it map more directly to valid Python identifiers, e.g.

# Using underscores in the keys:
frame_count = metadata["frame_count"]
type_serialized = metadata["type_serialized"]
# vs. mapping dashes to underscores:
frame_count = metadata["frame-count"]
type_serialized = metadata["type-serialized"]
vyasr commented 2 years ago

I would definitely welcome more insight from a Dask expert. Some thoughts and questions:

It might help to consider whether we could change our classes so that only Frame and BaseIndex are Serializable. ColumnBase would still need its current implementation, but we could simplify dtype and Buffer logic. We shouldn't need non-API classes to conform to the dask.distributed API, and it leads to some incongruities:

jakirkham commented 2 years ago

Dask uses -s

Not sure I follow what else is being proposed here

vyasr commented 2 years ago

@jakirkham I think the two main questions for you are:

jakirkham commented 2 years ago

Yes "type-serialized" matters. It is a special field in Dask

When adding support for cuDF serialization, we found all sorts of objects went over the wire. Any we missed supporting surfaced as errors in benchmarks. So we added them all

I think what I'm missing is what we are trying to fix here

wence- commented 2 years ago

I would definitely welcome more insight from a Dask expert. Some thoughts and questions:

  • Are you proposing that nested headers are always copied into their parents? Does that mean that we always duplicate data?

On the contrary. I'm proposing:

header = {"properties": {}}
frames = []
sub_obj = self.child # object we're serializing has a child to be serialized
sub_header, sub_frames = sub_obj.serialize()
header["properties"]["child"] = sub_header
frames.extend(sub_frames)

At the moment, depending on the particular object, in serialization, sometimes this is done, sometimes some information is carried redundantly in the header itself. Moreover, if adding a new slot to the "top-level" header key space, one has to read (or know) non-local code to know whether there are any reserved keys. For example Serializable.device_serialize (which calls the class-implemented serialize) overwrites "type-serialized", "is-cuda", and "lengths", Serializable.host_serialize (called via pickle) additionally overwrites "writeable".

  • How would nesting work when nested objects that aren't buffers actually return frames? In particular, our CategoricalDtype serializes into both header information and a frame consisting of the categories (just curious if you've considered how this impacts your proposal, because I haven't).

This is not problematic. Deserialization takes a (nested) metadata descriptor and a list of frames and returns a deserialized object and a (partially) consumed list of frames. So a helper function:

def unpack(header, frames):
    typ = pickle.loads(header["type-serialized"])
    count = header["frame_count"]
    obj = typ.deserialize(header, frames[:count])
    return obj, frames[count:]

works to unfold the part of a nested definition. So suppose we were deserializing a column with a categorical dtype:

dtype_header = header["properties"]["dtype"]
dtype, frames = unpack(dtype_header, frames)
# continue with deserialization of other properties
  • You're right that simple dataclass constructors won't work for most of our classes like DataFrame or Series. Those must be constructible from a wide range of objects to match what pandas supports.

One way to square that circle (though it is a big API-breaking change) is to split the munging of data for __init__ into a free function. That is the API offers:

def DataFrame(args):
    # munge args
    processed_args = ...(args)
    return impl.DataFrame(processed_args)

~It may well not be worth it, however.~

EDIT: that's not possible due to API constraints (as pointed out below by @shwina).

It might help to consider whether we could change our classes so that only Frame and BaseIndex are Serializable. ColumnBase would still need its current implementation, but we could simplify dtype and Buffer logic. We shouldn't need non-API classes to conform to the dask.distributed API, and it leads to some incongruities:

  • dtypes include "frames" when serializing even though they don't actually have frames of data. The exception is categoricals, which are tricky because they are the only dtype that stores data on device (the categories). We would probably need a special case to support those.

The advantage of everything supporting the same interface is you don't need to do any special-casing. You just recurse calling serialize until the base case is hit. If you don't have this then any dtype-carrying object that needs to be serialized has to if isinstance(dtype, CategoricalDtype) I think.

  • Buffers ultimately just insert "self" into the frames, so every Column could really just insert its underlying buffer into the frames list and call that good. Buffer is an implementation detail of cudf and shouldn't need to conform to the dask API. We really just use the CAI protocol to save and load them anyway. That would remove one additional serialize implementation.

I think this would work, since the wire format is to effectively send all the frames out of band and the reconstruct on the other end. The column metadata can include enough information to rebuild/validate the buffer.

wence- commented 2 years ago

I think what I'm missing is what we are trying to fix here

Initially, I was adding support for serialization that was missing on struct columns (that was #10784). As part of that, the schema for the metadata headers seemed a bit inconsistent. So I am looking at if it is worth investing effort in cleaning that up a bit.

shwina commented 2 years ago

Just a drive-by comment here: DataFrame needs to be a type (as does Index). Unfortunately, we cannot make those into factory free-functions.

wence- commented 2 years ago

Just a drive-by comment here: DataFrame needs to be a type (as does Index). Unfortunately, we cannot make those into factory free-functions.

I'm guessing because code basically relies on isinstance(foo, cudf.DataFrame) and the like?

shwina commented 2 years ago

Yes -- and also there are classmethods defined on cudf.DataFrame that are in the public API; e.g., cudf.DataFrame.from_pandas().

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

wence- commented 1 year ago

I think anything we do here will need to be in tandem with proposed serialisation changes in dask/distributed that are being contemplated. So I'll revisit this then.

vyasr commented 4 months ago

@wence- has anything changed in dask since the last comment to move the needle here?

wence- commented 1 month ago

I'm a bit out of the loop. I think that they moved to allowing pickle/unpickle. But that doesn't fundamentally change things (since that works with gpu-backed data but necessitates a device-host transfer).