wence- opened this issue 2 years ago
Going through and doing the minimum thing to add `frame_count` slots to all serializable objects, a further thought occurred: as well as the lack of consistency in key names, there is also a lack of consistency in the metadata schema.

Some of the metadata slots (in particular `type-serialized`) are picked to align with what dask names things; as far as I can tell this is not the case for the others, and it doesn't really need to be, but cc @jakirkham, whom `git annotate` suggests might know.
Sometimes properties of nested objects are copied into the parent header, and sometimes not. I think it makes sense to clean up and settle on a model of:

```python
{
    "type-serialized": my_type,
    "properties": {attribute: value, nested_object: nested_stuff},
    "frame-count": number_of_frames_consumed_in_deserialization,
}
```
Perhaps something like this was considered and rejected?
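To make the proposed layout concrete, here is a minimal sketch of a toy object serializing itself into that schema. `Toy` and its `payload-length` property are illustrative inventions, not cudf classes; only the three top-level keys come from the model above.

```python
import pickle

class Toy:
    """Hypothetical serializable object following the proposed header model."""

    def __init__(self, payload):
        self.payload = payload

    def serialize(self):
        frames = [self.payload]
        header = {
            # Type needed to find the matching deserialize() on the other end.
            "type-serialized": pickle.dumps(type(self)),
            # All object-specific metadata lives under one key.
            "properties": {"payload-length": len(self.payload)},
            # How many frames this object consumes when deserializing.
            "frame-count": len(frames),
        }
        return header, frames

header, frames = Toy(b"abc").serialize()
```

The point of the uniform shape is that generic code can walk `"properties"` and `"frame-count"` without knowing anything about the concrete type.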
A much larger change would be to set up all of the serializable objects as `dataclass`-like things with constructors that just set attributes; then the schema for serialization is completely clear and lots of this code can be removed. A downside is that serialization is not as extensible in an ad hoc manner, and I am not sure that all cudf classes can get away with simple default constructors.
A small consideration that might tip the balance: using underscores makes the keys map directly to valid Python identifiers, e.g.

```python
# Using underscores in the keys:
frame_count = metadata["frame_count"]
type_serialized = metadata["type_serialized"]
# vs. mapping dashes to underscores:
frame_count = metadata["frame-count"]
type_serialized = metadata["type-serialized"]
```
I would definitely welcome more insight from a Dask expert. Some thoughts and questions:
- Are you proposing that nested headers are always copied into their parents? Does that mean that we always duplicate data?
- `CategoricalDtype` serializes into both header information and a frame consisting of the categories (just curious if you've considered how this impacts your proposal, because I haven't).
- Simple `dataclass` constructors won't work for most of our classes like `DataFrame` or `Series`. Those must be constructible from a wide range of objects to match what pandas supports.
- It might help to consider whether we could change our classes so that only `Frame` and `BaseIndex` are `Serializable`. `ColumnBase` would still need its current implementation, but we could simplify dtype and `Buffer` logic. We shouldn't need non-API classes to conform to the `dask.distributed` API, and it leads to some incongruities:
  - dtypes include "frames" when serializing even though they don't actually have frames of data. The exception is categoricals, which are tricky because they are the only dtype that stores data on device (the categories). We would probably need a special case to support those.
  - Buffers ultimately just insert "self" into the frames, so every Column could really just insert its underlying buffer into the frames list and call that good. `Buffer` is an implementation detail of cudf and shouldn't need to conform to the dask API. We really just use the CAI protocol to save and load them anyway. That would remove one additional `serialize` implementation.
- On hyphens vs. underscores: Dask uses `-`s.

Not sure I follow what else is being proposed here.
@jakirkham I think the two main questions for you are: does `"type-serialized"` matter to Dask, and do we need `cudf.Buffer` or each dtype to support the protocols? Of course, we need a way to serialize that information, i.e. we need to know whether a Series contains strings or ints, but here we're just talking about the necessary APIs for each type of object.

Yes, `"type-serialized"` matters. It is a special field in Dask.
When adding support for cuDF serialization, we found all sorts of objects went over the wire. Any we missed supporting surfaced as errors in benchmarks. So we added them all.

I think what I'm missing is what we are trying to fix here.
> I would definitely welcome more insight from a Dask expert. Some thoughts and questions:
>
> - Are you proposing that nested headers are always copied into their parents? Does that mean that we always duplicate data?
On the contrary. I'm proposing:
```python
header = {"properties": {}}
frames = []
# The object we're serializing has a child that must itself be serialized:
sub_obj = self.child
sub_header, sub_frames = sub_obj.serialize()
header["properties"]["child"] = sub_header
frames.extend(sub_frames)
```
At the moment, depending on the particular object, sometimes serialization does this and sometimes some information is carried redundantly in the header itself. Moreover, when adding a new slot to the "top-level" header key space, one has to read (or know) non-local code to find out whether there are any reserved keys. For example, `Serializable.device_serialize` (which calls the class-implemented `serialize`) overwrites `"type-serialized"`, `"is-cuda"`, and `"lengths"`, and `Serializable.host_serialize` (called via pickle) additionally overwrites `"writeable"`.
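To make that hazard concrete, here is a toy mimic of the wrapper pattern described above. The function name echoes `device_serialize` but this is not the real cudf implementation: it just shows how a subclass `serialize` that happens to use a reserved key is silently clobbered.

```python
import pickle

def device_serialize(obj):
    """Toy stand-in for the wrapper: calls the class's serialize(),
    then unconditionally writes its own reserved header keys."""
    header, frames = obj.serialize()
    header["type-serialized"] = pickle.dumps(type(obj))
    header["is-cuda"] = [False] * len(frames)
    header["lengths"] = [len(f) for f in frames]
    return header, frames

class Careless:
    def serialize(self):
        # Innocently reuses a reserved key for its own purposes...
        return {"lengths": "my-own-meaning"}, [b"data"]

header, frames = device_serialize(Careless())
# ...and the wrapper has silently replaced its value with frame lengths.
```

A namespaced schema (everything class-specific under `"properties"`) would make such collisions impossible by construction.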
> - How would nesting work when nested objects that aren't buffers actually return frames? In particular, our `CategoricalDtype` serializes into both header information and a frame consisting of the categories (just curious if you've considered how this impacts your proposal, because I haven't).
This is not problematic. Deserialization takes a (nested) metadata descriptor and a list of frames and returns a deserialized object and a (partially) consumed list of frames. So a helper function:
```python
def unpack(header, frames):
    typ = pickle.loads(header["type-serialized"])
    count = header["frame-count"]
    obj = typ.deserialize(header, frames[:count])
    return obj, frames[count:]
```
works to unfold one part of a nested definition. So suppose we were deserializing a column with a categorical dtype:
```python
dtype_header = header["properties"]["dtype"]
dtype, frames = unpack(dtype_header, frames)
# continue with deserialization of other properties
```
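The pattern above can be exercised end to end with toy classes (the `Leaf` class here is illustrative only): each object records its `"frame-count"`, the frames are concatenated into one flat list, and `unpack` peels them off in order.

```python
import pickle

class Leaf:
    """Toy serializable object: one header plus exactly one frame."""

    def __init__(self, data):
        self.data = data

    def serialize(self):
        header = {"type-serialized": pickle.dumps(type(self)), "frame-count": 1}
        return header, [self.data]

    @classmethod
    def deserialize(cls, header, frames):
        return cls(frames[0])

def unpack(header, frames):
    # Consume exactly the frames this object owns; return the rest.
    typ = pickle.loads(header["type-serialized"])
    count = header["frame-count"]
    return typ.deserialize(header, frames[:count]), frames[count:]

# Two objects serialized into one flat frame list, then unpacked in order.
h1, f1 = Leaf(b"cats").serialize()
h2, f2 = Leaf(b"dogs").serialize()
frames = f1 + f2
first, frames = unpack(h1, frames)
second, frames = unpack(h2, frames)
```

This is why a dtype that contributes frames (like a categorical) is unproblematic: it simply declares a nonzero `"frame-count"` and the walk stays mechanical.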
> - You're right that simple `dataclass` constructors won't work for most of our classes like `DataFrame` or `Series`. Those must be constructible from a wide range of objects to match what pandas supports.
One way to square that circle (though it is a big API-breaking change) is to split the munging of data out of `__init__` into a free function. That is, the API offers:

```python
def DataFrame(args):
    # munge args
    processed_args = ...(args)
    return impl.DataFrame(processed_args)
```
~It may well not be worth it, however.~
EDIT: that's not possible due to API constraints (as pointed out below by @shwina).
> - It might help to consider whether we could change our classes so that only `Frame` and `BaseIndex` are `Serializable`. `ColumnBase` would still need its current implementation, but we could simplify dtype and `Buffer` logic. We shouldn't need non-API classes to conform to the `dask.distributed` API, and it leads to some incongruities:
> - dtypes include "frames" when serializing even though they don't actually have frames of data. The exception is categoricals, which are tricky because they are the only dtype that stores data on device (the categories). We would probably need a special case to support those.
The advantage of everything supporting the same interface is that you don't need any special-casing: you just recurse, calling `serialize` until the base case is hit. Without that, any dtype-carrying object that needs to be serialized has to do an `if isinstance(dtype, CategoricalDtype)` check, I think.
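A minimal sketch of that contrast, using made-up stand-in dtypes (`FakeNumericDtype`, `FakeCategoricalDtype` are not cudf classes): with a uniform `serialize` interface the caller never branches on the dtype's concrete type, even though one dtype carries a frame and the other doesn't.

```python
import pickle

class FakeNumericDtype:
    """Stand-in for a dtype with no device data: zero frames."""

    def serialize(self):
        header = {"type-serialized": pickle.dumps(type(self)), "frame-count": 0}
        return header, []

class FakeCategoricalDtype:
    """Stand-in for a categorical dtype: its categories are a frame."""

    def __init__(self, categories):
        self.categories = categories

    def serialize(self):
        header = {"type-serialized": pickle.dumps(type(self)), "frame-count": 1}
        return header, [self.categories]

def serialize_column_dtype(dtype):
    # Uniform interface: no isinstance special-casing, just recurse/delegate.
    return dtype.serialize()

_, cat_frames = serialize_column_dtype(FakeCategoricalDtype(b"abc"))
_, num_frames = serialize_column_dtype(FakeNumericDtype())
```

Dropping `Serializable` from dtypes would push an `isinstance` branch like this into every caller that might encounter a categorical.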
> - Buffers ultimately just insert "self" into the frames, so every Column could really just insert its underlying buffer into the frames list and call that good. `Buffer` is an implementation detail of cudf and shouldn't need to conform to the dask API. We really just use the CAI protocol to save and load them anyway. That would remove one additional `serialize` implementation.
I think this would work, since the wire format is to effectively send all the frames out of band and then reconstruct on the other end. The column metadata can include enough information to rebuild/validate the buffer.
> I think what I'm missing is what we are trying to fix here.
Initially, I was adding support for serialization that was missing on struct columns (that was #10784). As part of that, the schema for the metadata headers seemed a bit inconsistent, so I am looking at whether it is worth investing effort in cleaning it up a bit.
Just a drive-by comment here: `DataFrame` needs to be a type (as does `Index`). Unfortunately, we cannot make those into factory free-functions.
> Just a drive-by comment here: `DataFrame` needs to be a type (as does `Index`). Unfortunately, we cannot make those into factory free-functions.
I'm guessing because code basically relies on `isinstance(foo, cudf.DataFrame)` and the like?
Yes -- and also there are classmethods defined on `cudf.DataFrame` that are in the public API; e.g., `cudf.DataFrame.from_pandas()`.
I think anything we do here will need to be in tandem with proposed serialisation changes in dask/distributed that are being contemplated. So I'll revisit this then.
@wence- has anything changed in dask since the last comment to move the needle here?
I'm a bit out of the loop. I think that they moved to allowing pickle/unpickle. But that doesn't fundamentally change things (since that works with gpu-backed data but necessitates a device-host transfer).
Followup from #10784. Hyphens and underscores are used inconsistently when separating names in metadata keys in `serialize`; go through and standardise on one choice (hyphens seem more popular).