rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.27k stars 884 forks

[FEA] Nested types support #2857

Closed jlowe closed 3 years ago

jlowe commented 4 years ago

Is your feature request related to a problem? Please describe.
cudf columns should support compound data types (e.g., structs, lists).

Describe the solution you'd like
Using the same data layout as Arrow would be nice for compatibility. A struct would have child columns and a validity vector, so the struct itself can be null (a struct of null fields is semantically different from a null struct). A list would contain the standard validity vector, a data vector containing the concatenated data across all rows, and an offset vector. The offset vector indicates the start location of each row's list of data, so a row's list starts at its own offset and ends at the offset of the next row.
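To make the proposed layout concrete, here is a minimal pure-Python sketch of the list and struct columns described above. It is not libcudf code: the function names (`build_list_column`, `get_list_row`, `build_struct_column`) are hypothetical, validity is modeled as a plain list of booleans rather than a bitmask, and the offset vector is assumed to hold one extra trailing entry (as Arrow does) so that the last row also has an end offset.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_list_column(
    rows: List[Optional[List[Any]]],
) -> Tuple[List[Any], List[int], List[bool]]:
    """Flatten rows into (data, offsets, validity)."""
    data: List[Any] = []
    offsets: List[int] = [0]      # offsets[i] is where row i starts in data
    validity: List[bool] = []     # False marks a null list, not an empty one
    for row in rows:
        validity.append(row is not None)
        if row is not None:
            data.extend(row)
        offsets.append(len(data))  # also the end offset of the current row
    return data, offsets, validity


def get_list_row(
    data: List[Any], offsets: List[int], validity: List[bool], i: int
) -> Optional[List[Any]]:
    """Read row i back: it spans data[offsets[i]:offsets[i + 1]]."""
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]]


def build_struct_column(
    rows: List[Optional[Dict[str, Any]]], fields: List[str]
) -> Tuple[Dict[str, List[Any]], List[bool]]:
    """One child column per field, plus a struct-level validity vector."""
    validity = [row is not None for row in rows]
    children = {
        f: [None if row is None else row.get(f) for row in rows] for f in fields
    }
    return children, validity


# A null list (row 1) and an empty list (row 2) share the same offsets
# but differ in validity.
data, offsets, validity = build_list_column([[1, 2], None, [], [3, 4, 5]])

# Row 1 is a valid struct whose fields are all null; row 2 is a null struct.
children, struct_validity = build_struct_column(
    [{"a": 1, "b": "x"}, {"a": None, "b": None}, None], ["a", "b"]
)
```

In this sketch, only the validity vector tells a null list from an empty one, and only the struct-level validity tells a null struct from a struct of null fields. That distinction is the reason both vectors are requested.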

jrhemstad commented 4 years ago

I've changed the title since "compound" has a specific semantic meaning within libcudf++. Compound types refer to any type that has children, e.g., strings, dictionaries, nested, etc.

drabastomek commented 4 years ago

I cannot stress enough how I would love to see this...

revans2 commented 4 years ago

I would like to add that Spark has native support for maps. There has been some confusion in the Arrow documentation about maps, but generally they are represented as a list of key-value structs: List<Struct<Key, Value>>. The main reason I add this is that Parquet and ORC both support map types, and it would be good to have a "standard" representation that we can all agree on.
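As a rough illustration of the List<Struct<Key, Value>> representation, the sketch below stores a column of maps as one offset vector over two child columns (keys and values). This is plain Python, not Spark, Arrow, or libcudf code; `build_map_column` and `get_map_row` are hypothetical names, and validity is again a simple list of booleans.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_map_column(
    rows: List[Optional[Dict[Any, Any]]],
) -> Tuple[List[Any], List[Any], List[int], List[bool]]:
    """Store maps as a list column whose elements are (key, value) structs."""
    keys: List[Any] = []       # struct child column 0
    values: List[Any] = []     # struct child column 1
    offsets: List[int] = [0]   # list-level offsets into the struct children
    validity: List[bool] = []  # list-level validity: False marks a null map
    for row in rows:
        validity.append(row is not None)
        if row is not None:
            for k, v in row.items():
                keys.append(k)
                values.append(v)
        offsets.append(len(keys))
    return keys, values, offsets, validity


def get_map_row(
    keys: List[Any],
    values: List[Any],
    offsets: List[int],
    validity: List[bool],
    i: int,
) -> Optional[Dict[Any, Any]]:
    """Rebuild the map in row i from its slice of the key/value children."""
    if not validity[i]:
        return None
    start, end = offsets[i], offsets[i + 1]
    return dict(zip(keys[start:end], values[start:end]))


keys, values, map_offsets, map_validity = build_map_column(
    [{"a": 1, "b": 2}, None, {}, {"c": 3}]
)
```

Because a map here is just a list of structs, any list and struct support in the column layer would carry over to maps without a separate physical layout.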

BartleyR commented 4 years ago

This would also be useful for us for a number of our use cases, including cyBERT post-processing where we have to remove overlapping columns between rows (created as an artifact of the training/inference phase).

ntadimeti commented 4 years ago

Would love to have this feature.

pinireisman commented 4 years ago

This will be invaluable for us, as we use lists as elements in pandas dataframes a lot and would love to switch to cudf!

jrhemstad commented 3 years ago

Going to close this as libcudf now has both struct and list types. Support is not complete across all functions, but individual issues can be filed if specific functionality is missing.