Closed jlowe closed 3 years ago
I've changed the title since "compound" has a specific semantic meaning within libcudf++. Compound types refer to any type that has children, e.g., strings, dictionaries, nested, etc.
I cannot stress enough how I would love to see this...
I would like to add that Spark has native support for maps. There has been some confusion in the Arrow documentation about maps, but generally they are represented as a List of Key, Value structs. List<Struct<Key, Value>>
The main reason I add this is because parquet and orc both support map types and it would be good to have a "standard" representation that we can all agree on.
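To make the List<Struct<Key, Value>> idea concrete, here is a minimal pure-Python sketch of how a column of maps could be flattened into an offsets vector plus two child columns. The function names `encode_map_column` and `decode_map_row` are hypothetical and for illustration only; this is not the libcudf, Arrow, or Spark API.

```python
from typing import Any, Dict, List, Tuple


def encode_map_column(rows: List[Dict[str, Any]]) -> Tuple[List[int], List[str], List[Any]]:
    """Flatten a column of maps into offsets plus key/value child columns.

    Hypothetical helper: each map becomes a list of (key, value) structs,
    and all structs are concatenated across rows.
    """
    offsets = [0]
    keys: List[str] = []
    values: List[Any] = []
    for row in rows:
        for k, v in row.items():  # dicts preserve insertion order
            keys.append(k)
            values.append(v)
        offsets.append(len(keys))  # end of this row, start of the next
    return offsets, keys, values


def decode_map_row(offsets: List[int], keys: List[str], values: List[Any], i: int) -> Dict[str, Any]:
    """Rebuild the map stored at row i from the flattened layout."""
    start, end = offsets[i], offsets[i + 1]
    return dict(zip(keys[start:end], values[start:end]))


map_rows = [{"a": 1, "b": 2}, {}, {"c": 3}]
map_offsets, map_keys, map_values = encode_map_column(map_rows)
```

With this layout a map column needs nothing beyond list and struct support, which is why agreeing on one representation would let the Parquet and ORC readers share it.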
This would also be useful for us for a number of our use cases, including cyBERT post-processing where we have to remove overlapping columns between rows (created as an artifact of the training/inference phase).
Would love to have this feature.
This will be invaluable for us, as we use lists as elements in pandas dataframes a lot and would love to switch to cudf!
Going to close this as libcudf now has both struct and list types. Support is not complete across all functions, but individual issues can be filed if specific functionality is missing.
Is your feature request related to a problem? Please describe.
cudf columns should support compound data types (e.g., structs, lists).
Describe the solution you'd like
Using the same data layout as Arrow would be nice for compatibility. A struct would have child columns and a validity vector (so the struct itself can be null, since a struct of null fields is semantically different from a null struct). A list would contain the standard validity vector, a data vector containing the concatenated data across all rows, and an offset vector. The offset vector indicates the start location of each row's list of data, so a row's list starts at its own offset and ends at the offset of the next row. For the last row to have an end as well, the offset vector needs one more entry than there are rows, with the final entry equal to the total length of the data vector.
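The list and struct layouts described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`encode_list_column`, `decode_list_row`, `encode_struct_column`) are hypothetical, the field names `a` and `b` are made up, and validity is stored as a list of booleans rather than a packed bitmask.

```python
from typing import Any, Dict, List, Optional, Tuple


def encode_list_column(rows: List[Optional[List[int]]]) -> Tuple[List[int], List[int], List[bool]]:
    """Flatten rows into (data, offsets, validity).

    A null row contributes no data, so its start and end offsets are equal.
    """
    data: List[int] = []
    offsets = [0]
    validity: List[bool] = []
    for row in rows:
        validity.append(row is not None)
        if row is not None:
            data.extend(row)
        offsets.append(len(data))  # one more offset than rows
    return data, offsets, validity


def decode_list_row(data: List[int], offsets: List[int], validity: List[bool], i: int) -> Optional[List[int]]:
    """Return row i, or None if the validity vector marks it null."""
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]]


def encode_struct_column(rows: List[Optional[Dict[str, Any]]], fields: List[str]) -> Tuple[Dict[str, List[Any]], List[bool]]:
    """Split struct rows into one child column per field plus a struct-level validity vector.

    Child nulls are represented as None here for brevity.
    """
    validity = [row is not None for row in rows]
    children = {f: [None if row is None else row.get(f) for row in rows] for f in fields}
    return children, validity


list_rows = [[1, 2], None, [], [3, 4, 5]]
data, offsets, validity = encode_list_column(list_rows)

struct_rows = [{"a": 1, "b": "x"}, None, {"a": None, "b": None}]
children, struct_validity = encode_struct_column(struct_rows, ["a", "b"])
```

Note that the null list row and the empty list row have identical offsets and differ only in the validity vector. Likewise, the null struct and the struct whose fields are all null have identical child values and are distinguished only by the struct-level validity.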