pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

column series cannot be written if it contains items that are lists of nested structs #10878

Open evbo opened 11 months ago

evbo commented 11 months ago


Reproducible example

The following schema reliably writes Parquet files containing the structure I want, but only with an unwanted "items" container, as I call it below:

lazy_static! {
    pub static ref SUB_FIELDS: [Field; 2] = [
        Field {
            name: SmartString::from("value"),
            dtype: DataType::Utf8
        },
        Field {
            name: SmartString::from("avg"),
            dtype: DataType::Float32
        }
    ];
    pub static ref SUB_DTYPE: DataType = DataType::Struct(SUB_FIELDS.to_vec());

    // TODO: UNWANTED CONTAINER
    pub static ref SUBS_FIELDS: [Field; 1] = [Field {
        name: SmartString::from("items"),
        dtype: DataType::List(Box::new(SUB_DTYPE.to_owned()))
    }];
    pub static ref SUBS_DTYPE: DataType = DataType::Struct(SUBS_FIELDS.to_vec());

    pub static ref PRIME_FIELDS: [Field; 5] = [
        Field {
            name: SmartString::from("value"),
            dtype: DataType::Utf8
        },
        Field {
            name: SmartString::from("min"),
            dtype: DataType::Float32
        },
        Field {
            name: SmartString::from("max"),
            dtype: DataType::Float32
        },
        Field {
            name: SmartString::from("avg"),
            dtype: DataType::Float32
        },
        Field {
            name: SmartString::from("subs"),
            dtype: SUBS_DTYPE.to_owned()
        }
    ];
    pub static ref PRIME_DTYPE: DataType = DataType::Struct(PRIME_FIELDS.to_vec());

    // TODO: UNWANTED CONTAINER
    pub static ref PRIMES_FIELDS: [Field; 1] = [Field {
        name: SmartString::from("items"),
        dtype: DataType::List(Box::new(PRIME_DTYPE.to_owned()))
    }];
    pub static ref PRIMES_DTYPE: DataType = DataType::Struct(PRIMES_FIELDS.to_vec());
}

Issue description

If I do not include the intermediary "container" dtype, I get completely null-valued results, as shown below:

┌───────────┬────────────────────────────┐
│ id        ┆ primes                     │
│ ---       ┆ ---                        │
│ str       ┆ struct[5]                  │
╞═══════════╪════════════════════════════╡
│ id1234    ┆ {null,null,null,null,null} │
└───────────┴────────────────────────────┘

thread 'data::data_tests::datas_tests::test_parquet' panicked at 'assertion failed: (left == right)
  left: [Utf8, Struct([Field { name: "value", dtype: Utf8 }, Field { name: "min", dtype: Float32 }, Field { name: "max", dtype: Float32 }, Field { name: "avg", dtype: Float32 }, Field { name: "subs", dtype: List(Null) }])],
 right: [Utf8, Struct([Field { name: "value", dtype: Utf8 }, Field { name: "min", dtype: Float32 }, Field { name: "max", dtype: Float32 }, Field { name: "avg", dtype: Float32 }, Field { name: "subs", dtype: List(Struct([Field { name: "value", dtype: Utf8 }, Field { name: "avg", dtype: Float32 }])) }])]

Notice that the expected schema is on the right, while the left shows the actually generated schema, with a very strange List(Null) type for subs. Also notice that all fields are null, even though the data I provided is never null.

Yes, I'm trying to use polars for a series of lists of structs of structs (I know, that's bad...), but here's why I think it's important: when migrating a legacy solution (grandma's recipe), you often don't have the freedom to reshape the dataset into a flat columnar format. First you must lift all existing business logic as-is, which inevitably leads to awkward data structures shaped by however the original pipeline code used god classes and assorted config. By supporting the full richness of Parquet's datatypes, polars enables a gradual migration toward less nested structures.

Expected behavior

I would expect non-null results. Currently that can only be achieved by including the "items" container in my schema, which produces a schema like the following (from dbg!(&df.dtypes())):

[
    Utf8,
    Struct(
        [
            Field {
                name: "items",
                dtype: List(
                    Struct(
                        [
                            Field {
                                name: "value",
                                dtype: Utf8,
                            },
                            Field {
                                name: "min",
                                dtype: Float32,
                            },
                            Field {
                                name: "max",
                                dtype: Float32,
                            },
                            Field {
                                name: "avg",
                                dtype: Float32,
                            },
                            Field {
                                name: "subs",
                                dtype: Struct(
                                    [
                                        Field {
                                            name: "items",
                                            dtype: List(
                                                Struct(
                                                    [
                                                        Field {
                                                            name: "value",
                                                            dtype: Utf8,
                                                        },
                                                        Field {
                                                            name: "avg",
                                                            dtype: Float32,
                                                        },
                                                    ],
                                                ),
                                            ),
                                        },
                                    ],
                                ),
                            },
                        ],
                    ),
                ),
            },
        ],
    ),
]

However, these extra "items" containers aren't desirable, as they only add complexity to an already complex structure.

Installed versions

polars = { version = "0.32.1", features = ["lazy", "parquet", "dtype-struct", "strings"]}
jhorstmann commented 11 months ago

This looks a bit like the reverse of what I reported in jorgecarleitao/arrow2#1556, but on the writing side. Are you seeing the NULL values when reading back with polars/arrow2? It would be interesting to read the file with another tool, for example parquet-read from arrow-rs, to verify whether the issue is on the reading or the writing side.

evbo commented 11 months ago

@jhorstmann great question: yes, I read the Polars Parquet output with both Apache Spark and PyPolars, both of which have debug-printing features for verifying dataframe contents and schema.

Also, for debugging's sake, I successfully wrote the parquet files with Apache Spark without the workaround above (the "items" struct containing the underlying array). So I'm highly confident it's a Polars write bug/feature/hiccup.

Thanks! I didn't expect this issue to get responded to so quickly since it's a rather obscure use-case. I appreciate this community a lot!