Open · theelderbeever opened this issue 4 months ago
Also, a write/read round trip with use_pyarrow=True is equivalent:
import polars as pl

# `data` is the nested payload from the original report (not shown here)
df = pl.DataFrame(data)
df.write_parquet("test.parquet", use_pyarrow=True)
df == pl.read_parquet("test.parquet")
| plan |
| --- |
| bool |
|------|
| true |
| true |
This is fixed in 0.20.31
@cmdlineluser I completely didn't catch that there was a release just 2 days ago... Just upgraded.
@cmdlineluser Still broken for read operations when the internal values are Decimals AND some other type.
import polars as pl
from decimal import Decimal

print(pl.__version__)
pl.Config.activate_decimals(True)
df = pl.DataFrame(
[
{
"tiers": [
{
"in_tier": 10.0,
"overage_cents": Decimal("0E-12"),
},
{
"in_tier": 0.0,
"overage_cents": Decimal("0E-12"),
},
]
},
{
"tiers": [
{
"in_tier": 10.0,
"overage_cents": Decimal("0.001000000000"),
}
]
},
]
)
print(df.schema)
df.write_parquet("tiers.parquet")
pl.read_parquet("tiers.parquet")
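Worth noting why the schema reports scale=12: Python's Decimal preserves the exponent of the literal it was parsed from, and polars appears to infer the column scale from that exponent (the inference rule is an assumption based on the printed schema; the snippet below is stdlib-only):

```python
from decimal import Decimal

# "0E-12" is zero with twelve fractional digits; the exponent survives parsing.
zero = Decimal("0E-12")
print(zero)                      # 0E-12
print(zero.as_tuple().exponent)  # -12

# The non-zero value in the example shares the same scale, which matches the
# Decimal(precision=None, scale=12) that polars reports for the column.
rate = Decimal("0.001000000000")
print(rate.as_tuple().exponent)  # -12

# Despite the exponent, the value is still numerically zero.
print(zero == Decimal(0))        # True
```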
Additionally, the decimal values inside the struct aren't being written to or read from the file. Passing use_pyarrow=True during the write correctly writes the decimal values.
import polars as pl
from decimal import Decimal

print(pl.__version__)
pl.Config.activate_decimals(True)
df = pl.DataFrame(
[
{
"tiers": [
{
# "in_tier": 10.0,
"overage_cents": Decimal("0E-12"),
},
{
# "in_tier": 0.0,
"overage_cents": Decimal("0E-12"),
},
]
},
{
"tiers": [
{
# "in_tier": 10.0,
"overage_cents": Decimal("0.001000000000"),
}
]
},
]
)
print(df.schema)
print(df)
df.write_parquet("tiers.parquet")
print(pl.read_parquet("tiers.parquet"))
"""
0.20.31
OrderedDict([('tiers', List(Struct({'overage_cents': Decimal(precision=None, scale=12)})))])
| tiers |
| --- |
| list[struct[1]] |
|--------------------------------------|
| [{0.000000000000}, {0.000000000000}] |
| [{0.001000000000}] |
| tiers |
| --- |
| list[struct[1]] |
|-----------------|
"""
D'oh - apologies.
Just for reference, the previous report was
(But wasn't decimal related.)
@cmdlineluser no worries. Want me to open a separate issue for decimals specifically?
Checks
Marked this as a Python bug since that is where I encountered it; however, I would expect the same bug to exist in Rust.
Reproducible example
Minimal reproducible example that I could figure out. Removing ANY row/field, or unnesting the top-level struct, results in a successful write.
Table
Log output
Issue description
I am attempting to write out a Parquet file of data that I fetched from the Stripe API. The API JSON response is extremely nested. When writing the data structure in the example, the write fails due to a differing number of children. If use_pyarrow=True is set, then the write succeeds. From trial and error it seems to very specifically require a column which is a struct containing a struct field and a list field. Any values deeper than col.struct.{struct,list} don't appear to affect the outcome, and the list can in fact be empty and it will still fail.

Expected behavior

The DataFrame should write to Parquet successfully.
Installed versions