conderls closed this issue 1 year ago
Well, in your Pandas DataFrame, column d is not a Long. Check what the type of d is and use the matching Scala type.
the error message shows that it failed to decode the seq field, but not the d field. And I have tried, and successfully decoded, the following data:
pd.DataFrame([[1, "a"]], columns=["d", "s"]).to_parquet("/tmp/data.parquet")
// successfully print parsed result:
TestItem(1,a)
It failed with this data:
// data generated with pandas
pd.DataFrame([[["a", "b"]]], columns=["s"]).to_parquet("/tmp/data.parquet")
// scala
case class TestItem(s: Seq[String])
Seq((1L, "a", Seq("a", "b"))).toDF("d", "s", "seq").write.parquet("file:///tmp/data_spark")
Parquet4s can read this data successfully, so there may be some mismatch between how pandas and Scala read/write Parquet.
Ah, yes. I misread the logs. Pandas encapsulated the elements of "seq" within yet another object with a field named "item". If you are using the latest version of Parquet4s and the issue persists, then it means that Parquet4s doesn't recognise that format of lists. So either propose a fix to Parquet4s, or add an encapsulating case class around the items of "seq".
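A minimal sketch of the second workaround, assuming Parquet4s maps case-class field names onto nested Parquet fields. The wrapper name `SeqItem` is hypothetical; only its field name `item` matters, since it has to match the legacy pyarrow element name:

```scala
// Hypothetical wrapper: its single field is named "item" to match the
// inner field name that legacy pyarrow writes for list elements.
case class SeqItem(item: String)
case class TestItem(d: Long, s: String, seq: Seq[SeqItem])

// After reading, unwrap back to a plain Seq[String]
val record = TestItem(1L, "a", Seq(SeqItem("a"), SeqItem("b")))
println(record.seq.map(_.item).mkString(",")) // prints "a,b"
```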
# schema of data generated by pandas (pyarrow < 13.0.0; upgrade to >= 13.0.0 to keep the schema consistent with Spark)
schema: org.apache.parquet.schema.MessageType =
message schema {
  optional int64 d;
  optional binary s (UTF8);
  optional group seq (LIST) {
    repeated group list {
      optional binary item (UTF8);
    }
  }
}
# schema of data generated by spark
schema: org.apache.parquet.schema.MessageType =
message spark_schema {
  required int64 d;
  optional binary s (UTF8);
  optional group seq (LIST) {
    repeated group list {
      optional binary element (UTF8);   <--- "element" vs "item"
    }
  }
}
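To make the difference concrete, the inner element name can be pulled out of a printed schema with a tiny hypothetical helper; this is just string matching on the schema dumps above, not a Parquet4s API:

```scala
// Hypothetical helper: extract the inner field name of a LIST group from a
// printed Parquet schema ("item" for legacy pyarrow, "element" for Spark).
def listElementName(schema: String): Option[String] =
  "repeated group list \\{\\s*optional binary (\\w+)".r
    .findFirstMatchIn(schema)
    .map(_.group(1))

val pandasList = "repeated group list { optional binary item (UTF8); }"
val sparkList  = "repeated group list { optional binary element (UTF8); }"
println(listElementName(pandasList)) // Some(item)
println(listElementName(sparkList))  // Some(element)
```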
I finally found the solution in the [pyarrow docs](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table): pass the option `use_compliant_nested_type=True` when saving data to parquet with pandas:
`df.to_parquet("/tmp/data.parquet", use_compliant_nested_type=True)`
`use_compliant_nested_type` defaults to `True` since pyarrow 13.0.0, so one should either set the option to `True` explicitly or upgrade to pyarrow >= 13.0.0.
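The version gate described above can be sketched as a small helper (the function name is hypothetical; the logic just encodes the 13.0.0 default change, written in Scala to match the rest of the thread):

```scala
// Hypothetical helper: given an installed pyarrow version string, decide
// whether use_compliant_nested_type=True must be passed explicitly.
// pyarrow < 13.0.0 defaults the option to False (legacy "item" naming);
// pyarrow >= 13.0.0 defaults it to True (Spark-compatible "element" naming).
def needsExplicitCompliantFlag(pyarrowVersion: String): Boolean =
  pyarrowVersion.split("\\.")(0).toInt < 13

println(needsExplicitCompliantFlag("12.0.1")) // true
println(needsExplicitCompliantFlag("13.0.0")) // false
```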
Legacy pyarrow nested types will also be supported in Parquet4s 2.14.0.
I created some test data with pandas 2.0.3 on Python 3.7, and parquet4s failed to read it, raising a decode error that seems to come from decoding the list data. If the test data is written with com.github.mjakubowski84.parquet4s.ParquetWriter instead, it works perfectly.