Closed ravwojdyla closed 2 years ago
Thanks for reporting this issue and the PR!
IIRC I selected fastparquet over pyarrow since it seemed lighter weight, as per this comment:
"fastparquet library was only about 1.1mb, while pyarrow library was 176mb"
Switching to pyarrow is adding about 30 seconds to build times as per https://github.com/related-sciences/ensembl-genes/pull/18/files#r805887127. It also gave me confidence that fastparquet was part of the dask GitHub organization.
I agree compatibility is paramount for the parquet outputs from this repo. One solution would be to add times="int96" with fastparquet, but I'm guessing that there might be future issues like this. @ravwojdyla is that your reasoning for switching to pyarrow... that it is more likely to be more compatible in the future?
@dhimmel I personally trust pyarrow more, it also seems to have sounder defaults + as you have mentioned there might be other issues.
Okay, rerunning exports with the pyarrow engine for pandas.DataFrame.to_parquet in this build.
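For reference, here is a minimal sketch of the two options discussed in this thread: writing with the pyarrow engine, or keeping fastparquet and passing times="int96". The helper names (to_parquet_kwargs, export) are hypothetical and not taken from this repo. The sketch assumes that pandas forwards extra keyword arguments from DataFrame.to_parquet to the selected engine's writer, which is how times="int96" would reach fastparquet.

```python
def to_parquet_kwargs(engine: str) -> dict:
    """Build keyword arguments for pandas.DataFrame.to_parquet.

    Hypothetical helper for illustration; not part of this repo.
    """
    if engine == "pyarrow":
        # pyarrow needs no extra timestamp option in this sketch.
        return {"engine": "pyarrow", "compression": "snappy"}
    if engine == "fastparquet":
        # times="int96" is fastparquet's spark compatible timestamp mode;
        # pandas is assumed to pass it through to fastparquet's writer.
        return {"engine": "fastparquet", "compression": "snappy", "times": "int96"}
    raise ValueError(f"unsupported parquet engine: {engine!r}")


def export(df, path: str, engine: str = "pyarrow") -> None:
    """Write a pandas DataFrame to parquet with the chosen engine."""
    df.to_parquet(path, **to_parquet_kwargs(engine))
```

Calling export(df, "genes.snappy.parquet") would use pyarrow, while export(df, "genes.snappy.parquet", engine="fastparquet") would keep fastparquet with int96 timestamps. Either way the chosen engine has to be installed in the build environment.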
pyarrow is the default pandas parquet engine, and by default it also works better across the ecosystem (including pyspark). Specifically, genes.snappy.parquet data can't be read by pyspark 3.2.0, due to:
Btw fastparquet has a spark compatible mode for timestamps, times="int96".
Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2: