related-sciences / ensembl-genes

Extract the Ensembl genes catalog to simple tables
Other
17 stars 4 forks source link

Use `pyarrow` instead of `fastparquet` to write parquet data #17

Closed ravwojdyla closed 2 years ago

ravwojdyla commented 2 years ago

pyarrow is the default pandas parquet engine, it also by default works better across the ecosystem (including pyspark). Specifically genes.snappy.parquet data can't by read by pyspark 3.2.0, due to:

org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)) at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)

Btw fastparquet has a spark compatible mode for timestamps times="int96".

Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2:

nanosecond resolution times: the new extended “logical” types system supports nanoseconds alongside the previous millis and micros. We now emit these for the default pandas time type, and produce full parquet schema including both “converted” and “logical” type information. Note that all output has isAdjustedToUTC=True, i.e., these are timestamps rather than local time. The time-zone is stored in the metadata, as before, and will be successfully recreated only in fastparquet and (py)arrow. Otherwise, the times will appear to be UTC. For compatibility with Spark, you may still want to use times="int96" when writing.

dhimmel commented 2 years ago

Thanks for reporting this issue and the PR!

IIRC I selected fastparquet over pyarrow since it seemed lighter weight as per this comment:

fastparquet library was only about 1.1mb, while pyarrow library was 176mb

Switching to pyarrow is adding about 30 seconds to build times as per https://github.com/related-sciences/ensembl-genes/pull/18/files#r805887127. It also gave me confidence that fastparquet was part of the dask GitHub organization.

I agree compatibility is paramount for the parquet outputs from this repo. One solution would be to add times="int96" with fastparquet, but I'm guessing that there might be future issues like this. @ravwojdyla is that your reasoning for switching to pyarrow... that it is more likely to be more compatible in the future?

ravwojdyla commented 2 years ago

@dhimmel I personally trust pyarrow more, it also seems to have sounder defaults + as you have mentioned there might be other issues.

dhimmel commented 2 years ago

Okay, rerunning exports with the pyarrow engine for pandas.DataFrame.to_parquet in this build.