tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

spark-tensorflow-connector: gzip codec ignored in latest master version (scala 2.12) #172

Open tekumara opened 3 years ago

tekumara commented 3 years ago

https://github.com/tensorflow/ecosystem/pull/131 introduced the codec option, e.g.:

    df.write.format("tfrecords").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)

This works using org.tensorflow:spark-tensorflow-connector_2.11:1.15.0 from maven central.

However, this setting is ignored when building from master, i.e. commit 77abfa1 (Scala 2.12 and Spark 3.0.0).
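One way to confirm whether the codec actually took effect is to inspect the written part files for the gzip magic bytes. A minimal sketch (the helper names and the `part-` filename convention are illustrative, not part of the connector):

```python
import os


def is_gzip(path: str) -> bool:
    """Return True if the file starts with the gzip magic number (0x1f 0x8b)."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def compressed_parts(output_dir: str) -> dict:
    """Map each Spark part file in output_dir to whether it is gzip-compressed."""
    return {
        name: is_gzip(os.path.join(output_dir, name))
        for name in os.listdir(output_dir)
        if name.startswith("part-")
    }
```

If the codec option is honored, every part file should report True; with the regression described above, the files come out uncompressed and report False.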

acastelli1 commented 3 years ago

Do we have any news about this?

TylerBrabham commented 3 years ago

+1 This is blocking me.

dx-xp-team commented 3 years ago

+1 Same for me. This is blocking.

acastelli1 commented 3 years ago

Any news on the matter? This is a bit annoying. It's happening for me when writing to Google Cloud Storage.

mNemlaghi commented 3 years ago

Just in case anyone is still blocked by this: I managed to work around this issue and write tfrecords with gzip compression on Spark 3.0.0 and Scala 2.12.10. I simply replaced the spark-tensorflow-connector jar with spark-tfrecord_2.12-0.3.0.jar. The latter seems to be based on the former, with one slight change in the code: the format name is tfrecord instead of tfrecords. This might work:

    df.write.format("tfrecord").option("recordType", "Example")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .mode(SaveMode.Overwrite)
      .save(path)