tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

add new `codec` option for compression in Spark-Tensorflow connector #131

Closed by vgod-dbx 5 years ago

vgod-dbx commented 5 years ago

With https://github.com/tensorflow/ecosystem/pull/125, it became possible to write gzipped TFRecords by setting `spark.hadoop.mapreduce.output.fileoutputformat.compress` in the global SparkConf. However, there is no way to enable compression for only an individual DataFrame output.

This PR adds a new `codec` option to the Spark-TensorFlow connector that enables compression for an individual DataFrameWriter. With it, we no longer need to set `spark.hadoop.mapreduce.output.fileoutputformat.compress` globally.

Sample usage:

(
  dataframe
  .write
  .format('tfrecords')
  .option('codec', 'org.apache.hadoop.io.compress.GzipCodec')
  .save('sample.tfrecord.gz')
)
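
To make the per-output nature of the option concrete, here is a minimal sketch of a wrapper that applies `codec` only when a caller asks for it. The helper name `write_tfrecords` and the constant `GZIP_CODEC` are hypothetical and not part of the connector; the sketch assumes the connector is on the classpath and that `df` is a Spark DataFrame.

```python
# Hypothetical helper (not part of the connector): opt in to compression
# per DataFrame output instead of through the global SparkConf.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def write_tfrecords(df, path, codec=None):
    """Write `df` as TFRecords, compressing only when `codec` is given."""
    writer = df.write.format("tfrecords")
    if codec is not None:
        # The `codec` option is the one introduced by this PR.
        writer = writer.option("codec", codec)
    writer.save(path)
```

With this shape, `write_tfrecords(df, 'plain.tfrecord')` and `write_tfrecords(df, 'sample.tfrecord.gz', GZIP_CODEC)` can coexist in the same Spark session without touching any global setting.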

googlebot commented 5 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


vgod-dbx commented 5 years ago

I signed it!

googlebot commented 5 years ago

CLAs look good, thanks!

jhseu commented 5 years ago

@skavulya Mind doing a code review?

skavulya commented 5 years ago

@jhseu Sure, I'll review it. Thanks!

skavulya commented 5 years ago

@vgod-dbx Thank you so much for the contribution. It looks good. Please add a description and example usage of the `codec` option to the README under the features section before merging.

vgod-dbx commented 5 years ago

@skavulya README updated! Thanks for the review.

skavulya commented 5 years ago

@vgod-dbx Thanks! Looks great. @jhseu The PR is ready to merge.

eggie5 commented 4 years ago

@vgod-dbx What version did this make it into? I'm on 1.13.1 and it seems to ignore the `codec` option...

acastelli1 commented 3 years ago

Hi, it seems that the `codec` option is actually ignored.
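
One way to check whether the option took effect is to look at the first two bytes of each output file, since gzip streams always start with `0x1f 0x8b`. The sketch below assumes the output has been copied to a local directory and that files whose names start with `_` or `.` (such as `_SUCCESS` markers or checksum files) are not data files; the function names are hypothetical.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gzip_report(output_dir):
    """Map each data file in `output_dir` to whether it looks gzip-compressed."""
    report = {}
    for name in sorted(os.listdir(output_dir)):
        full = os.path.join(output_dir, name)
        # Skip marker and hidden files; they are assumed not to hold records.
        if name.startswith(("_", ".")) or not os.path.isfile(full):
            continue
        report[name] = is_gzipped(full)
    return report
```

If every value in the returned dictionary is `False` after writing with the `codec` option set, the option was most likely ignored by the connector version in use.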