Closed: vgod-dbx closed this pull request 5 years ago.
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!") and we'll verify it.
ℹ️ Googlers: Go here for more info.
I signed it!
@skavulya Mind doing a code review?
@jhseu Sure, I'll review it. Thanks!
@vgod-dbx Thank you so much for the contribution. It looks good. Please add a description and example usage of the codec option to the README under the features section before merge.
@skavulya README updated! Thanks for the review.
@vgod-dbx Thanks! Looks great. @jhseu The PR is ready for merge.
@vgod-dbx What version did this make it into? I'm on 1.13.1 and it seems to ignore the codec option...
Hi, it seems that the codec option is actually being ignored.
With https://github.com/tensorflow/ecosystem/pull/125, it became possible to output gzipped TFRecords by setting `spark.hadoop.mapreduce.output.fileoutputformat.compress` in the global SparkConf. However, there was no way to enable compression for individual DataFrame outputs only.

This PR adds a new option, `codec`, to the Spark-TensorFlow connector for enabling compression on an individual DataFrameWriter. With this, we no longer need to set `spark.hadoop.mapreduce.output.fileoutputformat.compress` globally.

Sample usage:
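A minimal sketch of how the per-writer option might be used, assuming the `tfrecords` format id and the Hadoop `GzipCodec` class name commonly shown in the connector's README (the exact option names here are illustrative, not taken verbatim from this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-example").getOrCreate()
val df = spark.range(10).toDF("value")

// Enable gzip compression for this one writer via the new `codec` option,
// instead of setting mapreduce.output.fileoutputformat.compress globally.
df.write
  .format("tfrecords")
  .option("recordType", "Example")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("sample-tfrecords")
```

Other DataFrames written in the same session are unaffected, since the compression setting no longer lives in the global Hadoop configuration.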