Closed: vgod-dbx closed this pull request 5 years ago.
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!") and we'll verify it.
ℹ️ Googlers: Go here for more info.
I signed it!
@skavulya Mind doing a code review?
@jhseu Sure, I'll review it. Thanks!
@vgod-dbx Thank you so much for the contribution. It looks good. Please add a description and example usage of the codec option to the README under the features section before merge.
@skavulya README updated! Thanks for the review.
@vgod-dbx Thanks! Looks great. @jhseu The PR is ready for merge.
@vgod-dbx What version did this make it into? I'm on 1.13.1 and it seems to ignore the codec option...
Hi, it seems that the codec option is actually being ignored.
With https://github.com/tensorflow/ecosystem/pull/125, it became possible to output gzipped TFRecords by setting `spark.hadoop.mapreduce.output.fileoutputformat.compress` in the global SparkConf. However, there was no way to enable compression for individual DataFrame outputs only.

This PR adds a new option, `codec`, to the Spark-TensorFlow connector for enabling compression on an individual DataFrameWriter. With this, we no longer need to set `spark.hadoop.mapreduce.output.fileoutputformat.compress` globally.

Sample usage:
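A minimal sketch of how the per-writer option might be used, assuming the `tfrecords` format id and the Hadoop `GzipCodec` class name commonly shown in the connector's README (the exact option names here are illustrative, not taken verbatim from this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-example").getOrCreate()
val df = spark.range(10).toDF("value")

// Enable gzip compression for this one writer via the new `codec` option,
// instead of setting mapreduce.output.fileoutputformat.compress globally.
df.write
  .format("tfrecords")
  .option("recordType", "Example")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("sample-tfrecords")
```

Other DataFrames written in the same session are unaffected, since the compression setting no longer lives in the global Hadoop configuration.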