tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

Add tfrecord gzip compression #125

Closed. fhoering closed this pull request 5 years ago.

fhoering commented 5 years ago

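Tested in Python with the two scripts below, first on the uncompressed output and then on the GZIP-compressed output (the ZLIB option is not supported).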

# Script 1: read the uncompressed output (TensorFlow 1.x API).
import tensorflow as tf
specs = {
    'data': tf.VarLenFeature(dtype=tf.int64)
}
dataset = tf.data.TFRecordDataset(["tf-records/part-m-00000"], compression_type="")
dataset = dataset.map(
    lambda x: tf.parse_single_example(x, features=specs)
)
next_tfr = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    try:
        for i in range(10):
            print(sess.run(next_tfr))
    except tf.errors.OutOfRangeError:
        pass

# Script 2: read the GZIP-compressed output (TensorFlow 1.x API).
import tensorflow as tf
specs = {
    'data': tf.VarLenFeature(dtype=tf.int64)
}
dataset = tf.data.TFRecordDataset(["tf-records/part-m-00000.gz"], compression_type="GZIP")
dataset = dataset.map(
    lambda x: tf.parse_single_example(x, features=specs)
)
next_tfr = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    try:
        for i in range(10):
            print(sess.run(next_tfr))
    except tf.errors.OutOfRangeError:
        pass
googlebot commented 5 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


jhseu commented 5 years ago

Mind handling the CLA?

googlebot commented 5 years ago

CLAs look good, thanks!

jhseu commented 5 years ago

Have you tested the contents of the TFRecord files? The test seems only to parse them.

fhoering commented 5 years ago

I changed the test again so that it runs against a fixed dataset. I also compared the TFRecord content with what the Python code reads.
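
For illustration only, a content check of the kind discussed above could look like the sketch below. It is not the test from this pull request. It assumes the documented TFRecord framing (little-endian length, masked CRC-32C of the length, payload, masked CRC-32C of the payload) and assumes that GZIP compression wraps the whole file stream. The helper names (`crc32c`, `masked_crc32c`, `write_records`, `read_records`) and the fixed dataset are hypothetical, and the payloads are raw bytes rather than serialized `tf.train.Example` protos, so the sketch needs only the standard library.

```python
import gzip
import os
import struct
import tempfile


def _make_crc32c_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    # Plain table-driven CRC-32C over a bytes object.
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    # TFRecord masks each CRC: rotate right by 15 bits, then add a constant.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_records(path, records):
    # Write payloads with TFRecord framing into one gzip stream (assumed layout).
    with gzip.open(path, "wb") as f:
        for payload in records:
            header = struct.pack("<Q", len(payload))
            f.write(header)
            f.write(struct.pack("<I", masked_crc32c(header)))
            f.write(payload)
            f.write(struct.pack("<I", masked_crc32c(payload)))


def read_records(path):
    # Read the payloads back, verifying both checksums of every record.
    records = []
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            (length_crc,) = struct.unpack("<I", f.read(4))
            if length_crc != masked_crc32c(header):
                raise ValueError("corrupted length field")
            payload = f.read(length)
            (data_crc,) = struct.unpack("<I", f.read(4))
            if data_crc != masked_crc32c(payload):
                raise ValueError("corrupted record payload")
            records.append(payload)
    return records


# Hypothetical fixed dataset: compare what was written with what is read back.
fixed_dataset = [b"first record", b"", b"\x00\x01\x02 third record"]
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "part-m-00000.gz")
    write_records(gz_path, fixed_dataset)
    round_tripped = read_records(gz_path)

assert round_tripped == fixed_dataset
```

Comparing the decoded payloads with a known dataset, rather than only checking that parsing succeeds, is what distinguishes a content test from a parse-only test.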