tarrade / proj_multilingual_text_classification

Explore multilingal text classification using embedding, bert and deep learning architecture
Apache License 2.0
4 stars 1 forks source link

What is so special about the tfrecord format? #51

Closed vluechinger closed 4 years ago

vluechinger commented 4 years ago

Tfrecord is Tensorflow's own binary storage format for data files (including sequence data). With this format, a better performance of up to 50% can be reached since the files require less space on the disk, less time to copy and can be read more efficiently. This is especially useful for memory intensive data and batch processing.

The data is stored as a sequence of binary strings. To convert any data type, follow these major steps (already implemented in src/...):

  1. convert all entries into either tf.train.BytesList, tf.train.FloatList, or tf.train.Int64List (and strings into bytes), then into tf.train.Feature (for each feature), create dictionary out of them
  2. store each sample in tf.train.Example or tf.train.SequenceExample
  3. serialize it
  4. use a tf.python_io.TFRecordWriter to write it to disk

Find more information in the Tensorflow documentation.