tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Parallel shard writer #856

Open meyerzinn opened 5 years ago

meyerzinn commented 5 years ago

Currently, we write all shards sequentially to disk when creating TFRecord files from a downloaded dataset. This makes it very slow for datasets like quickdraw_bitmap, which is only ~36GB to download but takes my work system (on an HPC cluster) about 14 hours to fully prepare (at 1000 records/second).

Could _TFRecordWriter (https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/tfrecords_writer.py#L38) use multiprocessing to write the shards simultaneously, so that the process is bound by I/O rather than CPU?
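For illustration, a minimal sketch of what parallel shard writing could look like with multiprocessing (the job format and helper names are invented for this example, not part of the _TFRecordWriter API):

```python
import multiprocessing

import tensorflow as tf


def _write_shard(job):
    # Each worker owns exactly one output file, so no synchronization is needed.
    shard_path, serialized_examples = job
    with tf.io.TFRecordWriter(shard_path) as writer:
        for example in serialized_examples:
            writer.write(example)
    return shard_path


def write_all_shards(shard_jobs, num_workers=8):
    # shard_jobs: iterable of (shard_path, list of serialized example bytes).
    # In practice each worker would read its examples itself rather than
    # receiving them pickled through the pool.
    with multiprocessing.Pool(num_workers) as pool:
        for done in pool.imap_unordered(_write_shard, shard_jobs):
            print("finished", done)
```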

meyerzinn commented 5 years ago

Hm, it seems like it was not actually writing during the progress bar that took 4 hours. What comes between download/extraction and writing?

pierrot0 commented 5 years ago

Hi Meyer, thanks for reporting.

Which version of the dataset were you using? If you weren't specifying any, then it's the legacy code. If '3.*.*', then it's tfrecords_writer.py (the S3 experiment, which will eventually become the default).

The process is roughly the following:

  1. download
  2. extract if it's an archive (or read from the archive directly without extracting)
  3. For each split (see the sketch after this list):
     3.a. generate the records and write them to 1K temporary buckets based on hash(key).
     3.b. for each bucket: sort the bucket in memory (by hash(key)), then write the records into the final tfrecords files.
  4. (optional) Compute statistics over the generated dataset (so read it all).
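For illustration only, a rough sketch of the bucket-then-sort scheme in step 3; the names (_hash_key, split_into_buckets, write_sorted_buckets, write_record) are made up and do not mirror tfrecords_writer.py:

```python
import hashlib

NUM_BUCKETS = 1000


def _hash_key(key):
    # Stable hash of the example key, used both for bucketing and sorting.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


def split_into_buckets(keyed_examples):
    # Step 3.a: route each (key, serialized_example) pair into one of 1K buckets.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, example in keyed_examples:
        buckets[_hash_key(key) % NUM_BUCKETS].append((key, example))
    return buckets


def write_sorted_buckets(buckets, write_record):
    # Step 3.b: sort each bucket in memory by hash(key), then hand the records,
    # in order, to whatever writes the final tfrecord shards (write_record is a
    # placeholder for that).
    for bucket in buckets:
        bucket.sort(key=lambda kv: _hash_key(kv[0]))
        for _, example in bucket:
            write_record(example)
```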

In this case there is no extraction part. The legacy code follows more or less the same pattern, except it writes round-robin to shards and then sorts the shards by hash(example).

If you have the logs, it would be interesting to look into them. If you did step 4, then that step is indeed just reading the data.

Step 3.b. should be relatively fast: we are just moving bits (plus a bit of sorting). I would suspect step 3.a to be the slowest, and then step 4 to be slow as well. Step 4 is optional and there is some work underway to replace it with tfdv, so I'm not addressing it here.

It should be straightforward to parallelize the writing to the temporary buckets. We just need to make sure only one write happens on each bucket at a time and that we don't open too many files simultaneously. We will discuss that tomorrow; there might be a way to address this issue sooner rather than later.
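One possible shape for this (a hypothetical sketch, not the planned TFDS implementation): route each example to a worker that owns a disjoint subset of the 1K buckets, so a given bucket file is only ever written by one process and each process keeps a bounded number of files open.

```python
import multiprocessing

NUM_BUCKETS = 1000
NUM_WORKERS = 16


def bucket_writer(worker_id, queue):
    # This worker only ever receives buckets where bucket_id % NUM_WORKERS == worker_id,
    # so it is the sole writer of those files and opens at most NUM_BUCKETS / NUM_WORKERS of them.
    files = {}
    try:
        while True:
            item = queue.get()
            if item is None:  # sentinel: producer is done
                break
            bucket_id, payload = item
            if bucket_id not in files:
                files[bucket_id] = open(f"bucket_{bucket_id:04d}.tmp", "ab")
            files[bucket_id].write(payload)
    finally:
        for f in files.values():
            f.close()


def write_buckets_in_parallel(keyed_examples, bucket_of):
    # keyed_examples yields (key, serialized_bytes); bucket_of maps a key to a bucket id.
    queues = [multiprocessing.Queue(maxsize=1024) for _ in range(NUM_WORKERS)]
    procs = [
        multiprocessing.Process(target=bucket_writer, args=(i, queues[i]))
        for i in range(NUM_WORKERS)
    ]
    for p in procs:
        p.start()
    for key, payload in keyed_examples:
        bucket_id = bucket_of(key)
        queues[bucket_id % NUM_WORKERS].put((bucket_id, payload))
    for q in queues:
        q.put(None)
    for p in procs:
        p.join()
```

Routing by bucket_id % NUM_WORKERS preserves the single-writer-per-bucket invariant without any locking, and the file-handle count per process stays bounded.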