meyerzinn opened 5 years ago
Hm, it seems like it was not actually writing during the progress bar that took 4 hours. What comes between download/extraction and writing?
Hi Meyer, thanks for reporting.
Which version of the dataset were you using? If you weren't specifying any, then it's the legacy code. If '3..', then it's tfrecords_writer.py (the S3 experiment, which will eventually be the default).
The process is roughly the following:
In this case there is no extraction part. The legacy code follows more or less the same pattern, except it writes round-robin to shards and then sorts the shards by hash(example).
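As a rough illustration of that legacy pattern (hypothetical names and a stand-in hash, not the actual tfds internals), round-robin sharding followed by a per-shard hash sort might look like:

```python
import hashlib


def shard_round_robin(examples, num_shards):
    """Distribute serialized examples across shards round-robin, then
    sort each shard by a stable hash of the example bytes.

    Sketch only: the real tfds code uses its own hashing and shard
    bookkeeping; this just shows the shape of the two-phase approach.
    """
    shards = [[] for _ in range(num_shards)]
    for i, example in enumerate(examples):
        # Round-robin assignment: example i goes to shard i mod num_shards.
        shards[i % num_shards].append(example)
    for shard in shards:
        # Sort within each shard by a deterministic hash of the bytes.
        shard.sort(key=lambda ex: hashlib.sha256(ex).hexdigest())
    return shards
```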
If you have the logs, it would be interesting to look into them. If you did step 4, then this step is indeed just reading the data.
Step 3.b should be relatively fast; we are just moving bits (plus a bit of sorting). I would suspect step 3.a to be the slowest, and step 4 to be slow as well. Step 4 is optional and there is some work underway to replace it with tfdv, so I'm not addressing it here.
It should be straightforward to parallelize the writing to temporary buckets. We just need to make sure only one write happens on each bucket at a time and that we don't open too many files simultaneously. We will discuss that tomorrow; there might be a way to address this issue sooner rather than later.
Currently, we write all shards sequentially to disk when creating TFRecord files from a downloaded dataset. This makes it very slow for datasets like `quickdraw_bitmap`, which is only ~36GB to download but takes my work system (on an HPC cluster) about 14 hours to fully prepare (at 1000 records/second). Could `_TFRecordWriter` (https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/tfrecords_writer.py#L38) use multiprocessing to simultaneously write to the shards and ensure the process is blocking on I/O and not CPU?