Open ohinds opened 10 months ago
100MB doesn't make sense on fast disk systems like we have on openmind, or for brain imaging data. I believe we have played with TB-sized shards as well. I would make this a user-controllable parameter.
Well, the default currently produces tfrecord file sizes of about 20MB, so that makes even less sense. I'm suggesting an automatically determined default, with the facility for people to override it if they want something else.
Also, specifying a shard size in bytes makes way more sense than number of examples, as it currently is.
According to the tensorflow user guide, tfrecords files should be ~100MB (https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/performance/overview.md). When tfrecords datasets are constructed from files, the shard size could be automatically computed to follow this guidance.
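A rough sketch of how the automatic default could be computed, assuming the per-example size can be approximated from the input file sizes on disk (compressed inputs will serialize larger, so this is only an estimate). The names `examples_per_shard`, `num_shards`, `sizes_on_disk`, and `target_shard_bytes` are hypothetical and not part of the current API:

```python
import math
import os

# ~100MB default, following the guidance cited above; users could override it.
DEFAULT_TARGET_SHARD_BYTES = 100 * 1024 * 1024


def sizes_on_disk(paths):
    """Return the on-disk size in bytes of each input file."""
    return [os.path.getsize(p) for p in paths]


def examples_per_shard(example_sizes_bytes,
                       target_shard_bytes=DEFAULT_TARGET_SHARD_BYTES):
    """Estimate how many examples fit in a shard of about target_shard_bytes."""
    if not example_sizes_bytes:
        raise ValueError("need at least one example size")
    if target_shard_bytes <= 0:
        raise ValueError("target_shard_bytes must be positive")
    # Guard against a zero mean (e.g. all input files empty).
    mean_size = max(sum(example_sizes_bytes) / len(example_sizes_bytes), 1)
    # Always write at least one example per shard, even if a single
    # example is larger than the target.
    return max(1, int(target_shard_bytes // mean_size))


def num_shards(n_examples, per_shard):
    """Number of shards needed to hold n_examples."""
    return math.ceil(n_examples / per_shard)
```

For example, ten inputs of about 20MB each with the default target would give `examples_per_shard(...) == 5` and two shards. Passing a larger `target_shard_bytes` would cover the fast-disk and large-volume cases mentioned above.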