refactor code to calculate records per shard using n_volumes and number of shards

neuronets / nobrainer

A framework for developing neural network models for 3D image processing.

refactor code to calculate records per shard using n_volumes and number of shards #328

Open hvgazula opened 7 months ago

hvgazula commented 7 months ago

https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/dataset.py#L115-L122

If a shard contains a large number of volumes, the snippet above can be time-consuming. Alternatives:

1. Use n_volumes together with the number of files matching file_pattern to calculate len(first_shard).
2. Provide metadata: the number of volumes in each shard as well as the total number of volumes in the dataset.
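A minimal sketch of the first alternative, to make the arithmetic concrete. It is not nobrainer's implementation: the helper names `records_per_shard` and `records_in_first_shard` are hypothetical, and it assumes the shards were filled as evenly as possible, so that every shard except possibly the last holds ceil(n_volumes / n_shards) records. If the shards were written with a different fixed size, the shard count alone does not pin down the answer.

```python
import glob
import math


def records_per_shard(n_volumes: int, n_shards: int) -> int:
    """Estimate the number of records in every shard except the last.

    Assumption (not taken from nobrainer): shards are filled as evenly as
    possible, so each full shard holds ceil(n_volumes / n_shards) records.
    """
    if n_volumes < 1 or n_shards < 1:
        raise ValueError("n_volumes and n_shards must be positive")
    return math.ceil(n_volumes / n_shards)


def records_in_first_shard(n_volumes: int, file_pattern: str) -> int:
    """Count the shard files matching file_pattern, then apply the estimate.

    No record is read, so the cost does not grow with the shard size.
    """
    n_shards = len(glob.glob(file_pattern))
    if n_shards == 0:
        raise ValueError("no files match " + repr(file_pattern))
    return records_per_shard(n_volumes, n_shards)
```

For example, 10 volumes spread over 4 shard files gives an estimate of 3 records in the first shard (3, 3, 3, 1).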

hvgazula commented 7 months ago

Code for option 1: https://github.com/neuronets/nobrainer_training_scripts/blob/784c8668ae01356173faffbcf860bca458f46a73/1.2.0/create_tfshards.py#L307-L311. It should now work after https://github.com/neuronets/nobrainer/issues/329.

hvgazula commented 6 months ago

Ideally, if the tfrecords are created using the API with the aforementioned change, every shard except the last one contains the same number of records. Then, if n_volumes is not specified, it can be calculated using this function as num_records_first_shard * (num_shards - 1) + num_records_in_last_shard.
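The formula above can be sketched as a small helper. The name `total_volumes` and the validation checks are illustrative assumptions, not part of nobrainer. The sketch only does the arithmetic; obtaining the record counts of the first and last shard is left to the caller.

```python
def total_volumes(
    num_records_first_shard: int,
    num_shards: int,
    num_records_in_last_shard: int,
) -> int:
    """Compute n_volumes when all shards but the last have the same size."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 < num_records_in_last_shard <= num_records_first_shard:
        raise ValueError("last shard must be non-empty and no larger than the first")
    # (num_shards - 1) full shards plus the possibly smaller last shard.
    return num_records_first_shard * (num_shards - 1) + num_records_in_last_shard
```

For example, 4 shards with 3 records in the first and 1 in the last give 3 * 3 + 1 = 10 volumes.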