hvgazula opened this issue 7 months ago
@satra how about saving the indices in the filename itself? Something like kwyk-train-{00000..00150}.tfrecord, kwyk-train-{00151..00300}.tfrecord, and so on.
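Just to make the naming concrete, a minimal sketch (the helper name and the zero-padding width are arbitrary choices, not a proposal for the actual API):

```python
# sketch: name each shard by the inclusive range of example indices it covers
def shard_filename(start, stop, prefix="kwyk-train"):
    return f"{prefix}-{{{start:05d}..{stop:05d}}}.tfrecord"

shard_filename(0, 150)    # 'kwyk-train-{00000..00150}.tfrecord'
shard_filename(151, 300)  # 'kwyk-train-{00151..00300}.tfrecord'
```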
sharded representations mean filenames won't carry appropriate indices. there is a default shard size included, but it can be overridden. https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/dataset.py#L155
I think I understand your idea of "shard", but just to make sure: do you agree that the "shards" created by the API are merely the globbed files (no randomness), split into groups of 300 (aka shard_size) using array_split, and then serialized sequentially? If you agree with that description, then the sharded representations can be tweaked to carry the appropriate indices. I gave you an example using 150, but more generally, the following snippet (in tfrecord.py):
```python
n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(features_labels, n_shards)
```
would be replaced with

```python
n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(list(enumerate(features_labels)), n_shards)
```
where the first element of the first and last items in each shard gives the appropriate indices for the filename. These indices are tied to the shard_size specified at creation time, so there is no loss of generality.
PS: the enumerate snippet above is only for demo purposes.
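Putting it together, a rough end-to-end sketch of the index bookkeeping (the file list and counts below are made up for illustration):

```python
import math

import numpy as np

# demo data: pretend features_labels is a flat list of volume paths
features_labels = [f"vol-{i:04d}.nii.gz" for i in range(450)]
examples_per_shard = 150

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)

# carry the global example index alongside each entry before splitting
shards = np.array_split(list(enumerate(features_labels)), n_shards)

for shard in shards:
    first_idx = int(shard[0][0])   # index of the first example in this shard
    last_idx = int(shard[-1][0])   # index of the last example in this shard
    print(f"kwyk-train-{{{first_idx:05d}..{last_idx:05d}}}.tfrecord")
```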
yes, shards break a binary data stream into accessible pieces without changing the overall structure.
however, nobrainer has a notion of volumes and blocks. if you break a volume into blocks, what matters from the dataset perspective is not the volume index but the block index. hence, len(filenames) is less important than len(blocks).
i'm still not seeing why we want to stick semantics in the filename when the same information can be stored internally as metadata and accessed directly through the tfrecord.
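for example (the shapes and counts here are illustrative only):

```python
import numpy as np

# a 256^3 volume split into non-overlapping 128^3 blocks yields 8 blocks per volume
volume_shape = (256, 256, 256)
block_shape = (128, 128, 128)
n_volumes = 150

blocks_per_volume = int(np.prod([v // b for v, b in zip(volume_shape, block_shape)]))  # 8
n_blocks = n_volumes * blocks_per_volume  # 1200 examples from the dataset's point of view
```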
The only problem with this approach is that the count is tied to the original dataset. That is, if I want to use a subset of the dataset for testing purposes, I have to create the shards from scratch again. Nevertheless, I will go ahead and add the full data count (and optionally the number of volumes in each shard).
just create another dataset for now. yes, in the ideal world (an MVP+1 problem), we would be able to select any subset for train/eval from a dataset or have something that trims a dataset.
We decided to add an extra feature to each record/example labeled "data_count". While we do this, we also need to add logic to adjust the number of volumes in each epoch (in case drop_remainder is set to True during batching). This is also important because the Bayesian MeshNet requires the number of examples upfront. See https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/models/bayesian_meshnet.py#L20
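A rough sketch of what the "data_count" feature and the epoch adjustment could look like (the other feature keys and function names here are placeholders, not the final schema):

```python
import tensorflow as tf

# sketch: attach the dataset-wide example count to every serialized record;
# "data_count" matches the proposal above, the other keys are placeholders.
def to_example(volume_bytes, label_bytes, data_count):
    feature = {
        "volume": tf.train.Feature(bytes_list=tf.train.BytesList(value=[volume_bytes])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_bytes])),
        "data_count": tf.train.Feature(int64_list=tf.train.Int64List(value=[data_count])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# sketch: volumes actually seen per epoch once batching is applied
def volumes_per_epoch(data_count, batch_size, drop_remainder=True):
    if drop_remainder:
        return (data_count // batch_size) * batch_size  # incomplete final batch is dropped
    return data_count
```

The value read back from "data_count" (adjusted as above) is what would then be handed to the Bayesian model as its number of examples.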