hvgazula opened this issue 7 months ago
@satra how about saving the indices in the filename itself? Something like kwyk-train-{00000..00150}.tfrecord, kwyk-train-{00151..00300}.tfrecord, and so on.
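Just to make the naming concrete, a minimal sketch (the helper name and the zero-padding width are arbitrary choices, not a proposal for the actual API):

```python
# sketch: name each shard by the inclusive range of example indices it covers
def shard_filename(start, stop, prefix="kwyk-train"):
    return f"{prefix}-{{{start:05d}..{stop:05d}}}.tfrecord"

shard_filename(0, 150)    # 'kwyk-train-{00000..00150}.tfrecord'
shard_filename(151, 300)  # 'kwyk-train-{00151..00300}.tfrecord'
```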
sharded representations mean filenames won't carry appropriate indices. there is a default shard size included, but it can be overridden. https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/dataset.py#L155
I think I understand your idea of "shard", but just to make sure: do you agree that the "shards" created by the API are merely the globbed files (no randomness), split into groups of 300 (aka shard_size) using array_split, and then serialized sequentially? If you agree with that description, then the sharded representations can be tweaked to carry the appropriate indices. I gave you an example using 150, but more generally, the following snippet (in tfrecord.py):
```python
n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(features_labels, n_shards)
```
would be replaced with

```python
n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(list(enumerate(features_labels)), n_shards)
```
where the first element of the first and last items in each shard gives the appropriate indices for the filename. These indices are tied to the shard_size specified at creation time, so there is no loss of generality.
PS: the enumerate snippet above is only for demo purposes.
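Putting it together, a rough end-to-end sketch of the index bookkeeping (the file list and counts below are made up for illustration):

```python
import math

import numpy as np

# demo data: pretend features_labels is a flat list of volume paths
features_labels = [f"vol-{i:04d}.nii.gz" for i in range(450)]
examples_per_shard = 150

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)

# carry the global example index alongside each entry before splitting
shards = np.array_split(list(enumerate(features_labels)), n_shards)

for shard in shards:
    first_idx = int(shard[0][0])   # index of the first example in this shard
    last_idx = int(shard[-1][0])   # index of the last example in this shard
    print(f"kwyk-train-{{{first_idx:05d}..{last_idx:05d}}}.tfrecord")
```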
yes, shards break a binary data stream into accessible pieces without changing the overall structure.
however, nobrainer has a notion of volumes and blocks. if you break a volume into blocks, what matters from the dataset perspective is not the volume index but the block index. hence, len(filenames) is less important than len(blocks).
i'm still not seeing why we want to stick semantics in the filename when the same information can be stored internally as metadata and accessed directly through the tfrecord.
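for example (the shapes and counts here are illustrative only):

```python
import numpy as np

# a 256^3 volume split into non-overlapping 128^3 blocks yields 8 blocks per volume
volume_shape = (256, 256, 256)
block_shape = (128, 128, 128)
n_volumes = 150

blocks_per_volume = int(np.prod([v // b for v, b in zip(volume_shape, block_shape)]))  # 8
n_blocks = n_volumes * blocks_per_volume  # 1200 examples from the dataset's point of view
```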
The only problem with this approach is that the count is tied to the original dataset. That is, if I want to use a subset of the dataset for testing purposes, I have to create the shards from scratch again. Nevertheless, I will go ahead and add the full data count (and optionally the number of volumes in each shard).
just create another dataset for now. yes, in the ideal world (an MVP+1 problem), we would be able to select any subset for train/eval from a dataset or have something that trims a dataset.
We decided to add an extra feature to each record/example labeled "data_count". While we do this, we also need to add logic to adjust the number of volumes in each epoch (in case drop_remainder is set to True during batching). This is also important because the Bayesian MeshNet requires the number of examples upfront. See https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/models/bayesian_meshnet.py#L20
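A rough sketch of what the "data_count" feature and the epoch adjustment could look like (the other feature keys and function names here are placeholders, not the final schema):

```python
import tensorflow as tf

# sketch: attach the dataset-wide example count to every serialized record;
# "data_count" matches the proposal above, the other keys are placeholders.
def to_example(volume_bytes, label_bytes, data_count):
    feature = {
        "volume": tf.train.Feature(bytes_list=tf.train.BytesList(value=[volume_bytes])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_bytes])),
        "data_count": tf.train.Feature(int64_list=tf.train.Int64List(value=[data_count])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# sketch: volumes actually seen per epoch once batching is applied
def volumes_per_epoch(data_count, batch_size, drop_remainder=True):
    if drop_remainder:
        return (data_count // batch_size) * batch_size  # incomplete final batch is dropped
    return data_count
```

The value read back from "data_count" (adjusted as above) is what would then be handed to the Bayesian model as its number of examples.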