Started to make an issue but just changing this one: what we call `VocalDataset` now is mainly used for eval and predict, when we need to make a batch of windows from a single file; but more generally we will need some dataset abstraction where "one file = one sample (x, y) in dataset" (whether there's a y may depend on the model). For example, for adding DeepSqueak in #635.
Thinking about this more.
The `WindowedFrameClassification` class implicitly assumes it's getting batches from a different dataset class during the validation step. There's a good reason for this: we want to compute metrics like segment error rate on a per-file basis, since this is what a user wants to know (among other things): if each of my files is one bout of vocalizations, one song for example, how well will I do per bout?
However it also represents a kind of tight coupling between the model class and the dataset class. And in doing so it conflates the way we load the data with the concept of a "dataset", as discussed in #667; here is where a torchdata pipeline would maybe let us decouple those things. The underlying `FrameClassification` dataset is always just some input data $X_T$, either spectrogram or audio, and a set of target labels $Y_T$ for every frame. The thing that changes across train and eval is what we consider a sample (a window? a batch of strided windows? a file?).
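To make that point concrete, here's a toy illustration (just a sketch; the shapes and numbers are made up) of how the different notions of a "sample" are all views of the same two arrays:

```python
import numpy as np

# the underlying frame classification data: inputs plus one label per frame
X_T = np.random.rand(512, 10_000)            # e.g. spectrogram, (n_freqbins, n_timebins)
Y_T = np.random.randint(0, 10, size=10_000)  # frame labels, (n_timebins,)

window_size = 176

# "sample = one window", as during training: a random slice along the time axis
start = np.random.randint(0, X_T.shape[-1] - window_size)
x, y = X_T[:, start:start + window_size], Y_T[start:start + window_size]

# "sample = one file", as during eval: the whole array,
# later padded and split into a batch of consecutive windows
x_file, y_file = X_T, Y_T
```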
But for now we just need to clarify what a `VocalDataset` is. It's really a `PaddedWindowedFileDataset`: one sample in the dataset, as indexed with the `__getitem__` method, is a batch consisting of a single spectrogram that is loaded and then padded so it can be made into a rectangular batch of consecutive non-overlapping windows with some size $w$.
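Roughly, the pad-and-window step could look like this (just a sketch, not the actual implementation; it assumes the spectrogram has shape `(n_freqbins, n_timebins)` and we window along the time axis):

```python
import numpy as np


def pad_and_window(spect: np.ndarray, window_size: int) -> np.ndarray:
    """Pad a spectrogram along the time axis, then reshape it into a
    rectangular "batch" of consecutive non-overlapping windows.

    ``spect`` has shape (n_freqbins, n_timebins); the return value has
    shape (n_windows, n_freqbins, window_size).
    """
    n_freqbins, n_timebins = spect.shape
    # pad so that n_timebins becomes an exact multiple of window_size
    pad_len = (-n_timebins) % window_size
    padded = np.pad(spect, ((0, 0), (0, pad_len)), mode="constant")
    n_windows = padded.shape[1] // window_size
    # split the time axis into consecutive windows, then move the window axis first
    return padded.reshape(n_freqbins, n_windows, window_size).transpose(1, 0, 2)
```

The model can then run on this `(n_windows, n_freqbins, window_size)` batch, and we concatenate the per-window predictions back into per-frame predictions for the file, trimming the padding at the end.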
We can convert this class to use a memmap or in-memory array by representing sample number with an ID vector, like we do now for the current `WindowDataset` class. There will be some vector `sample_id` that maps to the starting index of each file within the total array. We can compute this dynamically inside the `PaddedWindowedFileDataset` from the existing `id_vector` that is the same length as $X_T$.
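A rough sketch of how that computation could go (variable names here are just illustrative):

```python
import numpy as np

# id_vector assigns a file ID to every time bin in the total array,
# e.g. three files with 3, 2, and 4 time bins:
id_vector = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# the first index of each unique ID is where that file starts in the total array
file_ids, sample_starts = np.unique(id_vector, return_index=True)

# __getitem__(idx) could then slice out one whole file:
idx = 1
start = sample_starts[idx]
stop = sample_starts[idx + 1] if idx + 1 < len(sample_starts) else len(id_vector)
# X[:, start:stop] is the spectrogram for file `idx`, given X with shape (n_freqbins, n_timebins)
```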
Renaming / hijacking this issue to be about other classes for frame classification too. Some of this is needed for #630.

- Rename `WindowDataset` to `RandomWindowDataset`; further clarify / emphasize in docstrings the meaning of a "sample" $x$ in this dataset, i.e. what we get back when we call `__getitem__`: each sample in the dataset is a window; we grab windows at random to build a batch.
- Add a `StridedWindowDataset` (this is what we need for #630 and should replicate `das.data.AudioSequence`). I thought that for `AudioSequence` we literally iterate through the dataset, but no, I can see reading again that the class has a `__getitem__` method and it grabs the batch of strided windows for each index.
- Rename `VocalDataset` to `PaddedWindowedFileDataset`, and rewrite it to use indexing vectors so that we can work from a single in-memory array as in #668; `__getitem__` should grab the corresponding sample, then pad and window it.

After reading `das.AudioSequence` again closely, I realize that our `WindowDataset` is actually a restricted case of `AudioSequence`, as it is used during training. Restricted, because `AudioSequence` introduces a notion of `stride` to determine which windows are chosen. This means that `WindowDataset` is basically an `AudioSequence` with a stride of 1.
The thing that `das.AudioSequence` provides that `WindowDataset` does not is batches of consecutive strided windows. This is how `AudioSequence` is used during evaluation. To implement this using PyTorch conventions, I think we would need a custom sampler or a dataset that's an iterable.
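For reference, here's a rough sketch of what a map-style version could look like; the name `StridedWindowDataset` is the one proposed above, but everything else here is an assumption, not existing code. With `shuffle=False` a `DataLoader` would yield batches of consecutive strided windows, roughly as `AudioSequence` does during evaluation; with `shuffle=True` it would yield random strided windows.

```python
import torch
from torch.utils.data import Dataset


class StridedWindowDataset(Dataset):
    """Sketch: one sample is a (window, labels) pair whose start index is a
    multiple of ``stride``, so a DataLoader with shuffle=False yields batches
    of consecutive strided windows."""

    def __init__(self, X: torch.Tensor, Y: torch.Tensor, window_size: int, stride: int):
        self.X, self.Y = X, Y  # X: (n_freqbins, n_timebins), Y: (n_timebins,)
        self.window_size = window_size
        # valid window start indices, spaced ``stride`` frames apart
        self.starts = range(0, X.shape[-1] - window_size + 1, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        start = self.starts[idx]
        return (
            self.X[..., start:start + self.window_size],
            self.Y[start:start + self.window_size],
        )
```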
However I think we don't actually want to evaluate this way; we want to continue to do what we've been doing, which is to evaluate on a per-file or per-bout basis, so that we can compute frame error and word/syllable/segment error rate per bout.
I was confused about the differences between the two dataset classes because `das.AudioSequence.__getitem__` returns entire batches (a Keras convention?) whereas `WindowDataset.__getitem__` returns a single sample that is used by a `DataLoader` to construct batches (the PyTorch convention).

But we can see that `das.AudioSequence` assembles a batch by grabbing random windows when `shuffle=True`, here:

https://github.com/janclemenslab/das/blob/ea38976f57479c2c6c552b2699e2228e6c02669a/src/das/data.py#L283
```python
if self.shuffle:
    # pts = np.random.randint(self.first_sample / self.stride, (self.last_sample - self.x_hist - 1) / self.stride, self.batch_size)
    pts = np.random.choice(self.allowed_batches, size=self.batch_size, replace=False)
else:
    pts = range(
        int(self.first_sample / self.stride) + idx * self.batch_size,
        int(self.first_sample / self.stride) + (idx + 1) * self.batch_size,
    )
```
(Incidentally, I think this implementation allows for returning the same window across multiple batches, i.e. repeats in the training set? Unless Keras somehow tracks `pts` for a user. But there are so many possible windows, even with strides, that the impact on training is probably minimal.)
We can also see that if `shuffle` is not `True`, then we grab the consecutive strided windows to form a batch.
The other thing I get out of reading the `das.AudioSequence` dataset more closely is that life is just easier if we can treat the data as a giant array (hence, #668).

We are very careful in the current `WindowDataset` to respect the boundaries of each file from the source dataset, by not permitting any window to start within one window's length of the end of each file -- i.e., we consider these invalid indices for the start of a window. `das.AudioSequence`, on the other hand, only marks as out of bounds exactly one window at the end of the single array containing the entire dataset. This does mean that some windows will include the end of one file and the start of another, which might impact what the network learns. But again, these windows are relatively rare, so they (probably!) do not have a huge impact.
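To spell out the difference, here's a sketch comparing the two ways of marking valid window-start indices (variable names are just illustrative):

```python
import numpy as np

window_size = 100
file_lengths = [1500, 2300, 900]   # n_timebins per file in the source dataset
total_len = sum(file_lengths)

# WindowDataset-style: a window may not cross a file boundary, so the last
# (window_size - 1) time bins of every file are invalid start indices
valid = np.ones(total_len, dtype=bool)
offset = 0
for length in file_lengths:
    valid[offset + length - window_size + 1:offset + length] = False
    offset += length
windowdataset_starts = np.nonzero(valid)[0]

# AudioSequence-style: only the tail of the one big concatenated array is
# out of bounds, so some windows straddle the boundary between two files
audiosequence_starts = np.arange(total_len - window_size + 1)
```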
There's also a couple of drawbacks to respecting these boundaries:

- it requires that `window_size` be determined ahead of time
- I bet that cropping to a specified duration would be a lot less likely to fail without all the "boundary windows" removed throughout a training set

Renamed this issue (again?)
After working with these datasets more, I think I am understanding that:

- `stride`, or the fact that we use random windows during training, are just that, details about how we transform the raw data when loading, that should be configured via `dataset_params` (as in #748)
- so we can refactor to use a single `DataPipe` (as in #724) with `params` that we specify via the `dataset_config` (as in #748)
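For example, those loading details could end up living in something like this (a hypothetical sketch of a `dataset_config`, not the actual schema from #748):

```python
# hypothetical dataset_config: the "how we load" details are params of one dataset,
# instead of being baked into separate dataset classes
dataset_config = {
    "name": "frame_classification",
    "params": {
        "window_size": 176,
        "stride": 8,             # stride between window starts; 1 recovers current WindowDataset behavior
        "random_windows": True,  # True for training, False for strided / per-file eval
    },
}
```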
I think `VocalDataset` can be rewritten to be more general, with a lot of the logic moved into transforms. This gives us more flexibility while also making the code more concise.

E.g., the following much simpler version of `VocalDataset` could be combined with the right transforms to give us what we have now, and optionally work with other things, e.g. a model that uses audio as input. The transform should include loading audio, spectrogram files, etc. This would also make it easier to move to DataPipes should we decide to do so.
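Something along these lines (a sketch of the idea; the `item_transform` argument and dict keys here are placeholders, not the exact code):

```python
import pathlib

from torch.utils.data import Dataset


class VocalDataset(Dataset):
    """Sketch of a more general VocalDataset: it only stores paths and
    annotations, and delegates all loading / padding / windowing / target
    generation to ``item_transform``."""

    def __init__(self, source_paths, annots=None, item_transform=None):
        self.source_paths = [pathlib.Path(p) for p in source_paths]
        self.annots = annots
        self.item_transform = item_transform

    def __len__(self):
        return len(self.source_paths)

    def __getitem__(self, idx):
        source_path = self.source_paths[idx]
        annot = self.annots[idx] if self.annots is not None else None
        item = {"source_path": source_path, "annot": annot}
        if self.item_transform is not None:
            # the transform is responsible for loading audio or spectrogram files,
            # padding + windowing, and converting annotations to frame labels
            item = self.item_transform(item)
        return item
```

Whether a sample is a spectrogram window, a padded batch of windows from one file, or raw audio would then be decided entirely by the transform we pass in, not by which dataset class we use.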