Started to make an issue but just changing this one: what we call `VocalDataset` now is mainly used for eval and predict, when we need to make a batch of windows from a single file; but more generally we will need some dataset abstraction where "one file = one sample (x, y) in dataset" (whether there's a y may depend on the model). For example, for adding DeepSqueak in #635.
Thinking about this more.
The `WindowedFrameClassification` class implicitly assumes it's getting batches from a different dataset class during the validation step. There's a good reason for this: we want to compute metrics like segment error rate on a per-file basis, since this is what a user wants to know (among other things): if each of my files is one bout of vocalizations, one song for example, how well will I do per bout?
However it also represents a kind of tight coupling between the model class and the dataset class. And in doing so it conflates the way we load the data with the concept of a "dataset", as discussed in #667; here is where a torchdata pipeline would maybe let us decouple those things. The underlying `FrameClassification` dataset is always just some input data $X_T$, either spectrogram or audio, and a set of target labels $Y_T$ for every frame. The thing that changes across train and eval is what we consider a sample (a window? a batch of strided windows? a file?).
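To make that point concrete, here's a toy illustration (just a sketch; the shapes and numbers are made up) of how the different notions of a "sample" are all views of the same two arrays:

```python
import numpy as np

# the underlying frame classification data: inputs plus one label per frame
X_T = np.random.rand(512, 10_000)            # e.g. spectrogram, (n_freqbins, n_timebins)
Y_T = np.random.randint(0, 10, size=10_000)  # frame labels, (n_timebins,)

window_size = 176

# "sample = one window", as during training: a random slice along the time axis
start = np.random.randint(0, X_T.shape[-1] - window_size)
x, y = X_T[:, start:start + window_size], Y_T[start:start + window_size]

# "sample = one file", as during eval: the whole array,
# later padded and split into a batch of consecutive windows
x_file, y_file = X_T, Y_T
```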
But for now we just need to clarify what a `VocalDataset` is. It's really a `PaddedWindowedFileDataset`: one sample in the dataset, as indexed with the `__getitem__` method, is a batch consisting of a single spectrogram that is loaded and then padded so it can be made into a rectangular batch of consecutive non-overlapping windows with some size $w$.
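Roughly, the pad-and-window step could look like this (just a sketch, not the actual implementation; it assumes the spectrogram has shape `(n_freqbins, n_timebins)` and we window along the time axis):

```python
import numpy as np


def pad_and_window(spect: np.ndarray, window_size: int) -> np.ndarray:
    """Pad a spectrogram along the time axis, then reshape it into a
    rectangular "batch" of consecutive non-overlapping windows.

    ``spect`` has shape (n_freqbins, n_timebins); the return value has
    shape (n_windows, n_freqbins, window_size).
    """
    n_freqbins, n_timebins = spect.shape
    # pad so that n_timebins becomes an exact multiple of window_size
    pad_len = (-n_timebins) % window_size
    padded = np.pad(spect, ((0, 0), (0, pad_len)), mode="constant")
    n_windows = padded.shape[1] // window_size
    # split the time axis into consecutive windows, then move the window axis first
    return padded.reshape(n_freqbins, n_windows, window_size).transpose(1, 0, 2)
```

The model can then run on this `(n_windows, n_freqbins, window_size)` batch, and we concatenate the per-window predictions back into per-frame predictions for the file, trimming the padding at the end.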
We can convert this class to use a memmap or in-memory array by representing sample number with an ID vector, like we do now for the current `WindowDataset` class. There will be some vector `sample_id` that maps to the starting index of each file within the total array. We can compute this dynamically inside the `PaddedWindowedFileDataset` from the existing `id_vector` that is the same length as $X_T$.
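A rough sketch of how that computation could go (variable names here are just illustrative):

```python
import numpy as np

# id_vector assigns a file ID to every time bin in the total array,
# e.g. three files with 3, 2, and 4 time bins:
id_vector = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# the first index of each unique ID is where that file starts in the total array
file_ids, sample_starts = np.unique(id_vector, return_index=True)

# __getitem__(idx) could then slice out one whole file:
idx = 1
start = sample_starts[idx]
stop = sample_starts[idx + 1] if idx + 1 < len(sample_starts) else len(id_vector)
# X[:, start:stop] is the spectrogram for file `idx`, given X with shape (n_freqbins, n_timebins)
```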
Renaming / hijacking this issue to be about other classes for frame classification too. Some of this is needed for #630.

- Rename `WindowDataset` to `RandomWindowDataset`; further clarify / emphasize in docstrings the meaning of a "sample" $x$ in this dataset, i.e. what we get back when we call `__getitem__`: each sample in the dataset is a window; we grab windows at random to build a batch.
- Add a `StridedWindowDataset` (this is what we need for #630 and should replicate `das.data.AudioSequence`). I thought that for `AudioSequence` we literally iterate through the dataset, but no, I can see reading again that the class has a `__getitem__` method and it grabs the batch of strided windows for each index.
- Rename `VocalDataset` to `PaddedWindowedFileDataset`, and rewrite it to use indexing vectors so that we can work from a single in-memory array as in #668; `__getitem__` should grab the corresponding sample, then pad and window it.

After reading `das.AudioSequence` again closely, I realize that our `WindowDataset` is actually a restricted case of `AudioSequence`, as it is used during training. Restricted, because `AudioSequence` introduces a notion of `stride` to determine which windows are chosen. This means that `WindowDataset` is basically an `AudioSequence` with a stride of 1.
The thing that `das.AudioSequence` provides that `WindowDataset` does not is batches of consecutive strided windows. This is how `AudioSequence` is used during evaluation. To implement this using PyTorch conventions, I think we would need a custom sampler or a dataset that's an iterable.
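For reference, here's a rough sketch of what a map-style version could look like; the name `StridedWindowDataset` is the one proposed above, but everything else here is an assumption, not existing code. With `shuffle=False` a `DataLoader` would yield batches of consecutive strided windows, roughly as `AudioSequence` does during evaluation; with `shuffle=True` it would yield random strided windows.

```python
import torch
from torch.utils.data import Dataset


class StridedWindowDataset(Dataset):
    """Sketch: one sample is a (window, labels) pair whose start index is a
    multiple of ``stride``, so a DataLoader with shuffle=False yields batches
    of consecutive strided windows."""

    def __init__(self, X: torch.Tensor, Y: torch.Tensor, window_size: int, stride: int):
        self.X, self.Y = X, Y  # X: (n_freqbins, n_timebins), Y: (n_timebins,)
        self.window_size = window_size
        # valid window start indices, spaced ``stride`` frames apart
        self.starts = range(0, X.shape[-1] - window_size + 1, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        start = self.starts[idx]
        return (
            self.X[..., start:start + self.window_size],
            self.Y[start:start + self.window_size],
        )
```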
However I think we don't actually want to evaluate this way; we want to continue to do what we've been doing, which is to evaluate on a per-file or per-bout basis, so that we can compute frame error and word/syllable/segment error rate per bout.
I was confused about the differences between the two dataset classes because `das.AudioSequence.__getitem__` returns entire batches (a Keras convention?) whereas `WindowDataset.__getitem__` returns a single sample that is used by a `DataLoader` to construct batches (the PyTorch convention).

But we can see that `das.AudioSequence` assembles a batch by grabbing random windows when `shuffle=True`, here:

https://github.com/janclemenslab/das/blob/ea38976f57479c2c6c552b2699e2228e6c02669a/src/das/data.py#L283
```python
if self.shuffle:
    # pts = np.random.randint(self.first_sample / self.stride, (self.last_sample - self.x_hist - 1) / self.stride, self.batch_size)
    pts = np.random.choice(self.allowed_batches, size=self.batch_size, replace=False)
else:
    pts = range(
        int(self.first_sample / self.stride) + idx * self.batch_size,
        int(self.first_sample / self.stride) + (idx + 1) * self.batch_size,
    )
```
(Incidentally, I think this implementation allows for returning the same window across multiple batches, i.e. repeats in the training set? Unless Keras somehow tracks `pts` for a user. But there are so many possible windows, even with strides, that the impact on training is probably minimal.)
We can also see that if `shuffle` is not `True`, then we grab the consecutive strided windows to form a batch.
The other thing I get out of reading the `das.AudioSequence` dataset more closely is that life is just easier if we can treat the data as a giant array (hence, #668).

We are very careful in the current `WindowDataset` to respect the boundaries of each file from the source dataset, by not permitting any window to start within one window's length of the end of each file -- i.e., we consider these invalid indices for the start of a window. `das.AudioSequence`, on the other hand, only marks as out of bounds exactly one window at the end of the single array containing the entire dataset. This does mean that some windows will include the end of one file and the start of another, which might impact what the network learns. But again, these windows are relatively rare, so they (probably!) do not have a huge impact.
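To spell out the difference, here's a sketch comparing the two ways of marking valid window-start indices (variable names are just illustrative):

```python
import numpy as np

window_size = 100
file_lengths = [1500, 2300, 900]   # n_timebins per file in the source dataset
total_len = sum(file_lengths)

# WindowDataset-style: a window may not cross a file boundary, so the last
# (window_size - 1) time bins of every file are invalid start indices
valid = np.ones(total_len, dtype=bool)
offset = 0
for length in file_lengths:
    valid[offset + length - window_size + 1:offset + length] = False
    offset += length
windowdataset_starts = np.nonzero(valid)[0]

# AudioSequence-style: only the tail of the one big concatenated array is
# out of bounds, so some windows straddle the boundary between two files
audiosequence_starts = np.arange(total_len - window_size + 1)
```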
There's also a couple of drawbacks to respecting these boundaries:

- it requires that `window_size` be determined ahead of time
- I bet that cropping to a specified duration would be a lot less likely to fail without all the "boundary windows" removed throughout a training set

Renamed this issue (again?)
After working with these datasets more, I think I am understanding that:

- `stride`, or the fact that we use random windows during training, are just that, details about how we transform the raw data when loading, that should be configured via `dataset_params` (as in #748)
- so we can refactor to use a single `DataPipe` (as in #724) with `params` that we specify via the `dataset_config` (as in #748)
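For example, those loading details could end up living in something like this (a hypothetical sketch of a `dataset_config`, not the actual schema from #748):

```python
# hypothetical dataset_config: the "how we load" details are params of one dataset,
# instead of being baked into separate dataset classes
dataset_config = {
    "name": "frame_classification",
    "params": {
        "window_size": 176,
        "stride": 8,             # stride between window starts; 1 recovers current WindowDataset behavior
        "random_windows": True,  # True for training, False for strided / per-file eval
    },
}
```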
I think `VocalDataset` can be rewritten to be more general, with a lot of the logic moved into transforms. This gives us more flexibility while also making the code more concise.

E.g., the following much simpler version of `VocalDataset` could be combined with the right transforms to give us what we have now, and optionally work with other things, e.g. a model that uses audio as input. The transform should include loading audio, spectrogram files, etc. This would also make it easier to move to DataPipes should we decide to do so.
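Something along these lines (a sketch of the idea; the `item_transform` argument and dict keys here are placeholders, not the exact code):

```python
import pathlib

from torch.utils.data import Dataset


class VocalDataset(Dataset):
    """Sketch of a more general VocalDataset: it only stores paths and
    annotations, and delegates all loading / padding / windowing / target
    generation to ``item_transform``."""

    def __init__(self, source_paths, annots=None, item_transform=None):
        self.source_paths = [pathlib.Path(p) for p in source_paths]
        self.annots = annots
        self.item_transform = item_transform

    def __len__(self):
        return len(self.source_paths)

    def __getitem__(self, idx):
        source_path = self.source_paths[idx]
        annot = self.annots[idx] if self.annots is not None else None
        item = {"source_path": source_path, "annot": annot}
        if self.item_transform is not None:
            # the transform is responsible for loading audio or spectrogram files,
            # padding + windowing, and converting annotations to frame labels
            item = self.item_transform(item)
        return item
```

Whether a sample is a spectrogram window, a padded batch of windows from one file, or raw audio would then be decided entirely by the transform we pass in, not by which dataset class we use.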