vocalpy / vak

A neural network framework for researchers studying acoustic communication
https://vak.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Refactor frame classification models to use single `WindowedFramesDatapipe` #574

Closed NickleDave closed 6 months ago

NickleDave commented 2 years ago

I think VocalDataset can be rewritten to be more general, and a lot of the logic moved into transforms. This gives us more flexibility while also making the code more concise.

E.g., the following much simpler version of VocalDataset could be combined with the right transforms to give us what we have now, and could optionally support other cases, e.g. a model that takes audio as input. The transforms would handle loading audio, spectrogram files, etc. This would also make it easier to move to DataPipes should we decide to do so.

from typing import Callable, Optional, Sequence

import pandas as pd
# TODO: use vocles

from ...typing import PathLike

RETURNS_COLUMNS_MAP = {
    'spect': 'spect_path',
    'audio': 'audio_path',
    'annot': 'annot_path',
}

VOCAL_DATASET_ITEM_KEYS = list(RETURNS_COLUMNS_MAP.keys())

class VocalDataset:
    """Class representing a dataset of vocalizations,
    that can include audio, spectrograms, and annotations."""

    def __init__(self,
                 csv_path: PathLike,
                 returns: Sequence[str] = ('spect', 'annot'),
                 transforms: Optional[Callable] = None,
                 ):
        self.voc_df = pd.read_csv(csv_path)
        if not all([return_ in ('audio', 'spect', 'annot')
                    for return_ in returns]):
            raise ValueError(
                f"Values for 'returns' must all be in: {{'audio', 'spect', 'annot'}} "
                f"but got '{returns}'"
            )
        self.returns = returns
        self.transforms = transforms

    def __len__(self):
        return len(self.voc_df)

    def __getitem__(self, idx):
        voc_row = self.voc_df.iloc[idx, :]

        # paths for any source not in ``returns`` are set to None
        item = {
            key: voc_row[RETURNS_COLUMNS_MAP[key]] if key in self.returns else None
            for key in RETURNS_COLUMNS_MAP
        }

        if self.transforms:
            item = self.transforms(item)
        return item
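
For concreteness, here is the kind of transform this would be composed with to load a spectrogram from the path in the item. This is a hypothetical sketch, not an existing vak transform; the key used inside the .npz file is an assumption.

import numpy as np

class LoadSpect:
    """Hypothetical transform: replace the 'spect' path in an item
    with the spectrogram array loaded from that path."""
    def __init__(self, spect_key: str = 's'):
        # key of the spectrogram array inside the .npz file -- an assumption here
        self.spect_key = spect_key

    def __call__(self, item: dict) -> dict:
        if item['spect'] is not None:
            item['spect'] = np.load(item['spect'])[self.spect_key]
        return item

# usage: dataset = VocalDataset(csv_path, returns=('spect', 'annot'), transforms=LoadSpect())
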
NickleDave commented 1 year ago

I started to make a new issue but will just repurpose this one: what we call VocalDataset now is mainly used for eval and predict, when we need to make a batch of windows from a single file. More generally, though, we will need some dataset abstraction where "one file = one sample (x, y) in the dataset" (whether there is a y may depend on the model), for example when adding Deepsqueak in #635

NickleDave commented 1 year ago

Thinking about this more.

The WindowedFrameClassification class implicitly assumes it gets batches from a different Dataset during the validation step. There's a good reason for this: we want to compute per-file metrics like segment error rate, since that is what a user wants to know (among other things): if each of my files is one bout of vocalizations, one song for example, how well will the model do per bout?

However, it also represents a kind of tight coupling between the model class and the dataset class, and in doing so it conflates the way we load the data with the concept of a "dataset" as discussed in #667; this is where a torchdata pipeline might let us decouple those things. The underlying FrameClassification dataset is always just some input data $X_T$, either spectrogram or audio, and a set of target labels $Y_T$, one for every frame. The thing that changes across train and eval is what we consider a sample (a window? a batch of strided windows? a file?).

But for now we just need to clarify what a VocalDataset is. It's really a PaddedWindowedFileDataset: one sample in the dataset, as indexed by the __getitem__ method, is a single spectrogram that is loaded and then padded so it can be reshaped into a rectangular batch of consecutive non-overlapping windows of some size $w$.
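
A minimal sketch of that pad-and-window step (illustrative only; names and details are not the actual vak implementation):

import numpy as np

def pad_and_window(spect: np.ndarray, window_size: int) -> np.ndarray:
    """Pad a spectrogram (freq. bins x time bins) along the time axis
    so it divides evenly into windows, then reshape it into a
    'batch' of consecutive non-overlapping windows."""
    n_freq, n_time = spect.shape
    n_pad = (-n_time) % window_size  # frames needed to reach a multiple of window_size
    padded = np.pad(spect, ((0, 0), (0, n_pad)))
    n_windows = padded.shape[1] // window_size
    # (n_freq, n_windows * window_size) -> (n_windows, n_freq, window_size)
    return padded.reshape(n_freq, n_windows, window_size).transpose(1, 0, 2)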

We can convert this class to use a memmap or in-memory array by representing the sample number with an ID vector, like we do now for the current WindowDataset class. There will be some vector sample_id that maps each file to its starting index within the total array. We can compute this dynamically inside the PaddedWindowedFileDataset from the existing id_vector, which has one element per frame of $X_T$.
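
Roughly, computing those starting indices from an ID vector could look like this (an illustrative sketch, not the existing implementation):

import numpy as np

# hypothetical example: id_vector has one element per frame,
# giving the index of the source file each frame came from
id_vector = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# starting index of each file within the total array:
# the first frame, plus every frame where the file ID changes
file_starts = np.concatenate(([0], np.nonzero(np.diff(id_vector))[0] + 1))
# -> array([0, 3, 5])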

NickleDave commented 1 year ago

Renaming / hijacking this issue to be about other classes for frame classification too. Some of this is needed for #630

NickleDave commented 1 year ago

After reading das.AudioSequence again closely, I realize that our WindowDataset, as it is used during training, is actually a restricted case of AudioSequence. Restricted because AudioSequence introduces a notion of stride that determines which windows are chosen, which means that WindowDataset is basically an AudioSequence with a stride of 1.

The thing that das.AudioSequence provides that WindowDataset does not is batches of consecutive strided windows. This is how AudioSequence is used during evaluation. To implement this using PyTorch conventions I think we would need a custom sampler or a dataset that's an iterable.
However I think we don't actually want to evaluate this way; we want to continue to do what we've been doing, which is to evaluate on a per-file or per-bout basis. This makes it possible to compute frame error and word/syllable/segment error rate on a per-bout basis.
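
To make the stride relationship concrete, here is a small sketch (illustrative code, not das or vak code) of how window start indices are chosen; with stride=1 every valid window is a candidate, which is what WindowDataset does:

def window_start_indices(n_frames: int, window_size: int, stride: int) -> range:
    """Start indices of all windows of length `window_size` that fit in
    `n_frames` frames, taking every `stride`-th window."""
    return range(0, n_frames - window_size + 1, stride)

# stride=1 recovers WindowDataset's behavior: every valid window is a candidate
# window_start_indices(n_frames=100, window_size=64, stride=1) -> range(0, 37)
# a larger stride thins out the candidate windows, as in das.AudioSequence
# window_start_indices(n_frames=100, window_size=64, stride=8) -> range(0, 37, 8)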

I was confused about the differences between the two dataset classes because das.AudioSequence.__getitem__ returns entire batches (a keras convention?) whereas WindowDataset.__getitem__ returns a single sample that is used by a DataLoader to construct batches (the PyTorch convention).
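
Schematically, the two conventions look like this (a sketch, not the actual das or vak classes):

import numpy as np
from torch.utils.data import Dataset

class KerasStyleSequence:
    """keras convention: __getitem__(idx) returns the idx-th *batch*."""
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

class TorchStyleDataset(Dataset):
    """PyTorch convention: __getitem__(idx) returns a single *sample*;
    a DataLoader collates samples into batches."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]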

But we can see that das.AudioSequence assembles a batch by grabbing random windows when shuffle=True, here: https://github.com/janclemenslab/das/blob/ea38976f57479c2c6c552b2699e2228e6c02669a/src/das/data.py#L283

        if self.shuffle:
            # pts = np.random.randint(self.first_sample / self.stride, (self.last_sample - self.x_hist - 1) / self.stride, self.batch_size)
            pts = np.random.choice(self.allowed_batches, size=self.batch_size, replace=False)
        else:
            pts = range(
                int(self.first_sample / self.stride) + idx * self.batch_size,
                int(self.first_sample / self.stride) + (idx + 1) * self.batch_size,
            )

(Incidentally, I think this implementation allows for returning the same window across multiple batches, i.e. repeats within the training set? Unless keras somehow tracks pts for the user. But there are so many possible windows, even with strides, that the impact on training is probably minimal.)
We can also see that if shuffle is not True, then consecutive strided windows are grabbed to form a batch.

NickleDave commented 1 year ago

The other thing I get out of reading the das.AudioSequence dataset more closely is that life is just easier if we can treat the data as a giant array (hence, #668).

We are very careful in the current WindowDataset to respect the boundaries of each file from the source dataset, by not permitting any window to start within one window's length of the end of each file -- i.e., we consider those indices invalid starting points for a window. das.AudioSequence, on the other hand, only marks as out of bounds exactly one window's length at the end of the single array containing the entire dataset. This does mean that some windows will include the end of one file and the start of another, which might affect what the network learns. But again, such windows are relatively rare, so they (probably!) do not have a huge impact.
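
As a sketch of the difference (illustrative code, not the actual implementations):

import numpy as np

def valid_starts_per_file(file_lengths: list[int], window_size: int) -> np.ndarray:
    """vak's current approach: windows may not cross file boundaries,
    so the last (window_size - 1) frames of each file are invalid start indices."""
    starts, offset = [], 0
    for n_frames in file_lengths:
        starts.append(np.arange(offset, offset + n_frames - window_size + 1))
        offset += n_frames
    return np.concatenate(starts)

def valid_starts_one_array(total_frames: int, window_size: int) -> np.ndarray:
    """das-style approach: treat the whole dataset as one array and only exclude
    one window's length at the very end; windows can cross file boundaries."""
    return np.arange(0, total_frames - window_size + 1)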

There are also a couple of drawbacks to respecting these boundaries:

NickleDave commented 7 months ago

Renamed this issue (again?)

After working with these datasets more I think I am understanding that:

So we can refactor to use a single DataPipe (as in #724), with params that we specify via the dataset_config (as in #748).
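
For example, the dataset_config might carry the parameters that determine how the single datapipe behaves for train vs. eval. This is purely a hypothetical sketch of the idea; the keys and values below are illustrative, not the actual config schema:

# hypothetical dataset_config; names and keys are illustrative only
dataset_config = {
    "name": "WindowedFramesDatapipe",
    "params": {
        "window_size": 176,           # size of windows shown to the network
        "stride": 1,                  # how windows are chosen from the frames
        "return_padding_mask": True,  # needed when padding whole files for eval
    },
}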