chunking, batch creation

keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio

169 stars 22 forks source link

chunking, batch creation #29

Open nkundiushuti opened 5 years ago

nkundiushuti commented 5 years ago

Hi!

I have faced this issue for the past years. I wonder what are the options. I will present my solution which of course is not the best either the most efficient. The problems are: working with datasets which don't fit into memory, and random grouping of chunks from different files. Basically you don't want to end up with a batch made of instances from the same audio file. To solve this problems I scanned the dataset before and assigned an id to each chunk and each file (I count how many chunks I have per audio file). Then I randomize this list and I pop elements when loading batches. Data augmentation: a simple but efficient augmentation that can be applied at this step is to have overlapping chunks. In source separation or inverse problems this is required at the output stage too otherwise you have some discontinuities between chunks. Other data augmentation? Maybe a separate topic. Jan should have more experience with this. I am doing it as a post-processing step but it can be applied layer-wise (augmentation parameter list in which you have different options which are applied at batch level, similarly to batch normalization).

hbredin commented 5 years ago

The problems are: working with datasets which don't fit into memory, and random grouping of chunks from different files. Basically you don't want to end up with a batch made of instances from the same audio file.

To solve this problems I scanned the dataset before and assigned an id to each chunk and each file (I count how many chunks I have per audio file). Then I randomize this list and I pop elements when loading batches.

I do something similar in pyannote.audio:

start by loading detailed metadata (file durations, labels, etc...) about the dataset
then randomly select chunks until I have enough to make a batch

Gathering those random batches is usually the bottleneck (in terms of training time) in all my experiments. One could probably rely on pescador under the hood (or simply extend pescador with this kind of functionality).

faroit commented 5 years ago

@nkundiushuti thanks for bringing this up. This is a big issue for music researchers where we deal with variable length audio and models are not capable to use long term temporal context so we still use chunks of a few seconds.

The problems are: working with datasets which don't fit into memory, and random grouping of chunks from different files. Basically you don't want to end up with a batch made of instances from the same audio file.

This is actually only one part of the problem, and its not even the biggest one. One can formalize it into how we draw samples for 1 an epoch or 2 for one batch. For both we can choose if we want:

	Track	Chunk/Excerpt
A	with replacement	with replacement
B	with replacement	w/o replacement
C	w/o replacement	with replacement
D	w/o replacement	w/o replacement

That are a lot of combinations actually, and I guess that not many researcher systematically evaluate the performance based on the sampling method being used.

While A is easy to implement using batch generators, research showed that that with-replacement sampling performs worse than without-replacement sampling, however this is valid only for non-convex problems and is probably currently under discussion.

I observed for many models the performance slightly improves if for one epoch all chunks/excerpts are really seen (1D). But the same cannot be said for sampling within a batch. Having only unique tracks in a batch might be nice to have and might help for some classification tasks with small data sets. However, at least for source separation, we couldn't find any improvement compared to sampling chunks with replacement. However, the difference for chunks without replacement (=making sure that you see each chunk only once within one epoch) was significant.

Maybe here I refer to my notebook with various examples here.

In my point of view, it is currently still not clear what the best sampling strategy for music track is is as it depends on the application. So I would propose that all we should aim at, is supporting a simple pytorch implementation for the variants mentioned above.

The main problem here for pytorch dataset class and the sampler is that it is solely based on indices. Implementing a hierarchical sampling e.g. tracks -> within_tracks, is not very elegant in pytorch since you first would need to determine the exact number of chunks=samples before you could start training. Many times you would end up using "fake indices" to define a dataset. For audio applications it would actually make more sense if we would get a generator/consumer based dataset api.

@keunwoochoi @f0k what is your take on this problem? Should we provide an efficient helper function for track/excerpt sampling?

faroit commented 5 years ago

One could probably rely on pescador under the hood (or simply extend pescador with this kind of functionality).

@hbredin I used pescador in many of my experience, but you loose performance due to the zmq based parallelization. I didn't systematically benchmark any of this, though...

keunwoochoi commented 5 years ago

Definitely a good thing to have. What'd be the API for the function though? Have a generator or iterator that produces.. file paths maybe?

faroit commented 5 years ago

Definitely a good thing to have. What'd be the API for the function though? Have a generator or iterator that produces.. file paths maybe?

since the sampling is within a file it would be more or less seeking positions...

faroit commented 5 years ago

actually an IterableDataset is currently being developed for PyTorch. For the mentioned hierarchical sampling strategies, it probably would make sense to wait for this to be merged and added to PyTorch before we add our own code.

f0k commented 5 years ago

My usual take on this similar: I first collect the labels and the lengths of all audio files, then I extract random excerpts, uniformly sampling first the files (without replacement), then a position within each file (with replacement).

Gathering those random batches is usually the bottleneck

To be fast enough, I store files in .wav format and open them as memory-mapped files (only when accessing them; I cannot keep them open because then I run out of file pointers for the process). This way there is no overhead for decoding or seeking. If I have a lot of RAM, they will end up in the cache over time, which is about as fast as loading them explicitly into memory, but allows multiple processes to train on the same dataset at the same time. If I don't have a lot of RAM, they'll be loaded from disk, and I place the files on one or two SSDs. To avoid the main thread being bothered with I/O, I defer loading into a background thread with a simple wrapper of the generator. When I have a lot of CPU-based augmentation, I use multiple background threads.

The main problem here for pytorch dataset class and the sampler is that it is solely based indices.

Yes, that wouldn't make sense. If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in __getitem__, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices :D

I count how many chunks I have per audio file

In my "it took numpy half a minute to shuffle" case, the number of chunks per file was the number of spectrogram frames minus the excerpt length in frames plus 1. In my current implementations, the number of chunks per file is the number of audio samples minus the excerpt length in samples + 1.

Should we provide an efficient helper function for track/excerpt sampling?

I'm unsure. I think yes, but it can also become quite a big project that might rather be deferred to another library (I should look at pescador again). I'd like to collect some common use cases / corner cases / aspects that I'd want to be either covered, or not hindered by the implementation.

For training:
1. Obtain batches of random fixed-size excerpts from files that are guaranteed to be longer than that
2. Obtain batches of random fixed-size excerpts from files that may be smaller
3. Obtain batches of random same-size excerpts from files of vastly different lengths, with a given minimum and maximum excerpt size
For testing:
1. Obtain batches of ordered fixed-size excerpts for testing, with configurable size and overlap
2. Obtain full files for testing, not batched
Multiple data sources: Support an arbitrary number of sources sampled together in a consistent way. This could be (input, label) pairs, or (input1, input2, input3) tuples, or just (input,) tuples. (Possibly support named tuples or dictionaries.) Allow sources to be defined separately and then combined later on.
Inhomogeneous datasets: Support datasets of varying sample rates or channel counts. Support datasets of varying numbers of data sources.
Data augmentation: could be done as a step between loading the excerpts and passing them to the model, but that will preclude some efficient implementations. For example, pitch shifting and time stretching can be cheaply obtained by stretching/shrinking the spectrogram before the mel filterbank, or even cheaper by playing with the hz_to_mel/mel_to_hz conversion of the mel filterbank, as well as the FFT window size and stride. Other augmentations such as mixup are better to apply before the STFT, to reduce the number of spectrograms to be computed. (I think we should allow the STFT, mel filterbank etc. to be used either as Modules that are part of the network, or as transformations that are part of the loading pipeline. Modules are callables anyway, so they should be useable as transformations in the spirit of the current torchaudio.transforms module.)

I don't have a solution satisfying all of this yet, and I'm not sure whether we should build this in torchaudio, but it would be nice to have.

For the mentioned hierarchical sampling strategies, it probably would make sense to wait for this to be merged and added to PyTorch before we add our own code.

Yes, we should take care to stay compatible to PyTorch's machinery, so we can benefit from it.

faroit commented 5 years ago

@f0k thanks for your detailed insights.

My usual take on this similar: I first collect the labels and the lengths of all audio files, then I extract random excerpts, uniformly sampling first the files (without replacement), then a position within each file (with replacement).

In this setting, how many excerpts/samples do you yield in total and do you usually evaluate this parameter?

Yes, that wouldn't make sense. If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in getitem, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices 🗡

Yes, we should take care to stay compatible to PyTorch's machinery, so we can benefit from it.

So just to summarize, you are doing tracks (without replacement) and excerpts (without replacement) because it is the only way to efficiently handle very large data sets, right?

I think that means, for now its probably the best to stick with indices based sampling where the indices are only assigned to the number of tracks. The downsides are 1. not seeing all excerpts might not be optimal for small datasets 2. for datasets of very few and very long tracks the definition of an epoch becomes meaningless since they would be super short. That also affects the maximum batch size to be used.

I will implement a proposal for this soon

I'd like to collect some common use cases / corner cases / aspects that I'd want to be either covered, or not hindered by the implementation.

Thats a great listing. Many of the can be handled my pescador but yes, we should think about how to do that in torch audio later.

f0k commented 5 years ago

In this setting, how many excerpts/samples do you yield in total and do you usually evaluate this parameter?

It's basically an infinite iterator, and I yield mini-batches until the validation error does not improve any more, checking the validation error (and possibly adapting the learning rates) every k updates (I sometimes refer to this as a mini-epoch).

So just to summarize, you are doing tracks (without replacement) and excerpts (without replacement) because it is the only way to efficiently handle very large data sets, right?

Excerpts with replacement! I don't want to memorize which excerpts per track have been seen already, so I just hope independently random positions will do.

I think that means, for now its probably the best to stick with indices based sampling where the indices are only assigned to the number of tracks.

Yes, but ideally, it would be easy to change the sampling strategy.

for datasets of very few and very long tracks the definition of an epoch becomes meaningless

I think the definition of an epoch is always kind of meaningless when we train on excerpts. For a small dataset of not too long files, it still doesn't make sense to present all possible excerpts of all possible files as an epoch -- excerpts from the same file will be very similar, even if they don't overlap, so it's redundant to go through all of them. I'm very happy with decoupling the epoch size from the dataset size (and the batch size).

I will implement a proposal for this soon

Don't go too fast! I think what we first need would be a proposal for the API. Not a complete implementation, but a definition of the functions or classes and methods, with docstrings if needed, but without bodies. At this stage it's much easier to change things around than when we already have code with it (that may even have to be thrown away).

Whenever I think about it, I go through the following:

The labels are very different between tasks -- sometimes they are global per file, sometimes I want a label for the center of the excerpt, sometimes I want a label sequence for the excerpt, sometimes a spectral mask, sometimes a combination of the above. Sometimes I don't have labels for training, and I don't have labels for testing ("testing" = "inference on unseen data", not in the strict sense of evaluating a model on my labeled validation or test set).
Audio access is also different between tasks -- sometimes I want to precompute spectrograms for performance, sometimes I can't for memory constraints, sometimes I explicitly want to train on raw audio, sometimes I want to combine different features that may not even live on the same timescale.
Therefore, I want to be able to define access to these things separately and combine them afterwards.
For training, I need pairs/tuples of mini-batches of tensors (e.g., inputs and labels) that correspond. Thus, the definition of the labels and audio cannot be in the form of independent iterators over the data that provide shuffled mini-batches. It also cannot be in the form of an object with a get(file_id, position, length) function, because we may want to include data augmentation, and for some tasks and augmentations this cannot be done independently for the labels and audio.
So, there should be a separation of the dataset / the data sources, and the data iteration. The former would provide access to tensors, in a completely deterministic way, via some get(**kwargs) function. The latter would generate mini-batches from the data sources, providing whatever randomness is needed in the kwargs.
Data augmentation could be implemented by wrapping the data sources. Their get() implementation would call the underlying get() function, modify the result, and return it. Compared to other ways of chaining transformations, this allows them to modify the arguments to the get() call. For example, if a time stretching augmentation is asked to provide a 10-second excerpt at 150% speed, it will need to ask the underlying data source for a 15-second excerpt.
It makes sense for the iterator to know the excerpt lengths. For things like pitch-shifting or time-stretching, it just has to generate random numbers and pass them to the data sources, who will each know what to do. The position could be done in the same way: generate a random number between 0 and 1 and the data source will deal with it. (This would easily extend to multiple dimensions, such as sampling patches from images.) But for my use case "For training: iii", the selection of the files depends on the file lengths. So in addition to a get() function, the data sources should provide the shapes and dtypes of their tensors.
In order to deal with mini-batches from multiple data sources, it may be convenient if the iterator wrapped them in dictionaries or named tuples rather than relying on plain tuples and remembering the order. It's easy as long as you always deal with (input, label) pairs, but may get difficult if you have multiple types of inputs and labels, or not all of them are present for all files.

I'm happy to discuss advantages and shortcomings of this design, or completely different designs.

Many of the can be handled my pescador

I really need to look again why I decided not to use it!

f0k commented 5 years ago

Many of the can be handled my pescador

I really need to look again why I decided not to use it!

Okay, I read through the documentation and some of the code. pescador provides a Streamer, which wraps a generator function along with its args and kwargs, and then can start the generator whenever needed. Such an extra abstraction is important -- also the data iterator in my previous comment would return a generator whenever asked for it. That's what allows data iteration to be parallelized across multiple workers. In addition, pescador provides ways to interleave different streams. This allows to implement interesting forms of hierarchical sampling. However, pescador assumes that the streams already provide matching samples of multiple data sources (e.g., inputs and targets). It uses dictionaries of ndarrays, which I've converged on as well (and then provides tools to assemble dictionaries of samples into dictionaries of batches, and to convert dictionaries into tuples to interface with APIs that require them). What I'm thinking about here is how to provide these matching samples of multiple data sources.

Just to be clear, I'm not looking for a more efficient simulation of the following:

A typical musical genre recognition paper: Chunk up the songs into 10-second excerpts with 5-second overlaps, resulting in N chunks. Pair each chunk up with its class label. Iterate over the pairs in random order. For testing, chunk up the test songs in the same way, pass them through the network, apply a majority vote. :arrow_right: This is the use case covered by any data loader. You pass two same-sized lists of ndarrays and you're done. Just like MNIST. It's limiting the variability of what the network will see during training for no other reason than convenience and it incurs redundant computation at test time (since the overlapped parts will be processed multiple times).

The following should be covered, but is still not what I'm worried about:

Another typical musical genre recognition paper: Load all the songs into memory, as well as the labels, so we have two same-sized lists of ndarrays again. During training, pick random pairs, and randomly crop the song so a 10-second-excerpt remains, calling it "label-preserving data augmentation". :arrow_right: That's easy to implement even using torch.utils.data.DataSet. Audio and labels can even be two separate DataSets that are joined by an overarching DataSet, just like I wanted. When torch.utils.data.DataLoader asks for an item, the overaching DataSet asks the audio and label DataSet, each return their part -- the audio DataSet will only load a random excerpt of the requested item --, and the overarching DataSet returns them as a tuple.
My 2012 speech detector: Precompute spectrograms for all the recordings. Load the labeled segments from a text file, and precompute binary vectors that match the length and frame rate of the respective spectrograms. During training, pick random recordings and random frames within the recordings. Extract a k-second spectrogram excerpt around that frame, padded with zeros if too close to the border, and pick the label corresponding to that frame. :arrow_right: Now that we don't have global labels, cropping is not a label-preserving operation any more: the label needs to be selected based on which excerpt was chosen. We cannot have separate torch.utils.data.DataSets for audio data and labels, with the audio DataSet picking a random excerpt. What I did back then (I was young) was to give the two data sources two RNGs initialized to the same seed, so they'd produce matching samples. What I proposed above was to move the randomness one level out. In this case, it would suffice to put it into the overarching DataSet, so torch.utils.data.DataLoader wouldn't even have to know about it.

What troubles me is the following:

For some use cases, the sampling of recordings for batches should depend on the length, so the DataLoader would need to know about the lengths. (I don't like this.)
Global labels don't have a length, so the DataLoader would need to distinguish between data sources that have a length and sources that don't. (I don't like this.)
Local labels are usually defined in terms of seconds (e.g., as start/stop segments, or as event positions), while audio data is sampled. To match the labels and audio excerpts, someone needs to convert between the two.
Even worse, when we want pairs of audio excerpts and label sequences (instead of a single label), the labels need to be sampled at the correct rate.
Inhomogeneous datasets may have different sample rates for different recordings. So the position and length in terms of samples are not meaningful. So the DataLoader either needs to know about the rate, or it needs to use seconds and hope for no rounding errors.

I've been tossing some ideas around, but I'm not happy yet. What I'd want as an end user is to define an audio data source like this:

audio = AudioFileSource(filenames, ...)

And a label source like this:

class LabelsForThisTask(DataSource):
    def __init__(self, filenames, labeldir):
        # load the segments for each filename

    def __len__(self):
        return len(self.segments)

    def get(self, idx, start, end, rate):
        return self.get_at(idx, (start + end) / 2)

    def get_at(self, idx, position):
        segments = self.segments[idx]
        segment = np.searchsorted(segments['bounds'], position)
        return segments['labels'][segment]

labels = LabelsForThisTask(filenames, labeldir)

Or actually, I guess this class would exist already, but in general I'd want to be able to define the labels in terms of a function that returns the label for a particular file and position or range.

And data iterators like this:

train_loader = RandomExcerptLoader(dict(x=audio, y=labels), length=10)
valid_loader = ExcerptLoader(dict(x=AudioFileSource(filenames_val), y=LabelsForThisTask(filenames_val, labeldir)), min_length=2, max_length=30, max_per_file=1)
test_loader = ExcerptLoader(dict(x=Padding(AudioFileSource(filenames_test), pad=Fraction(51, 70), mode='reflect'), name=StringSource(filenames_test)), max_length=60, overlap=Fraction(51, 70))

And then be able to call them in the training code:

batches = train_loader.feed(batchsize=batchsize, infinite=True)
batches = generate_in_background(batches)
for epoch in epochs:
    for _ in trange(epochsize):
        batch = next(batches)
        training_step(**batch)
    for batch in valid_loader.feed(batchsize=1):
        valid_step(**batch)

But the parts in between are not completely clear to me. My first draft for the data source base class was this:

class DataSource(object):
    """
    Encapsulates a list of tensors of the same dimensionality, but possibly
    different shapes. `shape` and `dtype` can be passed to populate the
    corresponding properties, otherwise they will be inferred when first
    accessed. If `timeless` is given, the items do not have a time dimension
    that can be accessed in `get`; they will be the same for any excerpt.
    """
    def __init__(self, shape=None, dtype=None, timeless=False):
        self._shape = shape
        self._dtype = dtype
        self.timeless = timeless

    def __len__(self):
        # To be implemented in subclasses
        raise NotImplementedError()

    def shape_of(self, idx):
        # To be implemented in subclasses
        raise NotImplementedError()

    def dtype_of(self, idx):
        # To be implemented in subclasses
        raise NotImplementedError()

    def get(self, idx, start=None, stop=None, stride=None):
        # To be implemented in subclasses
        raise NotImplementedError()

    @property
    def shape(self):
        if self._shape is None and len(self) > 0:
            shape = self.shape_of(0)
            for idx in range(1, len(self)):
                shape = tuple(a if (a is not None) and (a == b) else None
                              for a, b in zip(shape, self.shape_of(idx)))
            self._shape = shape
        return self._shape

    @property
    def dtype(self):
        if self._dtype is None and len(self) > 0:
            dtype = dtype_of(0)
            if any(dtype != self.dtype_of(idx) for idx in range(1, len(self))):
                dtype = None
            self._dtype = dtype
        return self._dtype

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.get(key)
        elif isinstance(key, slice):
            start, stop, stride = key.indices(len(self))
            return [self.get(idx) for idx in range(start, stop, stride)]
        else:
            raise KeyError('Unsupported key %r, expected int or slice' % key)

I don't like the timeless flag. Could be a parent class. Or, to make it more generic, the data source could tell how many (or which) dimensions are meaningful to take excerpts of, so it would support global labels, sequences, images and volumes.
It's still missing a clear approach on how to handle sample rates, frame rates, and seconds. The data source could have a rate property similar to shape and dtype, and a rate_of method. But what would that mean for label sources, which are not discretized? Should this become a parent class again? Also, the get function for label sequences should accept a rate at which it is meant to be sampled. Audio sources could get the same keyword argument, forcing them to resample. But what is the unit of start and end? Would it be in terms of rate? And if rate is omitted, start and end are in terms of the native rate? Is start and end good, or should it be start and length, or even start and count? Who provides the rate for label sources? The data iterator? Or should this be fixed at construction time? Do we need it to be variable for any of the use cases?
It's still missing an approach for concerted data augmentation. I was thinking about signaling to the data iterator what kind of randomness it needs and under what names, and then receiving that as additional keyword arguments in get. Data sources that request randomness with the same name would get the same values. Possibly a bit brittle.

Well, sorry for the long post, I hope to spur some discussion!

f0k commented 5 years ago

Data sources that request randomness with the same name would get the same values. Possibly a bit brittle.

Actually, if the user can override which key each data source uses, and can tell the iterator what source of randomness to use for each key, it's not so brittle any more. Each augmenting data source would use some default key (such as pitch_shift for a raw audio pitch shifter, or a pitch shifting mel filterbank, or a pitch shifting label transformator) that could be overridden for special purposes. I'd imagine something like this:

audio = AudioFileSource(...)
audio = STFT(audio, ...)
audio = ShiftingMelFilterbank(audio, ...)
labels = ...
labels = LabelPitchShift(labels, ...)
train_loader = RandomExcerptLoader(dict(x=audio, y=labels), rngs=dict(pitch_shift=Uniform(0.7, 1.3)))

And the loader would do something like:

def feed(batchsize, infinite=False, drop_remainder=True):
    batch = {k: np.empty(...) for k in self.sources}
    idxs = np.arange(len(self))
    while True:
        np.random.shuffle(idxs)
        count = 0
        for idx in idxs:
            randomness = {k: v.sample() for k, v in self.rngs}
            for k, source in self.sources.items():
                batch[k][count] = source.get(idx, *randomness)
                count += 1
                if count == batchsize:
                    yield batch
                    count = 0
        if count < batchsize and not drop_remainder:
            yield {k: data[:count] for k, data in batch.items()}
        if not infinite:
            break

Aaaand some more thoughts and design questions.

I have two working assumptions that influence the design:
1. I assume that for networks that work with mel spectrograms, the most performant time stretching / pitch shifting data augmentation is to adapt the STFT and filterbank. E.g., pitch shifting can be done by scaling the mel/hz conversion, and time stretching by scaling the STFT stride. This will not give a very high quality (e.g., it will do bad things to transients), but I've previously worked with stretching the spectrogram using bicubic or bilinear interpolation, and that seemed good enough for the network. Of course this will depend on the task.
2. I assume that the most efficient resampling is to adapt the STFT and filterbank. E.g., instead of downsampling the input from 44.1 kHz to 22.05 kHz and then applying a 1024-point STFT with hop size 441, we can apply a 2048-point STFT with hop size 882 and discard half of the bins. And instead of upsampling from 16 kHz, we can apply a 744-point STFT with hop size 320 and adapt the mel filterbank. Whether this is true depends on which device is used for resampling and the STFT, and how expensive the lowpass filter is (in case of downsampling). Is this too adventurous?
Transformation chain vs. Wrapping: The neural network models in PyTorch are composed of independent transformations that are then chained together. This means information only passes in one direction: earlier transformations can pass information to later transformations. What I assumed in my proposal above was to wrap transformations instead. That is, a transformation (such as an STFT, or mel filterbank) would be a data source that wraps another data source. This allows information to be passed in two directions. When the topmost data source is asked about its data shapes, or a particular excerpt, it can modify the request that is passed down, and the answer coming back.
Design question: Where do we put data augmentation? I assumed in the data source. If we put it in the network, it will need to handle and modify the labels, which would be weird, or the loss function needs to adapt the labels. For this to work, the information on what the augmentation did needs to be shared in some way, or passed through the transformation chain. Furthermore, the network is forward-only, so it cannot tell the data loader to only load the parts that will not be discarded (imagine a time stretching augmentation from 50%-200%: we would always have to load twice the amount of data if we don't know what the augmentation will need).
Design question: Where does batching occur? Above, I assumed it would happen in the data iterator, so the data sources would always just return single data points. (The data iterator would pre-allocate the batch and tell the data sources where to write to, to avoid extra copies and np.stack().) However, always delaying batching until the end is inefficient, since some operations are faster when applied in batches (such as the STFT or mel filterbank). Letting each data source class decide if it implements a per-item getter or a batchwise getter is difficult, unless the bottommost data source always returns batches. However, batching right from the start may be inefficient or too memory-hungry (for very large excerpts, we may want to compute the mel spectrograms separately and only batch them afterwards). A possible solution would be to have a Batcher that the user can insert anywhere (pescador has something like this as well). This would leave some extra burden to the user, but also give more control.
Design question: Who decides the start and length of excerpts, and in what form? Above, I assumed it would be the data iterator. Now imagine an STFT in the data source: The data iterator would only see the reduced lengths and rate, how would it know that it can choose the position on a finer scale (i.e., at sample level)? Should it always just use a number between 0 and 1 for the position, and let the data source map it to something feasible? How can we ensure the labels still match up then?
The proposal in my previous comment does not include a way for efficient streaming at test time. At test time, I usually try to predict on full recordings at once, but for memory constraints, I may have to cut them. To obtain the same results regardless of cutting, I cannot zero-pad the excerpts, I just need to overlap them by the correct amount. Now I don't want to call the getter function again and again with overlapping excerpts -- this would incur redundant computation for the overlaps, and it may incur reopening, seeking and closing the file every time. There should be a stream() method as well that yields a sequence of chunks of a requested length, and some
The proposal above uses an idx to denote the recording, maybe a generic key would be nicer that can also be a str. That would make it easier to combine data sources that do not have the same items (but some overlaps).
When data augmentation is implemented by giving randomness sources to the data iterator, the choice of data points and the choice of excerpts could be implemented in the same way. That would make the data generator base class very generic, it would be reduced to:
```
def feed(self, batchsize, **kwargs):
  while True:
      randomness = {k: rng.sample(batchsize) for k, rng in self.rngs.items()}
      kwargs2 = dict(kwargs)
      kwargs2.update(randomness)
      yield {k: source.get(batchsize, **kwargs2) for k, source in self.sources.items()}
```
But it would also mean the randomness sources (some of them) need information on the items and their lengths, and keep state.

Great, by now I think I've lost everyone? If you don't want to read all of the above, post a sketch proposal for a data loading / iteration API and we can compare. By now I'm still hopeful there's a solution that fits all use cases, but if it gets too complicated, we'll have to sacrifice some use cases.

Speaking of use cases, another challenge I thought of for the design:

Train a network to pitch-shift a mel spectrogram by a given amount. E.g., we want x, y pairs with x shifted by a and y by a + b. The network would receive x and b to reproduce y.

ksanjeevan commented 5 years ago

pitch shifting can be done by scaling the mel/hz conversion, and time stretching by scaling the STFT stride. This will not give a very high quality (e.g., it will do bad things to transients), but I've previously worked with stretching the spectrogram using bicubic or bilinear interpolation, and that seemed good enough for the network

@f0k could you elaborate a bit more? I was using the phase_vocoder for the time stretching and was thinking of how to do the sinc resampling for the pitch shift... looks like you're saying it's not worth the cost in your experience?

Where do we put data augmentation? I assumed in the data source. If we put it in the network, it will need to handle and modify the labels, which would be weird, or the loss function needs to adapt the labels. For this to work, the information on what the augmentation did needs to be shared in some way, or passed through the transformation chain

Could you give an example where having the augmentation in the dataset would be better than after batching in terms of the labels? I think I've had a problem related to this: when randomly time stretching sequences the descending length order is not preserved which is needed for packing, so the information of what augmentation was applied had to flow forward...is this a valid example of what you're referring to?

f0k commented 5 years ago

looks like you're saying it's not worth the cost in your experience?

Yes, but I guess it strongly depends on your task. I was doing voice activity detection for music (see paper and code). Since the network only sees mel spectrograms, doing a high-quality pitch shifting on the time-domain signal is probably overkill (disclaimer: I didn' try).

I think I've had a problem related to this: when randomly time stretching sequences the descending length order is not preserved which is needed for packing, so the information of what augmentation was applied had to flow forward...is this a valid example of what you're referring to?

Yes, that's what I mean. Depending on the task, data augmentation may need to affect both the input and labels. And depending on the data augmentation, you may be able to gain performance by factoring it in right from the start, when loading the excerpts for a batch from disk (that was the time stretching example). Whether it's possible to do this, and how easy it is to do this, depends on several choices:

Does the data augmentation pipeline only pass data in forward direction, or does it pass requests in backward direction first? PyTorch's nn.Module-based networks only pass data in forward direction.
Does the pipeline pass inputs and labels together all the time, or are they two separate pipelines? The former makes it easier to transform both in sync, but may make it harder to write reusable code (e.g., it's easy to write a function that transforms the audio data and labels together for a pitch detection task, but we will want to reuse it for a pipeline that does not need to change the labels, or does not include the labels, or has different types of labels that need to be handled differently, or has different types of input that need to be handled differently). The latter makes it easier to write reusable code, but requires the information on what shifts to apply to what item to be shared between the two pipelines in some way. If the pipeline is forward-only, it could flow forward along with the data, if it's backward-forward, it can flow backwards as part of the request.

Could you give an example where having the augmentation in the dataset would be better than after batching in terms of the labels?

This was based solely on the assumption that the network was based on nn.Modules. They are forward-only, and all the existing layers in nn don't allow passing additional data along with the tensors. On the other hand, we would probably provide our own layers for all of this, so disregard that argument. But the other disadvantage of forward-only remains: We cannot easily adapt the excerpts to load based on the time stretching settings.

f0k commented 5 years ago

Does the pipeline pass inputs and labels together all the time, or are they two separate pipelines? The former makes it easier to transform both in sync, but may make it harder to write reusable code [...]. The latter makes it easier to write reusable code, [...]

Ok, let me challenge this right away -- it's not really needed to have two pipelines for this. It can also be one pipeline that passes on dictionaries of data, and has most nodes only operate on one item of the dictionary (e.g., we would have a node that pitch-shifts spectrograms and does not touch or look at the labels, and another that pitch-shifts labels). This would allow nodes that can operate on multiple items, which may be useful for some purposes. Note that we'll always want to make it easy to define the pipeline in a way that it can also be used for testing, when labels are not available. That's why I thought it's conceptually useful to have separate pipelines for different data sources, so you'd just not instantiate the label data source at test time, and instantiate the audio data source the same way as before. And note that at the very beginning of the pipeline, we probably don't want to have a single node that inserts audio and labels into the pipeline at once, but separate nodes for separate sources (again, to make it easier to reuse them). If we have separate pipelines, they will have separate beginnings, so that's a no-brainer. If we have a joint forward-only pipeline, than we'd have some nodes in the chain add items to the dictionary rather than modifying them. And maybe a single source node that feeds in file names / URIs and excerpt positions, and other nodes that add the chosen data augmentation settings -- in a backward-forward pipeline, that's what would be provided by the request.

So we can also have a single pipeline that passes dictionaries, instead of multiple pipelines that pass tensors. Still a backward-forward pipeline has the advantage that nodes can modify the request as they trickle down, and modify the data as it bubbles up.

/edit: Please continue questioning! This helps getting a clearer picture.

keunwoochoi commented 5 years ago

https://github.com/librosa/librosa/pull/872 Recently librosa added Stream generator.

ATriantafyllopoulos commented 5 years ago

@f0k we have finally open-sourced our package, which you can find here: https://github.com/audeering/audtorch

First a few words regarding its design principles before I try to address your points:

When we started designing it pytorch was still in its 0.x version so there was no native STFT
We use a similar structure as torchvision. This means defining a Dataset object (you can find a collection of open source data sets already there) which defines a transform which operates on the data and a target_transform which operates on the labels.
Data augmentation and feature extraction currently takes place on the CPU, since we are using numpy and librosa as our backend. We are planning to include multiple backends, with torch as our number one priority so this step can be implemented on the GPU as well. However, I can imagine that in the end we need to support both, as some transforms will probably be more efficient on the CPU (with multi-threading).
We decided not to go for dict returns to avoid custom collate functions in the default case, but already support some collate functions for common use cases and plan to add more later
Our supports can be defined as a linear processing chain with each of them operating on the output of the previous one. This currently involves some manual hacking when you need to access a parameter of the transform further down the processing chain. An example that comes to mind is the phase of our Spectrogram transform which we save as an object attribute so it can be accessed later, e.g. for signal reconstruction (https://github.com/audeering/audtorch/blob/346cf73345b1427a890e33d2ebcbb7b4f30874b9/audtorch/transforms/transforms.py#L804)

And now on to your points :

Our pipeline currently works satisfactorily for data sets with a large number of relatively short files (e.g. VoxCeleb1). We are working on a solution that supports loading random excerpts from a collection of large files. Unfortunately, we do not yet have a proper solution for that.
As I mentioned in #40, our pipeline does not work well with resampling. It would be interesting to try your suggested approach there, both in terms of speed and in terms of network performance. I will keep you posted once I do that.
We currently use PyTorch's default DataLoader for batching. This works quite well in my opinion (at least for the use case of many short files). Is there any reason why you would want to change this?
Regarding mini-batches from multiple sources, I assume that it would be better to concatenate Datasets instead of having the DataLoader do that. I also assume that you are talking about the use-case where all Datasets would return the same features/labels. Is that correct?

faroit commented 5 years ago

@ATriantafyllopoulos that looks really great! congrats! I will have a deeper look later.

We are working on a solution that supports loading random excerpts from a collection of large files. Unfortunately, we do not yet have a proper solution for that.

I just had a quick look at the BucketSampler, so this does not address excerpt sampling/chunking of long audio/spectrograms from from a single datasets, right? You have an opinion how to do implement this in your framework?

As long as the iterable dataset isn't added one could do a) use dataset with with fake indices or b) put this logic in the sampler.

ATriantafyllopoulos commented 5 years ago

I just had a quick look at the BucketSampler, so this does not address excerpt sampling/chunking of long audio/spectrograms from from a single datasets, right? You have an opinion how to do implement this in your framework?\

@faroit no this provides a different functionality. It is used to splits samples of one data set in buckets and then sample from the buckets in a specific way.

We are currently experimenting with chunking of long audio. There are two alternative solutions:

a) Save an exhaustive list of offsets/durations for each file. Then use that to index the underlying DataSet and load the appropriate chunk using something like `audiofile.read(..., offset, duration). This has the benefit that PyTorch will take care of iterating through the entire data set, but its drawback is that I am not sure how fast loading a chunk of the file from disk is. It might be that it works well with wavs, but not other formats because of decoding. Fast audio loading is an issue of its own right anyway (#31).

b) Cache a subset of the files in memory and do a potentially non-exhaustive loop over them by loading chunks of a specific length. For this we have run some tests and it is quite fast, but you run into all sorts of problems with sampling bias. For example, it could be that our cache creates an imbalanced class distribution that hinders network training, or that because of the chunking lots of snippets from the same file end up in the same batch thus leading to over-fitting and the like.

So I am not sure if we should provide both solutions or just one, and I am not sure what would be the best way to choose. Perhaps if we have a well-defined problem and data set, and we have an architecture that is easy to train and provably works (e.g. there is a good paper about it), we could move ahead with benchmarking both solutions.

Any suggestions?

faroit commented 5 years ago

So I am not sure if we should provide both solutions or just one, and I am not sure what would be the best way to choose. Perhaps if we have a well-defined problem and data set, and we have an architecture that is easy to train and provably works (e.g. there is a good paper about it), we could move ahead with benchmarking both solutions.

I would be happy to share our source separation repo soon which could serve as a good benchmark for this. I actually implemented a) using torchaudio and pysoundfile as backends which both support seeking. Results were good for wav and flac, but I couldn't saturate the gpu for mp3s... will share that soon.

One downside is a that sample accurate seeking is not possible with mp3 and mp4 without loading the full audio first (slow). That way, seeking i usually implemented using duration in seconds, which could result in not exactly all samples are seen once in batch, however that is properly not an issue in practice.

Again, if we would make loading audio decoding as fast as possible this would make it way easier to do performant chunking. So IMO we really should address #31 first.

f0k commented 5 years ago

we have finally open-sourced our package

Cool!

We currently use PyTorch's default DataLoader for batching. This works quite well in my opinion (at least for the use case of many short files). Is there any reason why you would want to change this?

The problem is that it only has a single integer index for identifying a data point. If we want to identify data points also by the position, we would need to define a bijection from (file, position) to int. If we also want to identify data points by data augmentation parameters, we're lost.

Regarding mini-batches from multiple sources, I assume that it would be better to concatenate Datasets instead of having the DataLoader do that. I also assume that you are talking about the use-case where all Datasets would return the same features/labels. Is that correct?

No, when I was talking about sources, I meant a vertical split of the dataset, not a horizontal one -- one source could be the spectrograms, another could be the labels, another could be self-similarity lag matrices, but usually each source would contain the same data points. My reasoning for splitting things like this is in the first three bullet points in https://github.com/keunwoochoi/torchaudio-contrib/issues/29#issuecomment-477643358. And when we split the implementation like this, one option to join the sources would be in the data loader.

Our supports can be defined as a linear processing chain with each of them operating on the output of the previous one. This currently involves some manual hacking when you need to access a parameter of the transform further down the processing chain.

This would be solved more elegantly by passing on dictionaries which any node can modify. This would allow nodes to pass on multiple items of data.

We use a similar structure as torchvision. This means defining a Dataset object (you can find a collection of open source data sets already there) which defines a transform which operates on the data and a target_transform which operates on the labels.

Just schematically, how do you implement a dataset with a pitch shifting augmentation affecting both the audio and the labels?

ATriantafyllopoulos commented 5 years ago

The problem is that it only has a single integer index for identifying a data point. If we want to identify data points also by the position, we would need to define a bijection from (file, position) to int. If we also want to identify data points by data augmentation parameters, we're lost.

The way we chose to handle this for now is to assume that it is still the Dataset's job for taking care of all that under the hood, and simply provide an interface to the DataLoader for iterating through the data.

Regarding chunking for example, a naive approach would be that the Dataset creates a list of all possible chunks, and then the Dataloader would simply shuffle through all them. This is not ideal because of the overhead of creating and managing that list.

But still, conceptually I still think it is better that Dataset takes care of creating some list (or similar) that the DataLoader then shuffles through, and that the DataLoader takes care of everything from random shuffling, to weighted sampling with and w/o replacement, etc.

No, when I was talking about sources, I meant a vertical split of the dataset, not a horizontal one -- one source could be the spectrograms, another could be the labels, another could be self-similarity lag matrices, but usually each source would contain the same data points. My reasoning for splitting things like this is in the first three bullet points in #29 (comment). And when we split the implementation like this, one option to join the sources would be in the data loader.

I'm still struggling with what you mean here. Let me see if I get this straight:

Each task may require different features. E.g. in one task you might need to work with a spectrogram input and in another you may need to work with a raw audio input. You would like to have a common interface that's independent of the kind of features that are used. Is that correct?
Sometimes, you might need to pre-compute these features for efficiency, so the framework must be able to work with loading features from memory instead of only working with raw audio. Do you think that data augmentation should be part of the pre-processing? Is this what you mean by "indexing by data augmentation" like you said above?
Combining multiple sources: would this be something like: "I need both the spectrogram and the mfccs as input for my model, so get the precomputed MFCCs from memory for efficiency but compute the spectrogram on the fly due to memory constraints and make sure that the MFCCs and the spectrogram correspond to the correct file and possible chunk of a file"?
You would also want potentially different label sources (e.g. one label per file, or labels on different time-scales), and also a generic way to modify and combine those labels. You also expect that sometimes the labels might be derived from the audio itself and thus would need to undergo the same transformation (e.g. in the simple case where you add noise to audio as input and use the original audio as target so that you can train some denoising architecture).

This would be solved more elegantly by passing on dictionaries which any node can modify. This would allow nodes to pass on multiple items of data.

True. In our initial implementation we decided against using dict returns because that would require a specific format for labelling the data set's returned items that we thought was a) too restrictive, b) would only work within the context of our package. But maybe there is indeed good reason to do that.

Just schematically, how do you implement a dataset with a pitch shifting augmentation affecting both the audio and the labels?

Currently, by creating a wrapper data set that accesses the original data set under the hood and takes care of applying the same transformation to the audio and the labels. But I see your point, this would get really complicated if you needed to link multiple transforms that would operate on either the audio alone or on the audio and the labels simultaneously.

This is why I am leaning towards switching back to a dict output as defined above. Then every transform would get a chance to operate on both the audio and the labels, or the audio only, or the labels only, etc. Is this what you have in mind?

f0k commented 5 years ago

Regarding chunking for example, a naive approach would be that the Dataset creates a list of all possible chunks, and then the Dataloader would simply shuffle through all them. This is not ideal because of the overhead of creating and managing that list.

Indeed, see the third paragraph in https://github.com/keunwoochoi/torchaudio-contrib/issues/29#issuecomment-475692083:

If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in __getitem__, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices :D

Note that (b) can be avoided by using weighted sampling. But a more efficient sampling scheme is to pick the file at random without replacement, then pick a random position within the file at random with replacement. To enable this, the dataset will need to have some kind of multi-dimensional indexing, with one index specifying the file and another specifying the position within the file. And then the data loader will need to know which indices are valid (e.g., the number of files, and the length of each file), or the data set will need to perform some mapping (e.g., map values from 0.0 to 1.0 to a position within the file).

I'm still struggling with what you mean here. Let me see if I get this straight: [...]

Yes, this is correct. I'm not caring a lot about precomputing features or not, but about modular code. For that reason alone I would want to separate the implementations for different sources (or modalities, or whatever you want to call them). Look at the Rescale class in https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#transforms for a counter-example: It's scaling the image along with the landmark labels. I cannot reuse it to only scale an image, or process an image along with a class label, I would need to copy/paste and adapt it. I think this is a poor design.

In our initial implementation we decided against using dict returns because that would require a specific format for labelling the data set's returned items [...]

The user could tell the nodes at construction time what dict keys they should read and write to. Or did you mean something else?

This is why I am leaning towards switching back to a dict output as defined above. Then every transform would get a chance to operate on both the audio and the labels, or the audio only, or the labels only, etc. Is this what you have in mind?

This is one of the possible solutions I had in mind. There are many different options, that's why the thread has become impossible to read through. Note that having a single node apply a transformation to both the audio and the labels would still be along the lines of the counter-example I linked above. Instead I'd want a node that transforms the audio, and another one that transforms the labels, so depending on my needs, I can use either or both or none of them in my pipeline. And once we agree that this would be a good structure, we need to figure out where the transformation parameters come from, because they need to be the same for the two nodes.

What I thought out loudly above were basically two independent choices for this:

Are the transformation parameters part of the query by the data loader? This would mean the dataset becomes a pipeline with pull semantics (a query passed backwards through the pipeline, and the result coming back forwards). Or are the transformation parameters inserted into the pipeline and passed forward along with the data? This would be a pipeline with push semantics.
Are different sources combined in a single pipeline, or do we have a separate pipeline for each source?

Do you think that data augmentation should be part of the pre-processing? Is this what you mean by "indexing by data augmentation" like you said above?

No, usually it won't be part of the pre-processing -- what I meant was that if the data loader is already responsible for picking the random indices denoting the files or the positions within files, it could also become responsible for picking the amount of pitch shifting and time stretching. Then the dataset (including the transformation pipeline) would be completely deterministic.

ATriantafyllopoulos commented 5 years ago

To enable this, the dataset will need to have some kind of multi-dimensional indexing, with one index specifying the file and another specifying the position within the file.

This is is the solution I am also in favor of.

Yes, this is correct. I'm not caring a lot about precomputing features or not, but about modular code.

Agreed, modularity is what we were aiming for when we designed our transforms. Which is why I think transforms on different modalities or streams should remain independent.

The user could tell the nodes at construction time what dict keys they should read and write to. Or did you mean something else?

Well, maybe this is where I went too far with code modularity. We designed our transforms to depend on the input being what it is supposed to be (e.g. spectrogram expects an array that contains the raw audio you want to get the spectrogram from). I would still like to keep that in case someone would want to use the package without all its components. A workaround would be to provide optional dict arguments. If one is specified, then the transform would assume that the input is a dict, and would operate only on the appropriate parts.

Instead I'd want a node that transforms the audio, and another one that transforms the labels, so depending on my needs, I can use either or both or none of them in my pipeline. And once we agree that this would be a good structure, we need to figure out where the transformation parameters come from, because they need to be the same for the two nodes.

We also chose to process each stream independently. For the cases when we needed to have a transform that operates on both the features and the labels, we did the following: (I am referring to this piece of code)

1) In this instance the labels are generated from the input audio, but they might as well come from a different source, so this step is not so important.

2) We apply some preprocessing to the input before generating the labels.

3) We then need a transform that is jointly applied to the features and the labels. This transform might have some fixed parameters (e.g. window size of the FFT) which are specified as arguments in the instantiation of the transform class. It might also have a "randomness" feature associated with it (e.g. random cropping).

We chose to use the same instance of the transform class to operate both on the features and the labels. In this case, its fixed components are by definition the same. The only problem arises with its "randomness". We chose to make the parameters that control it class attributes, and use a fix_randomization attribute that turns the random generation of these parameters on and off.

4) Finally, we have separate transforms for the features and the labels from that point on.

The problem is that steps 3) and 4) might have to be repeated ad infinitum. We might need to switch back and forth from transforms that operate only on the features to transfoms that operate both on features and labels. What about the following as a solution:

a) We change our Transform classes to optionally operate on dictionary inputs (just so we can make them easy to use in the simpler cases and not break our backwards compatibility).

b) Then we define a processing chain (essentially a Compose transform).

c) In that processing chain, we have three kinds of transforms:

i) The ones that operate on the features only. These would receive an argument such as features.

ii) The ones that operate on labels only, which would get a labels argument.

iii) The ones that operate on both features and labels. These would get a list of args ['features', 'labels']. Then they would be applied to both streams with exactly the same parameters (the fixed ones are defined during instantiation and the randomized ones are defined ones at function call).

Does this sound reasonable?