Open nkundiushuti opened 5 years ago
Hi! I have faced this issue for the past years and I wonder what the options are. I will present my solution, which of course is neither the best nor the most efficient.

The problems are: working with datasets which don't fit into memory, and random grouping of chunks from different files. Basically you don't want to end up with a batch made of instances from the same audio file.

To solve these problems I scanned the dataset beforehand and assigned an id to each chunk and each file (I count how many chunks I have per audio file). Then I randomize this list and pop elements when loading batches.

Data augmentation: a simple but efficient augmentation that can be applied at this step is to use overlapping chunks. In source separation or inverse problems this is required at the output stage too, otherwise you get discontinuities between chunks. Other data augmentation? Maybe a separate topic; Jan should have more experience with this. I am doing it as a post-processing step, but it can be applied layer-wise (an augmentation parameter list with different options which are applied at batch level, similarly to batch normalization).
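A minimal sketch of the chunk-indexing scheme described above, assuming non-overlapping fixed-length chunks and using `soundfile` to obtain file lengths (the function names are made up for illustration):

```python
import random
import soundfile as sf  # assumed backend for querying file lengths

def build_chunk_index(filenames, chunk_len):
    """Scan the dataset once and assign an id to every (file, chunk) pair."""
    index = []
    for file_id, name in enumerate(filenames):
        n_chunks = sf.info(name).frames // chunk_len  # non-overlapping chunks
        index.extend((file_id, chunk_id) for chunk_id in range(n_chunks))
    return index

def iterate_batches(filenames, chunk_len, batch_size):
    """Shuffle the chunk list and pop elements to form mixed batches."""
    index = build_chunk_index(filenames, chunk_len)
    random.shuffle(index)
    while len(index) >= batch_size:
        batch = [index.pop() for _ in range(batch_size)]
        # yields (filename, start_sample) pairs; the actual audio loading is left out
        yield [(filenames[f], c * chunk_len) for f, c in batch]
```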
I do something similar in pyannote.audio:
Gathering those random batches is usually the bottleneck (in terms of training time) in all my experiments. One could probably rely on pescador under the hood (or simply extend pescador with this kind of functionality).
@nkundiushuti thanks for bringing this up. This is a big issue for music researchers, where we deal with variable-length audio and models that are not capable of using long-term temporal context, so we still use chunks of a few seconds.
The problems are: working with datasets which don't fit into memory, and random grouping of chunks from different files. Basically you don't want to end up with a batch made of instances from the same audio file.
This is actually only one part of the problem, and it's not even the biggest one. One can formalize it as how we draw samples (1) for an epoch or (2) for one batch. For both we can choose if we want:
| Variant | Track | Chunk/Excerpt |
|---|---|---|
| A | with replacement | with replacement |
| B | with replacement | w/o replacement |
| C | w/o replacement | with replacement |
| D | w/o replacement | w/o replacement |
Those are a lot of combinations, and I guess that not many researchers systematically evaluate the performance based on the sampling method being used.

While A is easy to implement using batch generators, research showed that with-replacement sampling performs worse than without-replacement sampling; however, this is valid only for non-convex problems and is probably currently under discussion.
I observed for many models that the performance slightly improves if all chunks/excerpts are really seen once per epoch (1D). But the same cannot be said for sampling within a batch. Having only unique tracks in a batch might be nice to have and might help for some classification tasks with small data sets. However, at least for source separation, we couldn't find any improvement compared to sampling chunks with replacement. The difference for chunks without replacement (= making sure that you see each chunk only once within one epoch), on the other hand, was significant.
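To make the epoch-level distinction concrete, here is a rough sketch of index generators for two of the variants (A and D); `chunks_per_track` is an assumed precomputed list, not part of any existing API:

```python
import random

def epoch_variant_d(chunks_per_track):
    """Variant D: every (track, chunk) pair is seen exactly once per epoch."""
    pairs = [(track, chunk)
             for track, n in enumerate(chunks_per_track)
             for chunk in range(n)]
    random.shuffle(pairs)
    yield from pairs

def epoch_variant_a(chunks_per_track, n_draws):
    """Variant A: tracks and chunk positions both drawn with replacement."""
    for _ in range(n_draws):
        track = random.randrange(len(chunks_per_track))
        yield track, random.randrange(chunks_per_track[track])
```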
Maybe I should refer here to my notebook with various examples.
In my point of view, it is currently still not clear what the best sampling strategy for music tracks is, as it depends on the application. So I would propose that all we should aim at is supporting a simple PyTorch implementation for the variants mentioned above.

The main problem here is that the PyTorch `Dataset` class and the sampler are solely based on indices. Implementing hierarchical sampling, e.g. tracks -> within_tracks, is not very elegant in PyTorch, since you first would need to determine the exact number of chunks (= samples) before you could start training. Many times you would end up using "fake indices" to define a dataset. For audio applications it would actually make more sense if we had a generator/consumer based dataset API.
@keunwoochoi @f0k what is your take on this problem? Should we provide an efficient helper function for track/excerpt sampling?
One could probably rely on pescador under the hood (or simply extend pescador with this kind of functionality).
@hbredin I used pescador in many of my experiments, but you lose performance due to the zmq-based parallelization. I didn't systematically benchmark any of this, though...
Definitely a good thing to have. What'd be the API for the function though? Have a generator or iterator that produces.. file paths maybe?
Definitely a good thing to have. What'd be the API for the function though? Have a generator or iterator that produces.. file paths maybe?
Since the sampling is within a file, it would be more or less seeking positions...
Actually, an `IterableDataset` is currently being developed for PyTorch. For the mentioned hierarchical sampling strategies, it probably would make sense to wait for this to be merged and added to PyTorch before we add our own code.
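As a rough sketch (not a finished design) of what hierarchical sampling could look like once `IterableDataset` lands: tracks are visited without replacement, positions within a track with replacement; `soundfile` is assumed for seek-based reading, and worker sharding is omitted.

```python
import random
import soundfile as sf
import torch
from torch.utils.data import IterableDataset

class RandomExcerptDataset(IterableDataset):
    """Yields random fixed-length excerpts from a list of audio files."""

    def __init__(self, filenames, excerpt_len, excerpts_per_track=4):
        self.filenames = filenames
        self.excerpt_len = excerpt_len
        self.excerpts_per_track = excerpts_per_track

    def __iter__(self):
        order = list(range(len(self.filenames)))
        random.shuffle(order)  # tracks without replacement
        for idx in order:
            n_frames = sf.info(self.filenames[idx]).frames
            for _ in range(self.excerpts_per_track):
                start = random.randrange(max(1, n_frames - self.excerpt_len))
                audio, _ = sf.read(self.filenames[idx], start=start,
                                   frames=self.excerpt_len, dtype='float32')
                yield torch.from_numpy(audio)
```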
My usual take on this is similar: I first collect the labels and the lengths of all audio files, then I extract random excerpts, uniformly sampling first the files (without replacement), then a position within each file (with replacement).
Gathering those random batches is usually the bottleneck
To be fast enough, I store files in .wav format and open them as memory-mapped files (only when accessing them; I cannot keep them open because then I run out of file pointers for the process). This way there is no overhead for decoding or seeking. If I have a lot of RAM, they will end up in the cache over time, which is about as fast as loading them explicitly into memory, but allows multiple processes to train on the same dataset at the same time. If I don't have a lot of RAM, they'll be loaded from disk, and I place the files on one or two SSDs. To avoid the main thread being bothered with I/O, I defer loading into a background thread with a simple wrapper of the generator. When I have a lot of CPU-based augmentation, I use multiple background threads.
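A minimal sketch of this memory-mapped access, assuming uncompressed PCM `.wav` files (`scipy.io.wavfile.read` supports `mmap=True`); this is only an illustration, not the exact implementation described above:

```python
import numpy as np
from scipy.io import wavfile

def random_excerpt(filename, excerpt_len):
    """Grab a random excerpt from a .wav file via a memory map; only the
    touched pages are read from disk (or served from the page cache)."""
    rate, data = wavfile.read(filename, mmap=True)  # data is a numpy.memmap
    start = np.random.randint(0, max(1, len(data) - excerpt_len))
    excerpt = np.array(data[start:start + excerpt_len])  # copy into a real array
    return rate, excerpt  # the memmap (and its file handle) can now be released
```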
The main problem here is that the PyTorch `Dataset` class and the sampler are solely based on indices.
Yes, that wouldn't make sense. If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in `__getitem__`, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices :D
I count how many chunks I have per audio file
In my "it took numpy half a minute to shuffle" case, the number of chunks per file was the number of spectrogram frames minus the excerpt length in frames plus 1. In my current implementations, the number of chunks per file is the number of audio samples minus the excerpt length in samples + 1.
Should we provide an efficient helper function for track/excerpt sampling?
I'm unsure. I think yes, but it can also become quite a big project that might rather be deferred to another library (I should look at pescador again). I'd like to collect some common use cases / corner cases / aspects that I'd want to be either covered, or not hindered by the implementation.
I don't have a solution satisfying all of this yet, and I'm not sure whether we should build this in torchaudio, but it would be nice to have.
For the mentioned hierarchical sampling strategies, it probably would make sense to wait for this to be merged and added to PyTorch before we add our own code.
Yes, we should take care to stay compatible to PyTorch's machinery, so we can benefit from it.
@f0k thanks for your detailed insights.
My usual take on this is similar: I first collect the labels and the lengths of all audio files, then I extract random excerpts, uniformly sampling first the files (without replacement), then a position within each file (with replacement).
In this setting, how many excerpts/samples do you yield in total and do you usually evaluate this parameter?
Yes, that wouldn't make sense. If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in `__getitem__`, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices :D
Yes, we should take care to stay compatible to PyTorch's machinery, so we can benefit from it.
So just to summarize, you are doing tracks (without replacement) and excerpts (without replacement) because it is the only way to efficiently handle very large data sets, right?
I think that means, for now, it's probably best to stick with index-based sampling where the indices are only assigned to the number of tracks. The downsides are: 1. not seeing all excerpts might not be optimal for small datasets; 2. for datasets of very few and very long tracks the definition of an epoch becomes meaningless since epochs would be super short. That also affects the maximum batch size to be used.
I will implement a proposal for this soon
I'd like to collect some common use cases / corner cases / aspects that I'd want to be either covered, or not hindered by the implementation.
That's a great listing. Many of them can be handled by pescador, but yes, we should think about how to do that in torchaudio later.
In this setting, how many excerpts/samples do you yield in total and do you usually evaluate this parameter?
It's basically an infinite iterator, and I yield mini-batches until the validation error does not improve any more, checking the validation error (and possibly adapting the learning rates) every k updates (I sometimes refer to this as a mini-epoch).
So just to summarize, you are doing tracks (without replacement) and excerpts (without replacement) because it is the only way to efficiently handle very large data sets, right?
Excerpts with replacement! I don't want to memorize which excerpts per track have been seen already, so I just hope independently random positions will do.
I think that means, for now, it's probably best to stick with index-based sampling where the indices are only assigned to the number of tracks.
Yes, but ideally, it would be easy to change the sampling strategy.
for datasets of very few and very long tracks the definition of an epoch becomes meaningless
I think the definition of an epoch is always kind of meaningless when we train on excerpts. For a small dataset of not too long files, it still doesn't make sense to present all possible excerpts of all possible files as an epoch -- excerpts from the same file will be very similar, even if they don't overlap, so it's redundant to go through all of them. I'm very happy with decoupling the epoch size from the dataset size (and the batch size).
I will implement a proposal for this soon
Don't go too fast! I think what we first need is a proposal for the API. Not a complete implementation, but a definition of the functions or classes and methods, with docstrings if needed, but without bodies. At this stage it's much easier to change things around than when we already have code (that may even have to be thrown away).
Whenever I think about it, I go through the following:

* The data would be wrapped in data sources providing something like a `get(file_id, position, length)` function. The labels would go through the same kind of data source and `get()` function, because we may want to include data augmentation, and for some tasks and augmentations this cannot be done independently for the labels and audio.
* Separate from the data sources, there would be a data iterator with a `get(**kwargs)` function. The latter would generate mini-batches from the data sources, providing whatever randomness is needed in the `kwargs`.
* Data augmentation could be implemented as data sources wrapping other data sources: its `get()` implementation would call the underlying `get()` function, modify the result, and return it. Compared to other ways of chaining transformations, this allows them to modify the arguments to the `get()` call. For example, if a time stretching augmentation is asked to provide a 10-second excerpt at 150% speed, it will need to ask the underlying data source for a 15-second excerpt.
* To allow assembling batches before calling the `get()` function, the data sources should provide the shapes and dtypes of their tensors.

I'm happy to discuss advantages and shortcomings of this design, or completely different designs.
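To make the augmentation-wrapping bullet above concrete, here is a minimal sketch (the class name, `resample_fn`, and the exact `get()` signature are hypothetical):

```python
class TimeStretchSource:
    """Hypothetical wrapper data source: it modifies the arguments of the
    underlying get() call, then post-processes the returned excerpt."""

    def __init__(self, source, resample_fn):
        self.source = source            # wrapped source with get(file_id, position, length)
        self.resample_fn = resample_fn  # placeholder resampler: (audio, target_len) -> audio

    def get(self, file_id, position, length, stretch=1.0):
        # e.g. a 10-second excerpt at 150% speed needs 15 seconds of raw audio
        raw = self.source.get(file_id, position, int(round(length * stretch)))
        return self.resample_fn(raw, length)
```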
Many of them can be handled by pescador
I really need to look again why I decided not to use it!
Many of them can be handled by pescador
I really need to look again why I decided not to use it!
Okay, I read through the documentation and some of the code. pescador provides a `Streamer`, which wraps a generator function along with its args and kwargs, and then can start the generator whenever needed. Such an extra abstraction is important -- also the data iterator in my previous comment would return a generator whenever asked for it. That's what allows data iteration to be parallelized across multiple workers.
In addition, pescador provides ways to interleave different streams. This allows to implement interesting forms of hierarchical sampling. However, pescador assumes that the streams already provide matching samples of multiple data sources (e.g., inputs and targets). It uses dictionaries of ndarrays, which I've converged on as well (and then provides tools to assemble dictionaries of samples into dictionaries of batches, and to convert dictionaries into tuples to interface with APIs that require them).
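For illustration, a small sketch of how pescador's stream interleaving could be used for hierarchical sampling (the `excerpts` generator, `load_audio`, `filenames`, and the mux parameters are assumptions, not a recommendation):

```python
import numpy as np
import pescador

def excerpts(filename, excerpt_len):
    """Infinite generator of random excerpts (as dicts) from one file."""
    audio = load_audio(filename)  # placeholder loader
    while True:
        start = np.random.randint(0, max(1, len(audio) - excerpt_len))
        yield dict(x=audio[start:start + excerpt_len])

streams = [pescador.Streamer(excerpts, name, 44100 * 5) for name in filenames]
# keep 16 files active at a time and interleave samples from them
mux = pescador.StochasticMux(streams, n_active=16, rate=64)
batches = pescador.buffer_stream(mux, buffer_size=32)  # dicts of batched arrays
```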
What I'm thinking about here is how to provide these matching samples of multiple data sources.
Just to be clear, I'm not looking for a more efficient simulation of the following:
The following should be covered, but is still not what I'm worried about:
What troubles me is the following:
I've been tossing some ideas around, but I'm not happy yet. What I'd want as an end user is to define an audio data source like this:
```python
audio = AudioFileSource(filenames, ...)
```
And a label source like this:
```python
class LabelsForThisTask(DataSource):
    def __init__(self, filenames, labeldir):
        # load the segments for each filename
        self.segments = ...

    def __len__(self):
        return len(self.segments)

    def get(self, idx, start, end, rate):
        return self.get_at(idx, (start + end) / 2)

    def get_at(self, idx, position):
        segments = self.segments[idx]
        segment = np.searchsorted(segments['bounds'], position)
        return segments['labels'][segment]

labels = LabelsForThisTask(filenames, labeldir)
```
Or actually, I guess this class would exist already, but in general I'd want to be able to define the labels in terms of a function that returns the label for a particular file and position or range.
And data iterators like this:
```python
train_loader = RandomExcerptLoader(dict(x=audio, y=labels), length=10)
valid_loader = ExcerptLoader(dict(x=AudioFileSource(filenames_val), y=LabelsForThisTask(filenames_val, labeldir)), min_length=2, max_length=30, max_per_file=1)
test_loader = ExcerptLoader(dict(x=Padding(AudioFileSource(filenames_test), pad=Fraction(51, 70), mode='reflect'), name=StringSource(filenames_test)), max_length=60, overlap=Fraction(51, 70))
```
And then be able to call them in the training code:
```python
batches = train_loader.feed(batchsize=batchsize, infinite=True)
batches = generate_in_background(batches)
for epoch in epochs:
    for _ in trange(epochsize):
        batch = next(batches)
        training_step(**batch)
    for batch in valid_loader.feed(batchsize=1):
        valid_step(**batch)
```
But the parts in between are not completely clear to me. My first draft for the data source base class was this:
```python
class DataSource(object):
    """
    Encapsulates a list of tensors of the same dimensionality, but possibly
    different shapes. `shape` and `dtype` can be passed to populate the
    corresponding properties, otherwise they will be inferred when first
    accessed. If `timeless` is given, the items do not have a time dimension
    that can be accessed in `get`; they will be the same for any excerpt.
    """
    def __init__(self, shape=None, dtype=None, timeless=False):
        self._shape = shape
        self._dtype = dtype
        self.timeless = timeless

    def __len__(self):
        # To be implemented in subclasses
        raise NotImplementedError()

    def shape_of(self, idx):
        # To be implemented in subclasses
        raise NotImplementedError()

    def dtype_of(self, idx):
        # To be implemented in subclasses
        raise NotImplementedError()

    def get(self, idx, start=None, stop=None, stride=None):
        # To be implemented in subclasses
        raise NotImplementedError()

    @property
    def shape(self):
        if self._shape is None and len(self) > 0:
            shape = self.shape_of(0)
            for idx in range(1, len(self)):
                shape = tuple(a if (a is not None) and (a == b) else None
                              for a, b in zip(shape, self.shape_of(idx)))
            self._shape = shape
        return self._shape

    @property
    def dtype(self):
        if self._dtype is None and len(self) > 0:
            dtype = self.dtype_of(0)
            if any(dtype != self.dtype_of(idx) for idx in range(1, len(self))):
                dtype = None
            self._dtype = dtype
        return self._dtype

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.get(key)
        elif isinstance(key, slice):
            start, stop, stride = key.indices(len(self))
            return [self.get(idx) for idx in range(start, stop, stride)]
        else:
            raise KeyError('Unsupported key %r, expected int or slice' % key)
```
* The `timeless` flag could also be a parent class. Or, to make it more generic, the data source could tell how many (or which) dimensions are meaningful to take excerpts of, so it would support global labels, sequences, images and volumes.
* Audio sources would also need a `rate` property similar to `shape` and `dtype`, and a `rate_of` method. But what would that mean for label sources, which are not discretized? Should this become a parent class again? Also, the `get` function for label sequences should accept a `rate` at which it is meant to be sampled. Audio sources could get the same keyword argument, forcing them to resample. But what is the unit of `start` and `end`? Would it be in terms of `rate`? And if `rate` is omitted, `start` and `end` are in terms of the native rate? Is `start` and `end` good, or should it be `start` and `length`, or even `start` and `count`? Who provides the `rate` for label sources? The data iterator? Or should this be fixed at construction time? Do we need it to be variable for any of the use cases?
* Randomness for augmentations could be passed by the data iterator as named keyword arguments to `get`. Data sources that request randomness with the same name would get the same values. Possibly a bit brittle.

Well, sorry for the long post, I hope to spur some discussion!
Data sources that request randomness with the same name would get the same values. Possibly a bit brittle.
Actually, if the user can override which key each data source uses, and can tell the iterator what source of randomness to use for each key, it's not so brittle any more. Each augmenting data source would use some default key (such as `pitch_shift` for a raw audio pitch shifter, or a pitch shifting mel filterbank, or a pitch shifting label transformator) that could be overridden for special purposes. I'd imagine something like this:
```python
audio = AudioFileSource(...)
audio = STFT(audio, ...)
audio = ShiftingMelFilterbank(audio, ...)
labels = ...
labels = LabelPitchShift(labels, ...)
train_loader = RandomExcerptLoader(dict(x=audio, y=labels), rngs=dict(pitch_shift=Uniform(0.7, 1.3)))
```
And the loader would do something like:
```python
def feed(self, batchsize, infinite=False, drop_remainder=True):
    batch = {k: np.empty(...) for k in self.sources}
    idxs = np.arange(len(self))
    while True:
        np.random.shuffle(idxs)
        count = 0
        for idx in idxs:
            randomness = {k: v.sample() for k, v in self.rngs.items()}
            for k, source in self.sources.items():
                batch[k][count] = source.get(idx, **randomness)
            count += 1
            if count == batchsize:
                yield batch
                count = 0
        if count < batchsize and not drop_remainder:
            yield {k: data[:count] for k, data in batch.items()}
        if not infinite:
            break
```
Aaaand some more thoughts and design questions.

* Data sources could provide a `stream()` method as well that yields a sequence of chunks of a requested length, and some `idx` to denote the recording. Maybe a generic `key` would be nicer that can also be a `str`. That would make it easier to combine data sources that do not have the same items (but some overlaps).
* The data iterator could also ask the data sources for whole batches at once, something like:

```python
def feed(self, batchsize, **kwargs):
    while True:
        randomness = {k: rng.sample(batchsize) for k, rng in self.rngs.items()}
        kwargs2 = dict(kwargs)
        kwargs2.update(randomness)
        yield {k: source.get(batchsize, **kwargs2)
               for k, source in self.sources.items()}
```
But it would also mean the randomness sources (some of them) need information on the items and their lengths, and keep state.
Great, by now I think I've lost everyone? If you don't want to read all of the above, post a sketch proposal for a data loading / iteration API and we can compare. By now I'm still hopeful there's a solution that fits all use cases, but if it gets too complicated, we'll have to sacrifice some use cases.
Speaking of use cases, another challenge I thought of for the design:

* `x`, `y` pairs with `x` shifted by `a` and `y` by `a + b`. The network would receive `x` and `b` to reproduce `y`.

pitch shifting can be done by scaling the mel/hz conversion, and time stretching by scaling the STFT stride. This will not give a very high quality (e.g., it will do bad things to transients), but I've previously worked with stretching the spectrogram using bicubic or bilinear interpolation, and that seemed good enough for the network
@f0k could you elaborate a bit more? I was using the phase_vocoder for the time stretching and was thinking of how to do the `sinc` resampling for the pitch shift... looks like you're saying it's not worth the cost in your experience?
Where do we put data augmentation? I assumed in the data source. If we put it in the network, it will need to handle and modify the labels, which would be weird, or the loss function needs to adapt the labels. For this to work, the information on what the augmentation did needs to be shared in some way, or passed through the transformation chain
Could you give an example where having the augmentation in the dataset would be better than after batching in terms of the labels? I think I've had a problem related to this: when randomly time stretching sequences the descending length order is not preserved which is needed for packing, so the information of what augmentation was applied had to flow forward...is this a valid example of what you're referring to?
looks like you're saying it's not worth the cost in your experience?
Yes, but I guess it strongly depends on your task. I was doing voice activity detection for music (see paper and code). Since the network only sees mel spectrograms, doing a high-quality pitch shifting on the time-domain signal is probably overkill (disclaimer: I didn't try).
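For illustration only (this is not the code from the paper above): pitch shifting via the mel/hz conversion can be sketched by building the mel filterbank at scaled corner frequencies while leaving the STFT itself untouched; the parameter values are arbitrary.

```python
import librosa

def shifted_mel_filterbank(sr, n_fft, n_mels, shift_semitones,
                           fmin=27.5, fmax=8000.0):
    """Place the mel filters at frequencies scaled by 2**(shift/12); applying
    this to an unmodified STFT approximates pitch-shifting the input by the
    inverse factor (up to the filterbank resolution)."""
    factor = 2.0 ** (shift_semitones / 12.0)
    return librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                               fmin=fmin * factor, fmax=fmax * factor)
```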
I think I've had a problem related to this: when randomly time stretching sequences the descending length order is not preserved which is needed for packing, so the information of what augmentation was applied had to flow forward...is this a valid example of what you're referring to?
Yes, that's what I mean. Depending on the task, data augmentation may need to affect both the input and labels. And depending on the data augmentation, you may be able to gain performance by factoring it in right from the start, when loading the excerpts for a batch from disk (that was the time stretching example). Whether it's possible to do this, and how easy it is to do this, depends on several choices:
* `nn.Module`-based networks only pass data in forward direction.

Could you give an example where having the augmentation in the dataset would be better than after batching in terms of the labels?
This was based solely on the assumption that the network was based on `nn.Module`s. They are forward-only, and all the existing layers in `nn` don't allow passing additional data along with the tensors. On the other hand, we would probably provide our own layers for all of this, so disregard that argument. But the other disadvantage of forward-only remains: we cannot easily adapt the excerpts to load based on the time stretching settings.
Does the pipeline pass inputs and labels together all the time, or are they two separate pipelines? The former makes it easier to transform both in sync, but may make it harder to write reusable code [...]. The latter makes it easier to write reusable code, [...]
Ok, let me challenge this right away -- it's not really needed to have two pipelines for this. It can also be one pipeline that passes on dictionaries of data, and has most nodes only operate on one item of the dictionary (e.g., we would have a node that pitch-shifts spectrograms and does not touch or look at the labels, and another that pitch-shifts labels). This would allow nodes that can operate on multiple items, which may be useful for some purposes.

Note that we'll always want to make it easy to define the pipeline in a way that it can also be used for testing, when labels are not available. That's why I thought it's conceptually useful to have separate pipelines for different data sources, so you'd just not instantiate the label data source at test time, and instantiate the audio data source the same way as before. And note that at the very beginning of the pipeline, we probably don't want to have a single node that inserts audio and labels into the pipeline at once, but separate nodes for separate sources (again, to make it easier to reuse them). If we have separate pipelines, they will have separate beginnings, so that's a no-brainer. If we have a joint forward-only pipeline, then we'd have some nodes in the chain add items to the dictionary rather than modifying them. And maybe a single source node that feeds in file names / URIs and excerpt positions, and other nodes that add the chosen data augmentation settings -- in a backward-forward pipeline, that's what would be provided by the request.
So we can also have a single pipeline that passes dictionaries, instead of multiple pipelines that pass tensors. Still, a backward-forward pipeline has the advantage that nodes can modify the request as it trickles down, and modify the data as it bubbles up.
/edit: Please continue questioning! This helps getting a clearer picture.
Recently librosa added a `Stream` generator: https://github.com/librosa/librosa/pull/872
@f0k we have finally open-sourced our package, which you can find here: https://github.com/audeering/audtorch
First a few words regarding its design principles before I try to address your points:
When we started designing it, PyTorch was still in its 0.x version, so there was no native STFT.
We use a similar structure as torchvision. This means defining a `Dataset` object (you can find a collection of open source data sets already there) which defines a `transform` which operates on the data and a `target_transform` which operates on the labels.
Data augmentation and feature extraction currently take place on the CPU, since we are using `numpy` and `librosa` as our backend. We are planning to include multiple backends, with `torch` as our number one priority, so this step can be implemented on the GPU as well. However, I can imagine that in the end we need to support both, as some transforms will probably be more efficient on the CPU (with multi-threading).
We decided not to go for `dict` returns to avoid custom collate functions in the default case, but already support some collate functions for common use cases and plan to add more later.
Our transforms can be defined as a linear processing chain, with each of them operating on the output of the previous one. This currently involves some manual hacking when you need to access a parameter of a transform further down the processing chain. An example that comes to mind is the phase of our `Spectrogram` transform, which we save as an object attribute so it can be accessed later, e.g. for signal reconstruction (https://github.com/audeering/audtorch/blob/346cf73345b1427a890e33d2ebcbb7b4f30874b9/audtorch/transforms/transforms.py#L804).
And now on to your points:
Our pipeline currently works satisfactorily for data sets with a large number of relatively short files (e.g. `VoxCeleb1`). We are working on a solution that supports loading random excerpts from a collection of large files. Unfortunately, we do not yet have a proper solution for that.
As I mentioned in #40, our pipeline does not work well with resampling. It would be interesting to try your suggested approach there, both in terms of speed and in terms of network performance. I will keep you posted once I do that.
We currently use PyTorch's default `DataLoader` for batching. This works quite well in my opinion (at least for the use case of many short files). Is there any reason why you would want to change this?
Regarding mini-batches from multiple sources, I assume that it would be better to concatenate `Datasets` instead of having the `DataLoader` do that. I also assume that you are talking about the use case where all `Datasets` would return the same features/labels. Is that correct?
@ATriantafyllopoulos that looks really great! Congrats! I will have a deeper look later.
We are working on a solution that supports loading random excerpts from a collection of large files. Unfortunately, we do not yet have a proper solution for that.
I just had a quick look at the `BucketSampler`, so this does not address excerpt sampling/chunking of long audio/spectrograms from a single dataset, right? Do you have an opinion on how to implement this in your framework?

As long as the iterable dataset isn't added, one could a) use a dataset with fake indices or b) put this logic in the sampler.
I just had a quick look at the `BucketSampler`, so this does not address excerpt sampling/chunking of long audio/spectrograms from a single dataset, right? Do you have an opinion on how to implement this in your framework?
@faroit no, this provides a different functionality. It is used to split samples of one data set into buckets and then sample from the buckets in a specific way.
We are currently experimenting with chunking of long audio. There are two alternative solutions:
a) Save an exhaustive list of offsets/durations for each file. Then use that to index the underlying `Dataset` and load the appropriate chunk using something like `audiofile.read(..., offset, duration)`. This has the benefit that PyTorch will take care of iterating through the entire data set, but its drawback is that I am not sure how fast loading a chunk of the file from disk is. It might be that it works well with wavs, but not other formats, because of decoding. Fast audio loading is an issue of its own right anyway (#31).
b) Cache a subset of the files in memory and do a potentially non-exhaustive loop over them by loading chunks of a specific length. For this we have run some tests and it is quite fast, but you run into all sorts of problems with sampling bias. For example, it could be that our cache creates an imbalanced class distribution that hinders network training, or that because of the chunking lots of snippets from the same file end up in the same batch, thus leading to over-fitting and the like.
So I am not sure if we should provide both solutions or just one, and I am not sure what would be the best way to choose. Perhaps if we have a well-defined problem and data set, and we have an architecture that is easy to train and provably works (e.g. there is a good paper about it), we could move ahead with benchmarking both solutions.
Any suggestions?
So I am not sure if we should provide both solutions or just one, and I am not sure what would be the best way to choose. Perhaps if we have a well-defined problem and data set, and we have an architecture that is easy to train and provably works (e.g. there is a good paper about it), we could move ahead with benchmarking both solutions.
I would be happy to share our source separation repo soon, which could serve as a good benchmark for this. I actually implemented a) using torchaudio and pysoundfile as backends, which both support seeking. Results were good for wav and flac, but I couldn't saturate the GPU for mp3s... will share that soon.
One downside is that sample-accurate seeking is not possible with mp3 and mp4 without loading the full audio first (slow). Therefore, seeking is usually implemented using durations in seconds, which could result in not exactly all samples being seen once per epoch; however, that is probably not an issue in practice.
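A small sketch of such seek-based reading with `pysoundfile` (not the actual implementation mentioned above; the random chunk selection and backend choice are assumptions):

```python
import random
import soundfile as sf

def read_random_chunk(filename, duration):
    """Read a random chunk of `duration` seconds by seeking, without decoding
    the whole file. Works well for wav/flac; mp3/mp4 seeking is not sample-accurate."""
    info = sf.info(filename)
    frames = int(duration * info.samplerate)
    start = random.randint(0, max(0, info.frames - frames))
    chunk, _ = sf.read(filename, start=start, frames=frames,
                       dtype='float32', always_2d=True)
    return chunk.T  # shape: (channels, samples)
```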
Again, if we made audio decoding as fast as possible, this would make it way easier to do performant chunking. So IMO we really should address #31 first.
we have finally open-sourced our package
Cool!
We currently use PyTorch's default `DataLoader` for batching. This works quite well in my opinion (at least for the use case of many short files). Is there any reason why you would want to change this?
The problem is that it only has a single integer index for identifying a data point. If we want to identify data points also by the position, we would need to define a bijection from (file, position) to int. If we also want to identify data points by data augmentation parameters, we're lost.
Regarding mini-batches from multiple sources, I assume that it would be better to concatenate `Datasets` instead of having the `DataLoader` do that. I also assume that you are talking about the use case where all `Datasets` would return the same features/labels. Is that correct?
No, when I was talking about sources, I meant a vertical split of the dataset, not a horizontal one -- one source could be the spectrograms, another could be the labels, another could be self-similarity lag matrices, but usually each source would contain the same data points. My reasoning for splitting things like this is in the first three bullet points in https://github.com/keunwoochoi/torchaudio-contrib/issues/29#issuecomment-477643358. And when we split the implementation like this, one option to join the sources would be in the data loader.
Our supports can be defined as a linear processing chain with each of them operating on the output of the previous one. This currently involves some manual hacking when you need to access a parameter of the transform further down the processing chain.
This would be solved more elegantly by passing on dictionaries which any node can modify. This would allow nodes to pass on multiple items of data.
We use a similar structure as torchvision. This means defining a `Dataset` object (you can find a collection of open source data sets already there) which defines a `transform` which operates on the data and a `target_transform` which operates on the labels.
Just schematically, how do you implement a dataset with a pitch shifting augmentation affecting both the audio and the labels?
The problem is that it only has a single integer index for identifying a data point. If we want to identify data points also by the position, we would need to define a bijection from (file, position) to int. If we also want to identify data points by data augmentation parameters, we're lost.
The way we chose to handle this for now is to assume that it is still the `Dataset`'s job to take care of all that under the hood, and simply provide an interface to the `DataLoader` for iterating through the data.
Regarding chunking, for example, a naive approach would be that the `Dataset` creates a list of all possible chunks, and then the `DataLoader` would simply shuffle through all of them. This is not ideal because of the overhead of creating and managing that list.
But still, conceptually I think it is better that the `Dataset` takes care of creating some list (or similar) that the `DataLoader` then shuffles through, and that the `DataLoader` takes care of everything from random shuffling to weighted sampling with and w/o replacement, etc.
No, when I was talking about sources, I meant a vertical split of the dataset, not a horizontal one -- one source could be the spectrograms, another could be the labels, another could be self-similarity lag matrices, but usually each source would contain the same data points. My reasoning for splitting things like this is in the first three bullet points in #29 (comment). And when we split the implementation like this, one option to join the sources would be in the data loader.
I'm still struggling with what you mean here. Let me see if I get this straight:
Each task may require different features. E.g. in one task you might need to work with a spectrogram input and in another you may need to work with a raw audio input. You would like to have a common interface that's independent of the kind of features that are used. Is that correct?
Sometimes, you might need to pre-compute these features for efficiency, so the framework must be able to work with loading features from memory instead of only working with raw audio. Do you think that data augmentation should be part of the pre-processing? Is this what you mean by "indexing by data augmentation" like you said above?
Combining multiple sources: would this be something like: "I need both the spectrogram and the mfccs as input for my model, so get the precomputed MFCCs from memory for efficiency but compute the spectrogram on the fly due to memory constraints and make sure that the MFCCs and the spectrogram correspond to the correct file and possible chunk of a file"?
You would also want potentially different label sources (e.g. one label per file, or labels on different time-scales), and also a generic way to modify and combine those labels. You also expect that sometimes the labels might be derived from the audio itself and thus would need to undergo the same transformation (e.g. in the simple case where you add noise to audio as input and use the original audio as target so that you can train some denoising architecture).
This would be solved more elegantly by passing on dictionaries which any node can modify. This would allow nodes to pass on multiple items of data.
True. In our initial implementation we decided against using `dict` returns because that would require a specific format for labelling the data set's returned items, which we thought a) was too restrictive and b) would only work within the context of our package. But maybe there is indeed good reason to do that.
Just schematically, how do you implement a dataset with a pitch shifting augmentation affecting both the audio and the labels?
Currently, by creating a wrapper data set that accesses the original data set under the hood and takes care of applying the same transformation to the audio and the labels. But I see your point, this would get really complicated if you needed to link multiple transforms that would operate on either the audio alone or on the audio and the labels simultaneously.
This is why I am leaning towards switching back to a `dict` output as defined above. Then every transform would get a chance to operate on both the audio and the labels, or the audio only, or the labels only, etc. Is this what you have in mind?
Regarding chunking, for example, a naive approach would be that the `Dataset` creates a list of all possible chunks, and then the `DataLoader` would simply shuffle through all of them. This is not ideal because of the overhead of creating and managing that list.
Indeed, see the third paragraph in https://github.com/keunwoochoi/torchaudio-contrib/issues/29#issuecomment-475692083:
If we define each possible excerpt of each possible file to have its own index, we (a) need to map back from indexes to files in `__getitem__`, and (b) will bias sampling towards longer files. Also this may make sampling without replacement very slow. Many years ago I had an implementation like that, and it took numpy half a minute to shuffle the indices :D
Note that (b) can be avoided by using weighted sampling. But a more efficient sampling scheme is to pick the file at random without replacement, then pick a position within the file at random with replacement. To enable this, the dataset will need to have some kind of multi-dimensional indexing, with one index specifying the file and another specifying the position within the file. And then the data loader will need to know which indices are valid (e.g., the number of files and the length of each file), or the data set will need to perform some mapping (e.g., map values from 0.0 to 1.0 to a position within the file).
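A rough sketch of such multi-dimensional indexing with a map-style `Dataset` and a custom `Sampler` (the `load_excerpt` helper is a placeholder; this relies on whatever the sampler yields being passed straight through to `__getitem__`):

```python
import random
from torch.utils.data import Dataset, Sampler, DataLoader

class ExcerptDataset(Dataset):
    """Indexed by a (file_idx, fraction) pair instead of a single integer."""

    def __init__(self, filenames, lengths, excerpt_len):
        self.filenames = filenames
        self.lengths = lengths          # length of each file in samples
        self.excerpt_len = excerpt_len

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, key):
        file_idx, fraction = key        # multi-dimensional index
        start = int(fraction * (self.lengths[file_idx] - self.excerpt_len))
        return load_excerpt(self.filenames[file_idx], start, self.excerpt_len)

class FileThenPositionSampler(Sampler):
    """Files without replacement, positions (as 0..1 fractions) with replacement."""

    def __init__(self, n_files):
        self.n_files = n_files

    def __len__(self):
        return self.n_files

    def __iter__(self):
        order = list(range(self.n_files))
        random.shuffle(order)
        return iter((i, random.random()) for i in order)

# loader = DataLoader(dataset, batch_size=16,
#                     sampler=FileThenPositionSampler(len(dataset)))
```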
I'm still struggling with what you mean here. Let me see if I get this straight: [...]
Yes, this is correct. I'm not caring a lot about precomputing features or not, but about modular code. For that reason alone, I would want to separate the implementations for different sources (or modalities, or whatever you want to call them). Look at the `Rescale` class in https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#transforms for a counter-example: it's scaling the image along with the landmark labels. I cannot reuse it to only scale an image, or process an image along with a class label; I would need to copy/paste and adapt it. I think this is a poor design.
In our initial implementation we decided against using dict returns because that would require a specific format for labelling the data set's returned items [...]
The user could tell the nodes at construction time what dict keys they should read and write to. Or did you mean something else?
This is why I am leaning towards switching back to a dict output as defined above. Then every transform would get a chance to operate on both the audio and the labels, or the audio only, or the labels only, etc. Is this what you have in mind?
This is one of the possible solutions I had in mind. There are many different options, that's why the thread has become impossible to read through. Note that having a single node apply a transformation to both the audio and the labels would still be along the lines of the counter-example I linked above. Instead I'd want a node that transforms the audio, and another one that transforms the labels, so depending on my needs, I can use either or both or none of them in my pipeline. And once we agree that this would be a good structure, we need to figure out where the transformation parameters come from, because they need to be the same for the two nodes.
What I thought out loud above were basically two independent choices for this:
Do you think that data augmentation should be part of the pre-processing? Is this what you mean by "indexing by data augmentation" like you said above?
No, usually it won't be part of the pre-processing -- what I meant was that if the data loader is already responsible for picking the random indices denoting the files or the positions within files, it could also become responsible for picking the amount of pitch shifting and time stretching. Then the dataset (including the transformation pipeline) would be completely deterministic.
To enable this, the dataset will need to have some kind of multi-dimensional indexing, with one index specifying the file and another specifying the position within the file.
This is the solution I am also in favor of.
Yes, this is correct. I'm not caring a lot about precomputing features or not, but about modular code.
Agreed, modularity is what we were aiming for when we designed our transforms. Which is why I think transforms on different modalities or streams should remain independent.
The user could tell the nodes at construction time what dict keys they should read and write to. Or did you mean something else?
Well, maybe this is where I went too far with code modularity. We designed our transforms to depend on the input being what it is supposed to be (e.g. spectrogram expects an array that contains the raw audio you want to get the spectrogram from). I would still like to keep that in case someone would want to use the package without all its components. A workaround would be to provide optional `dict` arguments. If one is specified, then the transform would assume that the input is a dict, and would operate only on the appropriate parts.

Instead I'd want a node that transforms the audio, and another one that transforms the labels, so depending on my needs, I can use either or both or none of them in my pipeline. And once we agree that this would be a good structure, we need to figure out where the transformation parameters come from, because they need to be the same for the two nodes.
We also chose to process each stream independently. For the cases when we needed to have a transform that operates on both the features and the labels, we did the following: (I am referring to this piece of code)
1) In this instance the labels are generated from the input audio, but they might as well come from a different source, so this step is not so important.
2) We apply some preprocessing to the input before generating the labels.
3) We then need a transform that is jointly applied to the features and the labels. This transform might have some fixed parameters (e.g. window size of the FFT) which are specified as arguments in the instantiation of the transform class. It might also have a "randomness" feature associated with it (e.g. random cropping).
We chose to use the same instance of the transform class to operate both on the features and the labels. In this case, its fixed components are by definition the same. The only problem arises with its "randomness". We chose to make the parameters that control it class attributes, and use a `fix_randomization` attribute that turns the random generation of these parameters on and off.
4) Finally, we have separate transforms for the features and the labels from that point on.
The problem is that steps 3) and 4) might have to be repeated ad infinitum. We might need to switch back and forth between transforms that operate only on the features and transforms that operate on both features and labels. What about the following as a solution:

a) We change our `Transform` classes to optionally operate on dictionary inputs (just so we can make them easy to use in the simpler cases and not break our backwards compatibility).
b) Then we define a processing chain (essentially a `Compose` transform).
c) In that processing chain, we have three kinds of transforms:
i) The ones that operate on the features only. These would receive an argument such as `features`.
ii) The ones that operate on labels only, which would get a `labels` argument.
iii) The ones that operate on both features and labels. These would get a list of args `['features', 'labels']`. Then they would be applied to both streams with exactly the same parameters (the fixed ones are defined during instantiation and the randomized ones are drawn once at function call); a sketch follows below.
Does this sound reasonable?
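A minimal sketch of what c-iii) could look like (this is not audtorch's actual API; the key names and the crop operation are just examples):

```python
import random

class RandomCropBoth:
    """One transform instance applied to several dict entries with exactly
    the same randomized parameters."""

    def __init__(self, size, keys=('features', 'labels')):
        self.size = size
        self.keys = keys

    def __call__(self, sample):
        # draw the random parameter once per call ...
        start = random.randint(0, sample[self.keys[0]].shape[-1] - self.size)
        # ... and apply it to every requested stream
        for k in self.keys:
            sample[k] = sample[k][..., start:start + self.size]
        return sample
```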