pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

How to create minibatches with independent samples #136

Closed: faroit closed this issue 5 years ago

faroit commented 6 years ago

TL;DR: I am looking for a way to mux streams and aggregate them in batches while making sure that each batch contains only 1 sample from each stream (if the batchsize is smaller than the number of streams).

For quite some time now, I have been wondering about an aspect of data sampling in the context of training deep models on hierarchically structured data sets. Let's return (#133) to the infamous scenario of music tracks where training is done on smaller excerpts instead of the whole track, so that non-dynamic models such as CNNs can be used.

So let's assume we have 10 tracks and each track yields 100 excerpts (with or without overlap); this would result in 1000 samples. Now we have a few ways to sample for one training epoch. One way is to sample randomly, selecting batchsize samples for each batch, and repeat the procedure (without replacement) for a fixed number of iterations in each epoch. While this is fast and easy to implement – even without pescador – it comes with two downsides:

1) it is likely that we are not going to see all data in one epoch.

This can be compensated for by increasing the number of iterations or epochs. Also, there is a proof that with-replacement sampling performs worse than without-replacement sampling, but this is only valid for convex problems. I forgot where I read it, but for DNNs with-replacement sampling could actually result in improved generalisation, which is probably why it turned out to be successfully deployed in audio, such as in the singing voice detection paper from @f0k.

2) we end up having multiple excerpts from the same track in one batch

I currently don't know if this has an impact on performance, but there is some work that deals with this kind of problem for small batch sizes and/or non-i.i.d. samples. I also know that some methods in source separation use this kind of sampling (e.g. the ICASSP 2015 work from @TE-StefanUhlich). Therefore I want to run a small comparison, and I hoped I could implement all flavors using pescador.

So, with pescador we can easily make sure that all samples are seen once using StochasticMux. However, if we use buffer_stream, we cannot easily make sure that all samples in a batch come from different tracks. A workaround would be to use RoundRobinMux if the batch size is smaller than the number of tracks, and shuffle the streamers every epoch:

import random
import pescador

for epoch in range(n_epochs):
    random.shuffle(track_streams)  # re-randomize the round-robin order each epoch
    mux = pescador.RoundRobinMux(track_streams, mode='exhaustive')
    batches = pescador.buffer_stream(mux, batch_size)  # batch_size samples per batch
    for batch in batches:
        ...

However, then we would end up with repetition (e.g. for three tracks and batch size 2: [[a, b], [c, a], [b, c], [a, b]]).

I would be glad if someone could show me how to implement this in pescador, or could point me to some theory papers where this is discussed. Also, maybe this would be something to consider for the pescador paper you are planning to write.

bmcfee commented 6 years ago

TL;DR: I am looking for a way to mux streams and aggregate them in batches while making sure that each batch contains only 1 sample from each stream (if the batchsize is smaller than the number of streams).

If batch size == k * # streams (for integer k), then RoundRobin will do the job. Indeed, that's what we put it in for. :)
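
For instance, a minimal sketch of the simplest case (reusing the hypothetical track_streams list from the snippet above; not from the thread): with the buffer size equal to the number of streamers, every batch contains exactly one sample from each stream.

mux = pescador.RoundRobinMux(track_streams, mode='cycle')
# buffer size == number of streams -> one sample per stream in each batch
batches = pescador.buffer_stream(mux, len(track_streams))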

If batch size < # streams, then RoundRobin will guarantee at most 1 sample per stream per batch, but as you observe, you'll get interactions between batches that depend on the ratio between the two numbers and the order of the streamers within the mux.

The epoch-shuffling idea you posted is interesting, and similar to something the RR already does. A slightly more pescadorian approach to this would be to limit each streamer to produce a small number of samples (== the number before the order will change) and self-terminate, then use the mode='permuted_cycle' option of RoundRobinMux to re-order the streamers before reactivating. This way you don't have to wait for a full epoch before the order changes.

If each streamer generates exactly 1 random example, this should give you exactly the desired behavior of fully randomized order with balanced presentation. However, you'll have to pay the overhead of reactivating the streamers, so it might not be worth it. My untested gut feeling is to set the per-streamer limit such that the permutation order changes several times per epoch. That "several times" may depend on how many total streamers you have: if it's small, then epoch/n_streams! might be feasible; if it's large, then it probably doesn't matter too much because you'll never observe every order anyway.
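
A rough sketch of this idea (not from the thread; the generator, the placeholder track arrays, and the parameter values are illustrative): each per-track streamer yields a single random excerpt and self-terminates, and the mux re-permutes the streamer order on every pass.

import numpy as np
import pescador

def one_random_excerpt(track, excerpt_len=100):
    # yield exactly one uniformly sampled excerpt, then self-terminate,
    # so the mux can re-permute the streamer order before reactivating
    start = np.random.randint(0, len(track) - excerpt_len + 1)
    yield dict(X=track[start:start + excerpt_len])

tracks = [np.random.randn(10000) for _ in range(10)]  # placeholder track data
streamers = [pescador.Streamer(one_random_excerpt, track) for track in tracks]

# 'permuted_cycle' reactivates exhausted streamers and shuffles their order
# on each pass, so the within-batch combinations keep changing
mux = pescador.RoundRobinMux(streamers, mode='permuted_cycle')
batches = pescador.buffer_stream(mux, 4)  # 4 samples per batch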

bmcfee commented 6 years ago

Also, there is a proof that with-replacement sampling performs worse than without-replacement sampling, but this is only valid for convex problems.

Side note: are you referring to section 4.2 of that paper? Or a different part? I'm not sure it applies in either case, but maybe I missed something in my cursory skim.

faroit commented 6 years ago

Side note: are you referring to section 4.2 of that paper? Or a different part? I'm not sure it applies in either case, but maybe I missed something in my cursory skim.

I got it from the conclusion 😆 I'll be honest here, I don't understand any of the proof and I will probably not have the time to look into it. But normally when this paper is cited, they will tell you that this only applies to convex problems.

faroit commented 6 years ago

The epoch-shuffling idea you posted is interesting, and similar to something the RR already does. A slightly more pescadorian approach to this would be to limit each streamer to produce a small number of samples (== the number before the order will change) and self-terminate, then use the mode='permuted_cycle' option of RoundRobinMux to re-order the streamers before reactivating. This way you don't have to wait for a full epoch before the order changes.

Excellent, thanks. This is the hack I was looking for. If this is of broader interest, I would propose adding an option to RR to make this easy to use.

faroit commented 6 years ago

@bmcfee if you're interested in how the RoundRobin flavors perform for source separation compared to StochasticMux, I can post the results here. Otherwise, feel free to close this out.

bmcfee commented 6 years ago

If this is of broader interest, I would propose adding an option to RR to make this easy to use.

I think it is of interest, but I'm not sure adding it to RR is the best way to approach it. I think we all want to keep the logic as simple as possible, and adding top-down limiting control to RR (like in poissonmux) sounds like a can of worms to me.

However, if you find that it does the job for you, it'd be great to add it as a use-case/example in the documentation.

f0k commented 6 years ago

if you're interested in how the RoundRobin flavors perform for source separation compared to StochasticMux, I can post the results here

@faroit: I'd be interested in how much of a difference the sampling strategy makes. I used to sample excerpts without replacement (https://github.com/f0k/ismir2015/blob/master/experiments/augment.py#L18-L51), but now I often sample recordings without replacement (and then uniformly sample an excerpt within each chosen recording). I wasn't too worried that a batch could contain multiple excerpts of the same recording, but rather a) that for datasets of wildly varying recording lengths, longer recordings would be favored over shorter ones, and b) that for large datasets, keeping track of all possible recording/position tuples adds a lot of overhead.

bmcfee commented 6 years ago

now often sample recordings without replacement (and then uniformly sample an excerpt within each chosen recording).

This is typically how I do it as well, using a pumpp sampler for each track, and then a StochasticMux over tracks.

a) that for datasets of wildly varying recording lengths, longer recordings would be favored over shorter ones,

This is true, but you can use either rate-limiting (in the mux) or self-limiting (in the streamer's generator) to avoid this. If you use single_active mode in the mux, then you still get without-replacement behavior in the short term, but the individual streamer can be reactivated at any time later. This should prevent long tracks from being over-represented, but still allow you to cover the data well.
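
A rough sketch of this pattern, reusing a self-limiting one-excerpt-per-activation generator like the one sketched earlier (n_active, rate, and the batch size are placeholder values, not recommendations):

# streamers: per-track Streamers that yield one random excerpt per
# activation and then terminate (as in the earlier sketch)
mux = pescador.StochasticMux(streamers, n_active=16, rate=None, mode='single_active')
batches = pescador.buffer_stream(mux, 8)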

b) that for large datasets, keeping track of all possible recording/position tuples adds a lot of overhead.

This is true. In general, we don't have a great mechanism for persisting state across activations of a streamer. This is somewhat by design: as you mention, it could incur substantial memory overhead. So far, it hasn't really been a problem, but I'm open to suggestions about how to generically support such a thing if it does become desirable.

faroit commented 6 years ago

@f0k @bmcfee just a short status update: I didn't forget this issue. I still think this discussion could be really valuable. Also – specifically for source separation networks – we will present this as part of the ISMIR source separation tutorial. I will share the code and results here once we have fixed the remaining parameters.

sleglaive commented 5 years ago

Thanks all for this discussion and the related one #133.

@faroit did you end up with a solution for spectrogram sampling with pescador as part of the ISMIR source separation tutorial?

faroit commented 5 years ago

@sleglaive yes and no. I didn't use a PyTorch DataLoader, so I didn't get the speedup/caching benefit. Instead, I fully relied on pescador's batch sampler (which is significantly slower than the DataLoader parallelization, even with zmq enabled); once the speed of pescador is improved, this could make the sampling faster. Our code has a few tricks and will be released very soon.

sleglaive commented 5 years ago

Ok, thanks a lot for the update!

bmcfee commented 5 years ago

@faroit it's not surprising that pesc's latency is a bit worse than pytorch's data loader, as it's pure python and much more flexible. If you're feeling generous with time, it would be great if you could profile your implementation and let us know where the hotspots are -- there might be some easy optimizations that we could implement, but it's hard to know up front without specific case studies.

bmcfee commented 5 years ago

Pinging back on this thread -- @faroit did you ever do the comparison of round-robin and shuffle mux?

faroit commented 5 years ago

did you ever do the comparison of round-robin and shuffle mux?

We did. In the plot above, the track sampling without replacement was done with round-robin. For our music separation system (see slides), the difference (in validation loss) turned out to be negligible. In the meantime, we upgraded the code to stick with the default PyTorch DataLoaders for performance reasons. I will have a look at pescador again as mentioned in #144.

Feel free to close this issue – maybe add a one-liner to the docs to advertise round-robin a bit more?

bmcfee commented 5 years ago

Great, thanks! I'm not 100% sure I understand the axes of this plot though, can you break it down a bit more? Is the message that replacement for tracks doesn't matter, but it does for samples drawn per batch from the track set?

Side note: I'm currently doing a bit of issue gardening, and trying to clean up some rough edges before we start hammering on speed.

faroit commented 5 years ago

Great, thanks! I'm not 100% sure I understand the axes of this plot though, can you break it down a bit more? Is the message that replacement for tracks doesn't matter, but it does for samples drawn per batch from the track set?

Sorry, of course: the message is that taking samples (here samples = chunks/excerpts within a track) without replacement improves performance, as it guarantees that the model sees all parts of each track (when there is no overlap) exactly once, thus utilizing all the data. This concerns the epoch, not the batch! In contrast, tracks with/without replacement concerns the batch, as mentioned above. Does this make sense?
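
For illustration only (a hypothetical generator, not the tutorial code), excerpt sampling without replacement can be expressed as a per-track streamer that yields every non-overlapping chunk exactly once per activation, in random order:

import numpy as np

def all_excerpts_shuffled(track, excerpt_len=100):
    # yield every non-overlapping excerpt of the track exactly once,
    # in random order, i.e. excerpt sampling without replacement
    n_excerpts = len(track) // excerpt_len
    for idx in np.random.permutation(n_excerpts):
        yield dict(X=track[idx * excerpt_len:(idx + 1) * excerpt_len])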

bmcfee commented 5 years ago

Ah, thanks for the clarification! So the idea is that it doesn't matter so much how you pick the tracks to include in a batch, but sampling of excerpts from a track should be exhaustive? Did excerpts overlap in your experiment, or were they non-overlapping? (It doesn't really matter, but I'm curious.)

faroit commented 5 years ago

Did excerpts overlap in your experiment, or were they non-overlapping? (It doesn't really matter, but I'm curious.)

I think I always used non-overlapping segments. Here is the code I used back then to configure all options (rnd_track, rnd_ex (excerpt) and deterministic behavior for validation)

bmcfee commented 5 years ago

Great, thanks! I suspect there's something really interesting going on here relating to the conditional independence assumptions of excerpts within a track (definitely dependent) vs across tracks (hopefully independent).