justinsalamon / scaper

A library for soundscape synthesis and augmentation
BSD 3-Clause "New" or "Revised" License

High-level soundscape generators #35

Open justinsalamon opened 6 years ago

justinsalamon commented 6 years ago

Right now each event has to be explicitly added to the event specification (e.g. via for loop). It would be helpful to have high-level generators such that you'd only have to specify something along the lines of "generate a soundscape where the number of events is sampled from distribution X obeying temporal distribution Y with constraints Z".

This, in addition to simplifying some use cases, would allow supporting non-IID event distributions, e.g. Hawkes (self-exciting) processes as suggested by @lostanlen.

Related: cf. high-level knobs provided in SimScene (e.g. figure 1)

lostanlen commented 6 years ago

Thanks for opening this. There is a new Python library on point processes named tick which is quickly gaining traction, as it is fast, flexible and offers a sklearn-like API for parametric and nonparametric estimation of Hawkes processes. https://github.com/X-DataInitiative/tick tagging main contributors @dekken @mbompr

This paper by Dan Stowell models inter-individual interactions between vocalizing birds (in cages) by means of a nonlinear GLMpp (generalized linear model point process). Apparently it does not fit BirdVox-full-night (migrating birds in flight) very well, though. http://rsif.royalsocietypublishing.org/content/13/119/20160296

justinsalamon commented 6 years ago

Thanks for the suggestion @lostanlen, this looks like a good option for simulating Poisson and Hawkes processes (for example) for the purpose of distributing sound events in time.
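By way of illustration, here's a minimal sketch of drawing event onsets from a self-exciting process with tick. The SimuHawkesExpKernels call follows tick's documented API, but all parameter values are arbitrary, and feeding the resulting onsets to sc.add_event is left to the caller:

import numpy as np
from tick.hawkes import SimuHawkesExpKernels

duration = 10.0  # soundscape duration in seconds

# Univariate Hawkes process: baseline rate of 0.5 events/s; each event
# bumps the intensity (self-excitation 0.6, exponential decay 2.0)
hawkes = SimuHawkesExpKernels(adjacency=np.array([[0.6]]),
                              decays=2.0,
                              baseline=np.array([0.5]),
                              end_time=duration,
                              verbose=False)
hawkes.simulate()
event_onsets = hawkes.timestamps[0]  # candidate start times for sc.add_event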

justinsalamon commented 6 years ago

To start things off I'd like to first figure out what a high-level generator API should look like, starting with desired functionality and features.

To illustrate, right now events have to be added to the event spec one by one, along the lines of (excerpt from README example):

import numpy as np
import scaper

# Generate 1000 soundscapes using a truncated normal distribution of start times
for n in range(n_soundscapes):    
    # create a scaper
    sc = scaper.Scaper(duration, fg_folder, bg_folder)
    sc.protected_labels = []
    sc.ref_db = ref_db

    # add background
    sc.add_background(label=('const', 'noise'), 
                      source_file=('choose', []), 
                      source_time=('const', 0))

    # add random number of foreground events
    n_events = np.random.randint(min_events, max_events+1)
    for _ in range(n_events):
        sc.add_event(label=('choose', []), 
                     source_file=('choose', []), 
                     source_time=(source_time_dist, source_time), 
                     event_time=(event_time_dist, event_time_mean, event_time_std, event_time_min, event_time_max), 
                     event_duration=(event_duration_dist, event_duration_min, event_duration_max), 
                     snr=(snr_dist, snr_min, snr_max),
                     pitch_shift=(pitch_dist, pitch_min, pitch_max),
                     time_stretch=(time_stretch_dist, time_stretch_min, time_stretch_max))

In particular, the number of events to include has to be defined manually:

n_events = np.random.randint(min_events, max_events+1)

Furthermore, event parameters (start time, duration, snr, etc.) are sampled as IID, meaning it is not possible to specify constraints (e.g. "events can't overlap", "events must be separated by at least X seconds", "event times must follow a Hawkes process").

Given this, the high-level features I can think of that would be useful include:

But I can imagine there are other things I haven't thought of that would be useful here.

@lostanlen @Elizabeth-12324 @bmcfee @pseeth @mcartwright any suggestions? I'll drop a line to the DCASE list too in case anyone in the community has some suggestions.

Thanks!

lostanlen commented 6 years ago

Right. I suppose that this can be made available to the user by means of a higher-level method named sc.add_events (note the plural), or perhaps better yet sc.add_foreground. Even if we don't have advanced point process modeling (à la Poisson / Hawkes) yet -- which would possibly require passing a pre-trained ModelHawkes object from tick -- offering a guarantee that events are further apart than event_lag_min would be very useful to @Elizabeth-12324. In BirdVox-full-night, we observed that almost all flight calls are separated by more than 100 ms from their left and right neighbors.

If you want, I can work on a greedy method that adds events one by one according to a piecewise uniform distribution whose support is progressively covered by "gaps" (intervals of null probability) corresponding to the event_lag_min vicinities of the events that are already in place. In BirdVox we only care about the time lags between the center timestamps of events (that's where the flight calls are), but by default it might be preferable to be more conservative and define event lag as the difference between the event_start of the future event and the event_stop of the past event.
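A minimal sketch of that greedy piecewise-uniform scheme (the sample_spaced_onsets name and event_lag_min parameter are illustrative, not scaper API):

import numpy as np

def sample_spaced_onsets(n_events, duration, event_lag_min):
    """Greedy sampler: each new onset is drawn from a piecewise-uniform
    distribution over [0, duration] whose support excludes the
    +/- event_lag_min vicinity of every onset already placed."""
    onsets = []
    for _ in range(n_events):
        # Carve the "gaps" (zero-probability intervals) out of [0, duration]
        gaps = sorted((max(0.0, t - event_lag_min),
                       min(duration, t + event_lag_min)) for t in onsets)
        allowed, cursor = [], 0.0
        for lo, hi in gaps:
            if lo > cursor:
                allowed.append((cursor, lo))
            cursor = max(cursor, hi)
        if cursor < duration:
            allowed.append((cursor, duration))
        lengths = np.array([hi - lo for lo, hi in allowed])
        if lengths.sum() <= 0:
            raise RuntimeError('no room left for another event')
        # Pick an interval proportionally to its length, then a point within it
        i = np.random.choice(len(allowed), p=lengths / lengths.sum())
        onsets.append(np.random.uniform(*allowed[i]))
    return sorted(onsets)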

Another thing that is very important for BirdVox is to have a nonuniform distribution of labels. Ideally we'd like to pass a histogram of species occurrence. It would also be good to be able to sample the acoustic diversity of the foreground, by means of a random variable n_labels. Setting n_labels to None would imply that all labels are sampled independently, which is the current behavior. Setting it to a constant would imply that the n_events events are sampled from only n_labels labels rather than from all available labels. For example, in the context of BirdVox, setting n_labels=1 would enforce that every foreground has only one active species. Again, we could also randomize n_labels with a Poisson random variable, a histogram, or even a truncated Gaussian.

The next level of abstraction is to model correlations between labels. E.g. I suppose jackhammer correlates positively with drilling but negatively with street_music. I don't see an obvious way to model this without falling into combinatorial explosion (and therefore lack of robustness given the sample size), but this is probably useful to keep in mind.

justinsalamon commented 6 years ago

Thanks @lostanlen, I think there are several great points in there.

For now I'd like to separate API design proposals from feature/functionality proposals, with the goal of first identifying the relevant feature set, and subsequently coming up with the most appropriate API design to support them.

Here's a summary of the feature suggestions made in your post (please correct if I missed anything):

  1. Simple constraint on event times: set minimum distance between events
  2. Complex temporal constraints on event times: potentially via tick
  3. Non-uniform label distributions (currently only uniform is supported)
  4. Constraints on label selection (e.g. limit the number of allowed labels)
  5. Model correlation between labels

Does this cover everything? Some thoughts regarding these:

Re 1/2: (1) would be straightforward to implement, but I wonder whether it would be possible to implement (1) and (2) using the same API/tool, as opposed to writing ad-hoc code for each. In particular, there might be other constraints we haven't thought of (e.g. on the allowed event overlap, or, for example, setting a minimum distance between specific label types, which also relates to (5)). So I think this point merits some investigation to see whether there could be a single unified API/mechanism for supporting a broad range of temporal constraints.

Re 3: in principle this should be easy to implement. One option would be to allow the user to specify a probability mass function over the labels (e.g. in the form of a dict {honk: 0.5, siren: 0.2, ... }) and sample labels accordingly. It might get trickier if we want this to interact with (4).
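A minimal sketch of that option, assuming the dict form suggested above (the labels and weights are just examples):

import numpy as np

label_dist = {'honk': 0.5, 'siren': 0.2, 'jackhammer': 0.3}

labels = list(label_dist.keys())
probs = np.array([label_dist[l] for l in labels])
probs = probs / probs.sum()  # normalize in case the weights don't sum to 1

# Draw one label per event according to the user-specified mass function
sampled_labels = np.random.choice(labels, size=10, p=probs)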

Re 4: n_labels=x is one type of constraint, but I can think of other examples (e.g. never include labels a and b together in the same soundscape). So the question is whether we can provide a more general framework for defining label constraints.

Re 5: this one is tricky. Do you think something like a Markov chain would make sense here? Also, this would interact with (3) and (4).
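For concreteness, here's what a Markov chain over labels might look like (purely illustrative; the labels and transition matrix are made up):

import numpy as np

labels = ['frog', 'bird', 'cricket']
# transition[i][j] = P(next label is labels[j] | current label is labels[i])
transition = np.array([[0.1, 0.8, 0.1],
                       [0.3, 0.4, 0.3],
                       [0.2, 0.2, 0.6]])

def sample_label_sequence(n_events, start_state=0):
    """Walk the chain for n_events steps, yielding one label per event."""
    seq, state = [], start_state
    for _ in range(n_events):
        seq.append(labels[state])
        state = np.random.choice(len(labels), p=transition[state])
    return seq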

Let me know what you think! Also, I think this is in the space of problems @bmcfee likes to tackle (e.g. label matching with constraints in mir_eval), so I wonder whether he has any comments on this?

Finally, since I imagine it'll take some time to identify features, design the API, and then implement (including tests and documentation), it's probably best if @lostanlen and @Elizabeth-12324 implemented quick ad-hoc solutions for the features you require for the BirdVox project in the immediate future.

lostanlen commented 6 years ago

Thanks for putting my random thoughts in order! :) @Elizabeth-12324 and myself just completed (1) and (3) in the context of BirdVox-scaper. We're going to make it into a separate repo for the scope of her internship. Then, there will be time to consider merging those contributions into scaper, possibly with some API adaptations.

Hawkes point process modeling (2) is allegedly a sledgehammer for solving 1, 3, and 5 at once. But the number of Hawkes convolutional kernels is quadratic in the number of labels, and every Hawkes kernel itself has several parameters. So that option is best reserved for a data-driven procedure, in which scaper aims at producing a "clone" of an existing dataset for which we already have strong annotation, rather than a data-agnostic synthesizer with user-defined controls.

You are right that it would be good to include @bmcfee for the discussion of (4) and (5), especially in cases where the purpose of scaper is to clone a weakly annotated dataset (for which we have label proportions and correlations, but not their associated timestamps) into a strongly annotated dataset.

lostanlen commented 6 years ago

To summarize, I could see three sorts of use case for scaper v1.x with x>0:

  (A) "Zero to strong": with a constraint satisfaction problem
  (B) "Weak to strong": with a Markov chain
  (C) "Strong to strong": with a multivariate point process

justinsalamon commented 6 years ago

Thanks @lostanlen, this is great. Let's wait to see if anyone else chimes in, and subsequently move the discussion forward.

bmcfee commented 6 years ago

Summarizing offline discussion:

  1. The goal is to make it easy to sample from a distribution of sound scenes, where the number of events is random (from distribution X), the timing is random (from some process Z), and there may be somewhat arbitrary constraints Y.
  2. Without constraints, the interface for this sort of thing should be pretty simple. The whole problem is how you expose the constraints to the user.

We talked about a couple of options, and it sounds like the most promising route is to use rejection sampling to implement the constraints. This would work by letting the user pass in X and Z, and a function reject that implements Y based on a (jams) annotation. The sampler would then propose a scene annotation a. If reject(a) == False, the audio is rendered and the scene is yielded to the user. If reject(a) == True, it is rejected, and a new scene a is sampled, and the process repeats.
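A minimal sketch of that loop (propose_scene and reject are placeholders, not actual scaper API; the attempt cap anticipates the halting discussion below):

def sample_scene(propose_scene, reject, max_attempts=1000):
    """Rejection sampler: propose scene annotations until one passes.

    propose_scene() draws a (jams) scene annotation from X (number of
    events) and Z (event timing); reject(ann) implements the constraints Y.
    """
    for _ in range(max_attempts):
        ann = propose_scene()
        if not reject(ann):
            return ann  # accepted: render the audio and yield to the user
    raise RuntimeError('no valid scene found within max_attempts')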

Some caveats:

  1. Rejection sampling is extremely inefficient, and using a Python function to implement the rejection logic makes it impossible (in the halting problem sense) to determine a priori whether any samples will be generated at all.
  2. Users will probably not want to implement rejection functions. Instead, we can provide some checker constructors for the most common cases (e.g. event_spacing(min_spacing=0.5) returns a checker that fails if any two events have insufficient spacing). To make this more powerful, the API could allow a user to pass in multiple checkers, all of which must pass to produce a sample (see the sketch after this list). This should eliminate the need to write explicit jams-checking code in all but the stickiest of situations.
  3. I'm not sure this is what you want to do for modeling label frequency / co-occurrence though, since rejection sampling will become exponentially inefficient with the number of labels / entropy of the target distributions. You might want to provide some explicit functionality to control label sampling, and then only use rejection on the timing constraints. That said, I'm not sure how you would want to implement that part of it -- some kind of entrofy-like procedure? Sounds difficult...
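A minimal sketch of such checker constructors and their composition (event_spacing comes from the comment above; the rest is illustrative, assuming a jams annotation whose observations carry onset times):

def event_spacing(min_spacing=0.5):
    """Return a checker that fails if any two events are too close."""
    def check(ann):
        onsets = sorted(obs.time for obs in ann.data)
        return all(b - a >= min_spacing for a, b in zip(onsets, onsets[1:]))
    return check

def all_pass(*checkers):
    """Combine checkers: a sample is accepted only if every checker passes."""
    return lambda ann: all(check(ann) for check in checkers)

# e.g. reject = lambda ann: not all_pass(event_spacing(0.5), other_checker)(ann)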
pseeth commented 6 years ago

Could the sampling of the audio scenes be driven by the distribution of the accepted scenes so far, with a bit of randomness thrown in to make it more efficient? That way it isn't sampling audio scenes from the initial distribution, which may not match the rejection function. You also wouldn't have to explicitly define the distribution of scenes to sample from; it would effectively be learned from the rejection function. That process might converge quickly to a single type of audio scene, though. Just throwing out ideas, this might not work.

As far as the halting problem goes, maybe throw an error or warning if no scenes have been generated within a few minutes.

A bit off topic - I use rejection sampling for generating sound scenes already, but with very specific constraints. I have a fork of Scaper that generates audio scenes but also saves the generated sources. Sometimes the source audio files don't add up to the mixture (no idea why, maybe that's a bug...). I just toss the cases where that happens and resample. Happens like 5 times per 20k generated audio scenes when using UrbanSound as the data source.

justinsalamon commented 6 years ago

Could the sampling of the audio scenes be driven by the distribution of the accepted scenes so far, with a bit of randomness thrown in there to make it more efficient?

By "scenes" do you mean soundscapes? That is, sampling a soundscape based on previously sampled soundscapes? Sounds tricky. It's not clear to me how this solves the rejection function matching issue? Anyway, in terms of halting, perhaps the cleanest option is to define n_attempts and if that value is surpassed without successfully matching the condition the process halts.

Sometimes the source audio files don't add up to the mixture (no idea why, maybe that's a bug...).

5 per 20k sounds like a heisenbug O_O but impossible to say without going through your code. Also, does that still happen with v1.0.0rc1? Between 0.1 and 0.2 I updated the LUFS calculation so that it happens after the sound source is trimmed to the desired duration; previously LUFS was computed on the entire source file prior to trimming. I wonder if that's the source of the issue (if it is, it shouldn't happen in versions >=0.2).

justinsalamon commented 6 years ago

Thanks @bmcfee for the great summary. Regarding the caveats you mention:

  1. I think a solution could be (as noted above) to set an n_attempts parameter and halt if it is surpassed. The onus is then on the user to specify constraints that are likely to be satisfied (and they can vary n_attempts based on how insistent they are about the constraints).
  2. Yes.
  3. I like the idea of separating the label sampling from the temporal sampling. The only caveat I can see to this is that it would not allow for something like "a frog call is often followed by a bird call". Basically, there are scenarios where label sampling is also a process (could be modeled by a Markov chain for example). Not sure how to reconcile label processes and label constraints, though.
pseeth commented 6 years ago

Yeah, it seems like a heisenbug, hence the rejection sampling haha. I'll see if v1 fixes it once I merge my changes and write some tests! I should probably think more about the efficient rejection sampling; it was just something that came to mind immediately. The soundscapes that have been accepted so far should tell you something about how to create future soundscapes that are less likely to be rejected, but it could be hard to get that intuition to work out.

Something else that comes to mind - for music soundscape generation it's sometimes important that the generated soundscape is coherent - all sources start and end at the same time in their corresponding stem files before being mixed. Currently I'm having to hack it - see this gist. It'd be nice if coherence were also something that could be specified in this high-level API you're thinking about implementing. It's not totally necessary though, as the logic in that hack works pretty well.

justinsalamon commented 6 years ago

@pseeth Regarding the temporal coherence issue, it would be easy to implement as a constraint but almost impossible to achieve via rejection sampling :) I guess it would be easy to achieve as a temporal sampling process (where basically the process is choose a constant and stick to it for all events).
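For instance, with the existing tuple spec, that constant-time process amounts to pinning the event time and duration (a sketch; the stem labels are hypothetical, and duration is the soundscape duration):

# All stems start together and span the full soundscape, so the mixture is
# "coherent" in the sense described above
for stem_label in ['vocals', 'drums', 'bass']:
    sc.add_event(label=('const', stem_label),
                 source_file=('choose', []),
                 source_time=('const', 0),
                 event_time=('const', 0),
                 event_duration=('const', duration),
                 snr=('uniform', -3, 3),
                 pitch_shift=None,
                 time_stretch=None)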

Regarding the heisenbug, let me know if it still happens once you update to v1.

Side note: for the source separation PR, it would be best to open the PR before you write any more code, so we can start discussing the API and desired functionality as soon as possible and avoid having to re-implement things. It doesn't matter if the tests aren't there yet.

JorisCos commented 4 years ago

In addition to the features listed in this thread, I would add a global constraint on the generation. To maximize the usage of the raw materials, i.e. foreground and background files, it would be nice to avoid generating soundscapes from already-used material. To do so, a parameter specifying whether materials can be reused could be added to generate. Internally, a way to monitor and update the list of unused materials after each call to generate should be implemented.
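A minimal sketch of what that bookkeeping could look like on the user side until such a parameter exists (used_files and pick_unused are illustrative names; the folder is assumed to hold the source files for one label):

import os
import random

used_files = set()

def pick_unused(folder):
    """Draw a source file that hasn't appeared in any previous soundscape."""
    candidates = [os.path.join(folder, f) for f in os.listdir(folder)
                  if os.path.join(folder, f) not in used_files]
    if not candidates:
        raise RuntimeError('all source material has been used')
    choice = random.choice(candidates)
    used_files.add(choice)
    return choice

# The chosen file can then be pinned via source_file=('const', pick_unused(...))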