justinsalamon opened this issue 6 years ago
Thanks for opening this. There is a new Python library on point processes named tick which is quickly gaining traction, as it is fast, flexible and offers a sklearn-like API for parametric and nonparametric estimation of Hawkes processes. https://github.com/X-DataInitiative/tick tagging main contributors @dekken @mbompr
This paper by Dan Stowell models inter-individual interactions between vocalizing birds (in cages) by means of a nonlinear GLMpp (generalized linear model point process). Apparently it does not fit BirdVox-full-night (migrating birds in flight) very well, though. http://rsif.royalsocietypublishing.org/content/13/119/20160296
Thanks for the suggestion @lostanlen, this looks like a good option for simulating Poisson and Hawkes processes (for example) for the purpose of distributing sound events in time.
To start things off I'd like to first figure out what a high-level generator API should look like, starting with desired functionality and features.
To illustrate, right now events have to be added to the event spec one by one, along the lines of (excerpt from README example):
```python
# Generate 1000 soundscapes using a truncated normal distribution of start times
for n in range(n_soundscapes):

    # create a scaper
    sc = scaper.Scaper(duration, fg_folder, bg_folder)
    sc.protected_labels = []
    sc.ref_db = ref_db

    # add background
    sc.add_background(label=('const', 'noise'),
                      source_file=('choose', []),
                      source_time=('const', 0))

    # add random number of foreground events
    n_events = np.random.randint(min_events, max_events+1)
    for _ in range(n_events):
        sc.add_event(label=('choose', []),
                     source_file=('choose', []),
                     source_time=(source_time_dist, source_time),
                     event_time=(event_time_dist, event_time_mean, event_time_std, event_time_min, event_time_max),
                     event_duration=(event_duration_dist, event_duration_min, event_duration_max),
                     snr=(snr_dist, snr_min, snr_max),
                     pitch_shift=(pitch_dist, pitch_min, pitch_max),
                     time_stretch=(time_stretch_dist, time_stretch_min, time_stretch_max))
```
In particular, the number of events to include has to be defined manually:
```python
n_events = np.random.randint(min_events, max_events+1)
```
Furthermore, event parameters (start time, duration, snr, etc.) are sampled IID, meaning it is not possible to specify constraints (e.g. "events can't overlap", "events must be separated by at least X seconds", "event times must follow a Hawkes process").
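To illustrate the difference, here is a minimal sketch in plain numpy (no scaper or tick API assumed) contrasting IID start times with start times drawn from a homogeneous Poisson process, where inter-event gaps are exponentially distributed:

```python
import numpy as np

rng = np.random.default_rng(0)
duration = 10.0  # soundscape length in seconds (hypothetical)

# current behavior: each event time is drawn independently (IID uniform)
iid_times = rng.uniform(0, duration, size=5)

# homogeneous Poisson process: exponential inter-arrival times with rate lam,
# so the number and placement of events emerge from the process itself
lam = 0.5  # expected events per second (hypothetical)
gaps = rng.exponential(1.0 / lam, size=100)
poisson_times = np.cumsum(gaps)
poisson_times = poisson_times[poisson_times < duration]
```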
Given this, the high-level features I can think of that would be useful include:

- Sampling the number of events per soundscape from a user-specified distribution, rather than computing it manually.
- Sampling event times from a temporal process (e.g. Poisson or Hawkes) rather than IID.
- Supporting temporal constraints on events (e.g. no overlap, minimum separation between consecutive events).
But I can imagine there are other things I haven't thought of that would be useful here.
@lostanlen @Elizabeth-12324 @bmcfee @pseeth @mcartwright any suggestions? I'll drop a line to the DCASE list too in case anyone in the community has some suggestions.
Thanks!
Right. I suppose that this can be made available to the user by means of a higher-level method named `sc.add_events` (note the plural), or perhaps better yet `sc.add_foreground`.

Even if we don't have advanced point process modeling (à la Poisson / Hawkes) yet -- which would possibly require passing a pre-trained `ModelHawkes` object from tick -- offering a guarantee that events are further apart than `event_lag_min` would be very useful to @Elizabeth-12324. In BirdVox-full-night, we observed that almost all flight calls are more than 100 ms apart from their left and right neighbors. If you want, I can work on a greedy method that adds events one by one according to a piecewise uniform distribution whose support is progressively covered by "gaps" (intervals of null probability) corresponding to the `event_lag_min` vicinities of the events that are already in place.

In BirdVox we only care about the time lags between the center timestamps of events (that's where the flight calls are), but by default it might be preferable to be more conservative and define the event lag as the difference between the `event_start` of the future event and the `event_stop` of the past event.
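A minimal sketch of such a greedy sampler, using plain numpy (not the scaper API) and assuming point-like events with a hypothetical `event_lag_min` parameter:

```python
import numpy as np

def sample_times_with_min_lag(n_events, duration, event_lag_min, rng=None):
    """Greedily sample timestamps from a piecewise-uniform distribution whose
    support excludes the +/- event_lag_min vicinity of already-placed events."""
    rng = rng or np.random.default_rng()
    allowed = [(0.0, duration)]  # intervals that still have nonzero probability
    times = []
    for _ in range(n_events):
        if not allowed:
            break  # no room left for another event
        lengths = np.array([b - a for a, b in allowed])
        # pick an interval proportionally to its length, then a point inside it
        idx = rng.choice(len(allowed), p=lengths / lengths.sum())
        a, b = allowed[idx]
        t = rng.uniform(a, b)
        times.append(t)
        # carve a gap of null probability around the new event
        lo, hi = t - event_lag_min, t + event_lag_min
        new_allowed = []
        for a, b in allowed:
            if hi <= a or lo >= b:
                new_allowed.append((a, b))
            else:
                if a < lo:
                    new_allowed.append((a, lo))
                if hi < b:
                    new_allowed.append((hi, b))
        allowed = new_allowed
    return sorted(times)

# e.g. 10 call timestamps in a 60 s soundscape, at least 100 ms apart
times = sample_times_with_min_lag(10, 60.0, 0.1)
```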
Another thing that is very important for BirdVox is to have a nonuniform distribution of labels. Ideally we'd like to pass a histogram of species occurrence. It would also be good to be able to sample the acoustic diversity of the foreground, by means of a random variable `n_labels`. Setting `n_labels` to `None` would imply that all labels are sampled independently, which is the current behavior. Setting it to a constant would imply that the `n_events` events draw their labels from only `n_labels` labels rather than from all available labels. For example, in the context of BirdVox, setting `n_labels=1` would enforce that every foreground has only one active species. Again, we could also randomize `n_labels` with a Poisson random variable, a histogram, or even a truncated Gaussian.
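A minimal sketch of what that could look like, assuming the label histogram is given as a dict and `n_labels` is a hypothetical argument (plain numpy, not the scaper API):

```python
import numpy as np

def sample_labels(label_hist, n_events, n_labels=None, rng=None):
    """Sample one label per event from a histogram of label probabilities,
    optionally restricting the soundscape to n_labels distinct labels."""
    rng = rng or np.random.default_rng()
    labels = list(label_hist)
    probs = np.array([label_hist[l] for l in labels], dtype=float)
    probs /= probs.sum()
    if n_labels is not None:
        # first choose which labels are active in this soundscape...
        active = rng.choice(len(labels), size=n_labels, replace=False, p=probs)
        labels = [labels[i] for i in active]
        probs = probs[active] / probs[active].sum()
    # ...then sample every event's label from the (restricted) histogram
    idx = rng.choice(len(labels), size=n_events, p=probs)
    return [labels[i] for i in idx]

# hypothetical species histogram; n_labels=1 enforces a single active species
hist = {'species_a': 0.5, 'species_b': 0.3, 'species_c': 0.2}
print(sample_labels(hist, n_events=5, n_labels=1))
```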
The next level of abstraction is to model correlations between labels. E.g. I suppose `jackhammer` correlates positively with `drilling` but negatively with `street_music`. I don't see an obvious way to model this without falling into combinatorial explosion (and therefore lack of robustness given the sample size), but this is probably useful to keep in mind.
Thanks @lostanlen, I think there are several great points in there.
For now I'd like to separate API design proposals from feature/functionality proposals, with the goal of first identifying the relevant feature set, and subsequently coming up with the most appropriate API design to support them.
Here's a summary of the feature suggestions made in your post (please correct me if I missed anything):

1. Guarantee a minimum time lag (e.g. `event_lag_min`) between events
2. Model event times with a point process (e.g. Poisson / Hawkes), possibly via tick
3. Support a nonuniform distribution over labels (e.g. a histogram of species occurrence)
4. Control the number of distinct labels per soundscape via `n_labels`
5. Model correlations between labels
Does this cover everything? Some thoughts regarding these:
Re 1/2: (1) would be straightforward to implement, but I wonder whether it would be possible to implement (1) and (2) using the same API/tool, as opposed to writing ad-hoc code for each. In particular, there might be other constraints we haven't thought of (e.g. on the allowed event overlap, or a minimum distance between events of specific label types, which also relates to (5)). So I think this point merits some investigation to see whether there could be a single unified API/mechanism for supporting a broad range of temporal constraints.
Re 3: in principle this should be easy to implement. One option would be to allow the user to specify a probability mass distribution over the labels (e.g. in the form of a dict `{'honk': 0.5, 'siren': 0.2, ...}`) and sample labels accordingly. It might get trickier if we want this to interact with (4).
Re 4: `n_labels=x` is one type of constraint, but I can think of other examples (e.g. never include labels `a` and `b` together in the same soundscape). So the question is whether we can provide a more general framework for defining label constraints.
Re 5: this one is tricky. Do you think something like a Markov chain would make sense here? Also, this would have interactions with (3) and (4).
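To make the Markov chain idea concrete, here is a toy sketch (not a proposal for the actual API): labels are drawn sequentially and the next label depends on the previous one via a transition matrix, which can encode positive or negative co-occurrence. The transition probabilities below are made up for illustration.

```python
import numpy as np

labels = ['jackhammer', 'drilling', 'street_music']
# rows = previous label, columns = next label (illustrative values only)
transition = np.array([
    [0.5, 0.4, 0.1],   # after jackhammer, drilling is likely but street_music is not
    [0.4, 0.5, 0.1],
    [0.1, 0.1, 0.8],
])

def sample_label_chain(n_events, rng=None):
    rng = rng or np.random.default_rng()
    seq = [rng.integers(len(labels))]
    for _ in range(n_events - 1):
        seq.append(rng.choice(len(labels), p=transition[seq[-1]]))
    return [labels[i] for i in seq]
```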
Let me know what you think! Also, I think this is in the space of problems @bmcfee likes to tackle (e.g. label matching with constraints in mir_eval), so I wonder whether he has any comments on this?
Finally, since I imagine it'll take some time to identify features, design the API, and then implement (including tests and documentation), it's probably best if @lostanlen and @Elizabeth-12324 implemented quick ad-hoc solutions for the features you require for the BirdVox project in the immediate future.
Thanks for putting my random thoughts in order! :) @Elizabeth-12324 and myself just completed (1) and (3) in the context of BirdVox-scaper. We're going to make it into a separate repo for the scope of her internship. Then, there will be time to consider merging those contributions into scaper, possibly with some API adaptations.
Hawkes point process modeling (2) is allegedly a sledgehammer for solving 1, 3, and 5 at once. But its number of Hawkes convolutional kernels is quadratic in the number of labels, and every Hawkes kernel itself has several parameters. So that option is best reserved for a data-driven procedure, in which scaper aims at producing a "clone" of an existing dataset for which we already have strong annotation, rather than a data-agnostic synthesizer with user-defined controls.
You are right that it would be good to include @bmcfee for the discussion of (4) and (5), especially in cases where the purpose of scaper is to clone a weakly annotated dataset (for which we have label proportions and correlations, but not their associated timestamps) into a strongly annotated dataset.
To summarize, I could see three sorts of use cases for scaper v1.x with x>0:
(A) "Zero to strong", with a constraint satisfaction problem
(B) "Weak to strong", with a Markov chain
(C) "Strong to strong", with a multivariate point process
Thanks @lostanlen, this is great. Let's wait to see if anyone else chimes in, and subsequently move the discussion forward.
Summarizing offline discussion:
The target use case is one where the content of the scene is sampled at random (from some distribution `X`), the timing is random (from some process `Z`), and there may be somewhat arbitrary constraints `Y`. We talked about a couple of options, and it sounds like the most promising route is to use rejection sampling to implement the constraints. This would work by letting the user pass in `X` and `Z`, and a function `reject` that implements `Y` based on a (jams) annotation. The sampler would then propose a scene annotation `a`. If `reject(a) == False`, the audio is rendered and the scene is yielded to the user. If `reject(a) == True`, it is rejected, a new scene `a` is sampled, and the process repeats.
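A minimal sketch of this loop, with hypothetical names (a `propose_scene` callable standing in for sampling from `X` and `Z`, checker factories like `event_spacing`, and an `n_attempts` cap discussed further below); this is not the actual scaper API:

```python
def event_spacing(min_spacing=0.5):
    """Factory: returns a checker that rejects (returns True) if any two
    consecutive events start less than min_spacing seconds apart."""
    def reject(ann):
        times = sorted(obs.time for obs in ann.data)  # assumes a jams-like annotation
        return any(t2 - t1 < min_spacing for t1, t2 in zip(times, times[1:]))
    return reject

def sample_scene(propose_scene, checkers, n_attempts=100):
    """Rejection sampling: propose scenes until every checker passes."""
    for _ in range(n_attempts):
        ann = propose_scene()  # draws content from X and timing from Z
        if not any(reject(ann) for reject in checkers):
            return ann         # accepted: ready to be rendered to audio
    raise RuntimeError('no scene satisfying the constraints after n_attempts tries')
```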
Some caveats:
- There is no termination guarantee: if the constraints are hard (or impossible) to satisfy, the sampler may never produce a valid scene (the halting problem).
- Writing a `reject` function directly against a jams annotation may be unfriendly for most users, so the API could provide factories for common checks (e.g. `event_spacing(min_spacing=0.5)` returns a checker that fails if any two events have insufficient spacing). To make this more powerful, the API could allow a user to pass in multiple checkers, which all must pass to produce a sample. This should eliminate the need to write explicit jams-checking code in all but the stickiest of situations.

Could the sampling of the audio scenes be driven by the distribution of the accepted scenes so far, with a bit of randomness thrown in there to make it more efficient? That way it isn't sampling audio scenes from the initial distribution that may not match the rejection function. You also don't have to explicitly define the distribution of scenes to sample from. It would maybe get learned from the rejection function. That process might converge quickly to a single type of audio scene, though. Just throwing out ideas, this might not work.
As far as the halting problem goes, maybe throw an error or warning if no scenes have been generated within a few minutes.
A bit off topic - I use rejection sampling for generating sound scenes already, but with very specific constraints. I have a fork of Scaper that generates audio scenes but also saves the generated sources. Sometimes the source audio files don't add up to the mixture (no idea why, maybe that's a bug...). I just toss the cases where that happens and resample. Happens like 5 times per 20k generated audio scenes when using UrbanSound as the data source.
> Could the sampling of the audio scenes be driven by the distribution of the accepted scenes so far, with a bit of randomness thrown in there to make it more efficient?

By "scenes" do you mean soundscapes? That is, sampling a soundscape based on previously sampled soundscapes? Sounds tricky, and it's not clear to me how it would solve the rejection-function matching issue. Anyway, in terms of halting, perhaps the cleanest option is to define `n_attempts`, and if that value is surpassed without successfully matching the condition, the process halts.
> Sometimes the source audio files don't add up to the mixture (no idea why, maybe that's a bug...).

5 per 20k sounds like a heisenbug O_O but it's impossible to say without going through your code. Also, does that still happen with v1.0.0rc1? Between 0.1 and 0.2 I updated the LUFS calculation so that it happens after the sound source is trimmed to the desired duration; previously LUFS was computed on the entire source file prior to trimming. I wonder if that's the source of the issue (if it is, it shouldn't happen in versions >=0.2).
Thanks @bmcfee for the great summary. Regarding the caveats you mention:
- Re the halting problem: probably the cleanest solution is to define an `n_attempts` parameter and halt if it is surpassed. The onus is then on the user to specify constraints that are likely to be satisfied (and they can vary `n_attempts` based on how insistent they are about the constraints).

Yeah, it seems like a heisenbug, hence the rejection sampling haha. I'll see if v1 fixes it once I merge my changes and write some tests! I should probably think more about the efficient rejection sampling; it was just something that came to mind immediately. The soundscapes that have been accepted so far should tell you something about how to create future soundscapes that are less likely to be rejected, but it could be hard to get that intuition to work out.
Something else that comes to mind: for music soundscape generation it's sometimes important that the generated soundscape is coherent, i.e. all sources start and end at the same time in their corresponding stem files before being mixed. Currently I'm having to hack it (see this gist). It'd be nice if coherence were also something that could be specified in this high-level API you're thinking about implementing. Not totally necessary though, as the logic in that hack works pretty well.
@pseeth Regarding the temporal coherence issue, it would be easy to implement as a constraint but almost impossible to achieve via rejection sampling :) I guess it would be easy to achieve as a temporal sampling process (where basically the process is choose a constant and stick to it for all events).
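For instance, coherence could be obtained by sampling the timing once and reusing it for every event, using the existing per-event API; a hypothetical sketch (the stem labels, paths, and value ranges below are made up):

```python
import numpy as np
import scaper

rng = np.random.default_rng()
sc = scaper.Scaper(10.0, 'foreground/', 'background/')  # placeholder paths

# sample the timing once, then reuse it so all stems start and end together
event_time = rng.uniform(0.0, 2.0)   # shared onset in seconds
event_duration = 5.0                 # shared duration in seconds

for label in ['vocals', 'drums', 'bass']:  # hypothetical stem labels
    sc.add_event(label=('const', label),
                 source_file=('choose', []),
                 source_time=('const', 0),
                 event_time=('const', event_time),
                 event_duration=('const', event_duration),
                 snr=('uniform', 0, 5),
                 pitch_shift=('const', 0),
                 time_stretch=('const', 1))
```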
Regarding the heisenbug, let me know if it still happens once you update to v1.
Side note: for the source separation PR, the best would be to open the PR before you write any more code, so we can start discussing the API and desired functionality as soon as possible to avoid having to re-implement things. Doesn't matter if the tests aren't there yet.
In addition to the features listed in this thread, I would add a global constraint on the generation.
To maximize the usage of the raw materials (i.e. foreground and background files), it could be nice to avoid generating soundscapes with already-used materials. To do so, a parameter specifying whether materials can be reused could be added to `generate`.

Internally, a way to monitor and update the list of unused materials after each call to `generate` should be implemented.
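A rough sketch of what that bookkeeping could look like, assuming a hypothetical `SourcePool` helper wrapped around the existing `generate` call (the jams field names are assumptions, not the documented scaper schema):

```python
import os

class SourcePool:
    """Tracks which source files have not been used yet across generate() calls."""
    def __init__(self, fg_folder):
        self.unused = set()
        for root, _, files in os.walk(fg_folder):
            self.unused.update(os.path.join(root, f)
                               for f in files if f.endswith('.wav'))

    def mark_used(self, annotation):
        # assumes a jams-like annotation whose observation values store the source file path
        for obs in annotation.data:
            self.unused.discard(obs.value.get('source_file'))

    def exhausted(self):
        return len(self.unused) == 0
```

After each call to `generate`, `mark_used` would be fed the returned annotation, and when reuse is disallowed the candidate source files would be restricted to the pool's `unused` set.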
Right now each event has to be explicitly added to the event specification (e.g. via for loop). It would be helpful to have high-level generators such that you'd only have to specify something along the lines of "generate a soundscape where the number of events is sampled from distribution X obeying temporal distribution Y with constraints Z".
This, in addition to simplifying some use cases, would allow supporting non-IID event distributions, e.g. Hawkes (self-exciting) processes as suggested by @lostanlen.
Related: cf. high-level knobs provided in SimScene (e.g. figure 1)