librosa / librosa

Python library for audio and music analysis
https://librosa.org/
ISC License

Tempogram ratio and f0 harmonic interpolation #1500

Closed bmcfee closed 1 year ago

bmcfee commented 2 years ago

Is your feature request related to a problem? Please describe.

This was alluded to in #1426 , but it would be handy to finally provide an implementation of the tempogram ratio feature from Peeters (2005). It would look something like the following (bottom subplot):

[figure omitted: example tempogram ratio feature, bottom subplot]

Describe the solution you'd like

The basic idea is to take a tempogram, extract a (time-varying) tempo estimate (corresponding to quarter-notes), and then use harmonic interpolation to measure tempogram energy for each frame at all musically important durations. The benefit of this over a raw tempogram is that it could locally normalize for tempo variation.

The underlying algorithm is somewhat similar to our interp_harmonics function, except that we want to pull out a different subset of frequencies for each frame. I imagine the implementation would use a vectorized interpolator in a similar fashion to what we do for reassigned spectrogram harmonics:

https://github.com/librosa/librosa/blob/7ac022a8496126e95b46a7d56ad880a328359bda/librosa/core/harmonic.py#L253-L265

but of course the details will be slightly different.

We could also support having a single, global tempo (much simpler), as well as aggregation over frames.

bmcfee commented 2 years ago

I hacked up a quick prototype using vectorized interpolation as described above:

import numpy as np
import scipy.interpolate
import librosa

# Interpolation settings
kind = 'linear'
fill_value = 0

y, sr = librosa.load(librosa.ex('sweetwaltz'))

# Get the autocorrelation tempogram
tg = librosa.feature.tempogram(y=y, sr=sr)

# Get time-varying tempo estimates
tempi = librosa.beat.tempo(y=y, sr=sr, aggregate=None)

# Get the frequencies (BPM) for each tempogram bin
frequencies = librosa.tempo_frequencies(len(tg))

# Trim to ensure tempi and tg match
tg = np.abs(tg[..., :len(tempi)])

# Vectorize interpolation across time axis
# Even if the frequency set is constant for the tempogram,
# each frame could be asking for a different set of measurements
def _f_interp(tgc, f):
    # the isfinite here is to ignore the infinite frequency derived from lag=0 in AC-tempogram
    interp = scipy.interpolate.interp1d(frequencies[np.isfinite(frequencies)],
                                        tgc[np.isfinite(frequencies)],
                                        axis=0,
                                        bounds_error=False,
                                        copy=False,
                                        assume_sorted=False,
                                        kind=kind,
                                        fill_value=fill_value)
    return interp(f)

xfunc = np.vectorize(_f_interp, signature='(f),(h)->(h)')

# Factors from Peeters'05
factors = [1/4, 1/3, 1/2, 2/3, 3/4, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75, 4]

tgr = xfunc(tg.swapaxes(-2, -1), np.multiply.outer(tempi, factors)).swapaxes(-2, -1)

After a bit of reflection, I realized that this core idea applies to spectrograms as well as tempograms. In that case, it could be adapted to extract energy at harmonics of a chosen frequency (e.g., f0) for each frame, essentially providing a kind of formant extractor or pitch-normalized (monophonic) timbre descriptor. (Aside: @dpwe does this idea ring any bells for you? It seems like something the speech community would have come up with ages ago.)

@lostanlen also pointed me at this recent work by @zafarrafii which has a similar flavor, though it is more tuned to CQT-based representations and uses a deconvolution approach to pull out f0. There are also some slight differences in how the harmonics are derived, but overall I think it's a very similar idea. (@zafarrafii what do you think?)


All of the above is to say that I'm now convinced we could get a lot of mileage out of a core function of the form:

librosa.core.harmonics.f0_harmonics(X, freqs, f0, harmonics, ...)

which is similar to interp_harmonics, except that rather than producing an output of shape … × N_harmonics × N_frequencies × N_frames we only compute the harmonics of the selected f0 for each frame, resulting in an output of shape … × N_harmonics × N_frames.
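For concreteness, here is a rough sketch of what such a function could look like. The name `f0_harmonics_sketch` and its exact signature are hypothetical, not a proposed final API; it just follows the vectorized-interpolation pattern from the prototype above:

```python
import numpy as np
import scipy.interpolate


def f0_harmonics_sketch(X, freqs, f0, harmonics, kind="linear", fill_value=0.0):
    """Hypothetical sketch: sample X at multiples of a per-frame f0.

    X         : (n_freqs, n_frames) time-frequency representation
    freqs     : (n_freqs,) frequencies of the bins of X
    f0        : (n_frames,) per-frame fundamental estimate
    harmonics : (n_harmonics,) multipliers of f0 to measure

    Returns an array of shape (n_harmonics, n_frames).
    """
    # Mask out non-finite bin frequencies (e.g., lag=0 in an AC tempogram)
    valid = np.isfinite(freqs)

    def _interp_one(column, targets):
        # One frame: interpolate its spectrum at that frame's target frequencies
        interp = scipy.interpolate.interp1d(
            freqs[valid],
            column[valid],
            kind=kind,
            bounds_error=False,
            fill_value=fill_value,
        )
        return interp(targets)

    # Each frame can request a different set of frequencies,
    # so vectorize the interpolation over the frame axis
    vinterp = np.vectorize(_interp_one, signature="(f),(h)->(h)")
    targets = np.multiply.outer(f0, harmonics)  # (n_frames, n_harmonics)
    return vinterp(X.T, targets).T              # (n_harmonics, n_frames)
```

A convenient sanity check: on a toy input where each column of X is the identity function of frequency, each output entry comes back as f0 times the corresponding harmonic.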

If we extract f0, say from pyin, this can be used to derive formant or timbre descriptors from any kind of spectrogram (linear, reassigned/time-varying, mel, cqt, vqt, iirt, etc).

If we extract "f0" from tempo estimation (either static or time-varying), then we can apply the harmonic extractor to tempograms (either autocorrelation or Fourier) to produce tempogram ratio features.

zafarrafii commented 2 years ago

I am trying to understand how your idea is different from interp_harmonics; I am not familiar with that function. What I can say is that, in my case, I was trying to deconvolve a log-spectrum into some sort of energy-normalized pitch component and a pitch-normalized energy component. Since that energy component is pitch-normalized, you don't need to estimate the f0, and you can then easily find the energy of the harmonics (this works better in monophonic cases), hence the idea of using it to derive a simple timbre descriptor. Would you need to provide the f0 in your case then?

bmcfee commented 2 years ago

I am trying to understand how your idea is different from interp_harmonics; I am not familiar with that function.

That function is used for expanding a time-frequency representation to add a harmonics dimension (think HCQT). So we have frequencies × time → harmonics × frequencies × time. The proposed utility function could be thought of as slicing out a particular frequency at each time step, giving frequencies × time → harmonics × time.
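To make the shape contrast concrete, here is a toy NumPy sketch (loop-based for clarity; a real implementation would vectorize, and all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_freqs, n_frames = 64, 10
freqs = np.linspace(0.0, 1000.0, n_freqs)
S = rng.random((n_freqs, n_frames))  # stand-in spectrogram
harmonics = np.array([0.5, 1.0, 2.0, 3.0])

# interp_harmonics-style expansion: resample EVERY bin at h * freqs,
# giving harmonics x frequencies x time
expanded = np.stack([
    np.stack([np.interp(h * freqs, freqs, S[:, t], left=0.0, right=0.0)
              for t in range(n_frames)], axis=-1)
    for h in harmonics
])
assert expanded.shape == (len(harmonics), n_freqs, n_frames)

# Proposed slicing: sample only h * f0[t] in each frame,
# giving harmonics x time
f0 = np.full(n_frames, 220.0)  # a (fake) per-frame f0 track
sliced = np.stack([
    np.array([np.interp(h * f0[t], freqs, S[:, t]) for t in range(n_frames)])
    for h in harmonics
])
assert sliced.shape == (len(harmonics), n_frames)
```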

Would you need to provide the f0 in your case then?

Yes - I would expect f0 to be estimated separately, analogous to your pitch normalization step. The motivation for separating out this functionality is that it could then be applied to both pitch and rhythm under a unified framework.

zafarrafii commented 2 years ago

Got it. That could be a nice addition, why not :)

bmcfee commented 2 years ago

Poring over some old discussions, I just realized that this proposed functionality could also be useful in some unexpected ways. If f0 is fixed to a tonal center frequency (over all time), and the "harmonics" are allowed to be fractions (intervals; there's no technical reason to forbid this), then we can also produce pitch salience histograms, as described here: https://github.com/librosa/librosa/issues/641#issuecomment-636593736
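As a rough sketch of that idea (all names and values here are illustrative, not librosa API): fix the tonal center, use fractional semitone ratios as the "harmonics", and slice a salience map at those intervals for each frame:

```python
import numpy as np

# Illustrative: fixed tonal center (say A3 = 220 Hz) over all time
tonic = 220.0
# Fractional "harmonics": 12 semitone ratios spanning one octave
intervals = 2.0 ** (np.arange(12) / 12)

rng = np.random.default_rng(1)
n_freqs, n_frames = 128, 5
freqs = np.linspace(100.0, 500.0, n_freqs)
S = rng.random((n_freqs, n_frames))  # stand-in salience map

# Slice the salience at tonic * interval for every frame,
# giving a pitch-class-like (12 x n_frames) representation
salience = np.stack([
    np.interp(tonic * intervals, freqs, S[:, t]) for t in range(n_frames)
], axis=-1)
assert salience.shape == (12, n_frames)
```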

bmcfee commented 1 year ago

Following up on this now that there's a stable implementation up for code review, I wanted to clarify the relationship between f0_harmonics and CQHC. The key difference is that CQHC retains all frequencies above the fundamental, while f0_harmonics slices down to just those of interest (typically integer partials). It's possible to get something very close to CQHC by specifying harmonics as 2.0**(np.arange(n_bins) / bins_per_octave) and letting the interpolator fill in zeros once it passes above the Nyquist frequency.
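For what it's worth, that harmonics vector is just a geometric series, repeating by a factor of two every bins_per_octave entries (the values below are illustrative):

```python
import numpy as np

n_bins, bins_per_octave = 36, 12  # illustrative values
harmonics = 2.0 ** (np.arange(n_bins) / bins_per_octave)
# Entries bins_per_octave apart differ by a factor of 2 (one octave),
# so harmonics[0] = 1, harmonics[12] = 2, harmonics[24] = 4
```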