librosa / librosa

Python library for audio and music analysis
https://librosa.org/
ISC License

Salience function #297

Closed stefan-balke closed 8 years ago

stefan-balke commented 8 years ago

Hey there,

what about adding a salience function to librosa à la Justin Salamon for analysis tasks? One way to do it is with a reassigned spectrogram based on instantaneous frequencies. Then add the subharmonics.

Testing material could come from Essentia.

Before collecting the stuff (I have everything somewhere), I wanted to ask if this is worth the work.

Best Stefan

bmcfee commented 8 years ago

Sure, that could be useful.

How about we page in @justinsalamon and @rabitt for advice on this?

justinsalamon commented 8 years ago

yeah sure why not :) There are some implementation details - you can do it all in matrix algebra but then it's not based on peak-picking, which generally gives a cleaner representation (the Essentia and Melodia implementations both use peak-picking). Sounds like a fun HAMR project I say.

stefan-balke commented 8 years ago

@justinsalamon, I'm not necessarily talking about predominant melody extraction. I am more interested in having: a) some spectrogram reassignment in librosa (as an analysis alternative to cqt) and b) another midlevel representation based on subharmonic summation. MELODIA could be built upon these features, of course, but I don't know if it is smart to have another implementation, since Essentia is doing the same but much faster.

justinsalamon commented 8 years ago

@stefan-balke sorry if I wasn't clear, I wasn't suggesting to implement melodia in librosa, just the salience function (which makes use of reassignment and subharmonic summation, but only for spectral peaks, which was the point I was trying to make)

stefan-balke commented 8 years ago

Ah okay, gotcha. I created a small notebook for a reassigned spec based on the instantaneous frequency distribution which is already in librosa: http://nbviewer.jupyter.org/github/stefan-balke/mpa-exc/blob/master/02_fourier_transform/reassigned_spectrogram.ipynb

@justinsalamon did you use specific unit test scenarios? And also, did you use the IFD by Abe or a phase vocoder?

bmcfee commented 8 years ago

Coming back to this after spending some time hacking at rhythm features, where it seems like there are actually some commonalities. Specifically, Prockup (2015) uses (sub)harmonics to amplify tempogram peaks.

@stefan-balke @justinsalamon is there a good reference for what exactly is meant by "salience function"?

I'm imagining a general-purpose (sub)harmonic enhancement function that can apply to stft, ifgram, cqt, or tempogram data. A crude first-pass implementation (for tempograms) is here. The algorithm is pretty simple:

  1. Build a spline interpolator from the input matrix. In general, the sampling points (ie frequencies) could be supplied by the user, so you could use fft_frequencies, cqt_frequencies, or ifgram output.
  2. For each element of the input, sum out its first K (sub-)harmonics. I imagine you probably want some specific weighting on this.

The trickiest part here would be supporting ifgram output, since each column will have a different frequency domain, and the standard fast interpolators (interp1d or RectBivariateSpline) won't work.
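
For the simple case of a shared frequency grid (so everything except ifgram), a minimal sketch of those two steps could look like this; the function name, the cubic spline, and the geometric weighting are placeholders rather than a committed API:

import numpy as np
from scipy.interpolate import interp1d

def enhance_harmonics(S, freqs, n_harmonics=5, alpha=0.8):
    # S: (n_freqs, n_frames) magnitude matrix; freqs: (n_freqs,) bin frequencies
    # 1. build a spline interpolator over the frequency axis
    interp = interp1d(freqs, S, kind='cubic', axis=0,
                      bounds_error=False, fill_value=0.0)
    # 2. sum the energy at the first K harmonics of every bin,
    #    with a geometric weight per harmonic
    S_out = np.zeros_like(S)
    for k in range(1, n_harmonics + 1):
        S_out += alpha ** (k - 1) * interp(k * freqs)
    return S_out

Subharmonics would just be fractional multipliers (1/2, 1/3, ...) in the same loop.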

Thoughts?

stefan-balke commented 8 years ago

Hey,

my point of reference for salience function is:

.. [1] Justin Salamon, Emilia Gomez "Melody Extraction from Polyphonic Music
   Signals using Pitch Contour Characteristics",
   IEEE Transactions on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.

Another interesting point could be Jonathan's dissertation (he got his PhD yesterday): https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-students/2016_Driedger_AudioDecomposition_PhD-Thesis.pdf (Look at Appendix B.2)

I have a first implementation here:

import numpy as np
from scipy import ndimage


def harmonic_summation(x_logspec):
    """Harmonic summation of a spectrogram.

    The function takes a log-frequency spectrogram and enhances it by summing up the harmonics.
    To compute the new spectrogram, we use the parameters suggested in [1]_.

    Parameters
    ----------
    x_logspec : ndarray
        Matrix with log-frequency spectrogram. (Preferably a reassigned spectrogram.)
        Rows are the frequencies (10-cent bins) and columns are the time indices.

    Returns
    -------
    x_sal : ndarray
        Enhanced spectrogram.

    References
    ----------
    .. [1] Justin Salamon, Emilia Gomez "Melody Extraction from Polyphonic Music
       Signals using Pitch Contour Characteristics",
       IEEE Transactions on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.
    """
    # freq_range_start = 55
    # freq_range_end = 1760
    n_harmonics = 20
    weighting_alpha = 0.8
    magnitude_threshold = 40  # in dB

    # offset (in bins) of a given harmonic on a 10-cent log-frequency axis
    harmonic_offset = lambda harmonic: int(np.floor(np.log2(harmonic) * 1200 / 10))

    x_sal = np.zeros(x_logspec.shape)

    # Convolve log-spec with a Hann window in frequency direction
    x_logspec = ndimage.convolve1d(x_logspec, np.hanning(10), axis=0, mode='constant')

    # Take the logarithmic spectrogram and add it to a frequency-wise shifted version
    # of itself. As we have a bin resolution of 10 cents, we shift the spectrum
    # by 120 bins to obtain the first harmonic, etc.
    for cur_harmonic in np.arange(1, n_harmonics + 1):
        # shift spectrogram
        cur_spec = np.zeros(x_logspec.shape)
        cur_offset = harmonic_offset(cur_harmonic)
        cur_spec[:-cur_offset-1, :] = x_logspec[cur_offset:-1, :]
        cur_spec_dB = 20 * np.log10(np.finfo(float).eps + cur_spec / np.max(cur_spec))

        # amplitude thresholding: discard bins more than `magnitude_threshold` dB below the maximum
        cur_spec[cur_spec_dB <= -magnitude_threshold] = 0

        # add to salience with geometric harmonic weighting
        x_sal += cur_spec * weighting_alpha ** (cur_harmonic - 1)

    return x_sal

So, I can't exactly follow your thoughts about the interpolation etc. For my use case of "frequency salience", I would expect something like this (rough code sketch after the list):

  1. Calculate an STFT
  2. Reassign it to log-freq axis with ifgram (from Abe or Phase Vocoder*)
  3. Do the subharmonic summation à la @justinsalamon
  4. Be happy.
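
A rough sketch of what I mean by steps 1-2 (the function name, hop size, and defaults are just placeholders; the 55-1760 Hz range and the 10-cent grid follow the Salamon & Gomez paper; librosa.ifgram supplies the instantaneous frequencies):

import numpy as np
import librosa

def reassigned_log_spec(y, sr, fmin=55.0, fmax=1760.0, bins_per_octave=120,
                        n_fft=2048, hop_length=512):
    # Steps 1-2: STFT + instantaneous frequencies, then accumulate the
    # magnitudes onto a log-frequency axis with 10-cent resolution.
    if_gram, D = librosa.ifgram(y, sr=sr, n_fft=n_fft, hop_length=hop_length)
    mags = np.abs(D)

    n_bins = int(np.ceil(bins_per_octave * np.log2(fmax / fmin)))
    out = np.zeros((n_bins, mags.shape[1]))

    # target log-frequency bin for every (bin, frame) instantaneous frequency
    idx = np.floor(bins_per_octave *
                   np.log2(np.maximum(if_gram, 1e-6) / fmin)).astype(int)
    valid = (idx >= 0) & (idx < n_bins)

    # accumulate magnitudes frame by frame
    for t in range(mags.shape[1]):
        np.add.at(out[:, t], idx[valid[:, t], t], mags[valid[:, t], t])

    return out

The output could then be fed straight into the harmonic_summation function above for step 3.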

But I bet @bmcfee has something more general in mind. Could you explain this a little further for me, please?

*) Actually, I'm working on getting the ifgram with the Phase Vocoder...trying to reuse your time-stretching code, so we would have two methods to get the ifgram.

justinsalamon commented 8 years ago

So the term salience function (in the context of pitch analysis) appears in a bunch of places; one of the first papers I encountered (though not necessarily one of the first papers to use it) is:

[1] A. Klapuri. Multiple fundamental frequency estimation by summing harmonic amplitudes. In Proc. 7th International Conference on Music Information Retrieval, Victoria, Canada, October 2006.

Since then I've been using the term quite consistently, e.g.:

[1] J. Salamon. Melody Extraction from Polyphonic Music Signals. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
[2] J. Salamon and E. Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, Aug. 2012.
[3] J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In 14th Int. Conf. on Digital Audio Effects (DAFx-11), pages 73–80, Paris, France, Sep. 2011.
[4] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard. Melody extraction from polyphonic music signals: Approaches, applications and challenges. IEEE Signal Processing Magazine, 31(2):118–134, Mar. 2014.
[5] J. Salamon, S. Gulati, and X. Serra. A multipitch approach to tonic identification in Indian classical music. In 13th Int. Soc. for Music Info. Retrieval Conf., pages 499–504, Porto, Portugal, Oct. 2012.
[6] J. Salamon, G. Peeters, and A. Röbel. Statistical characterisation of melodic pitch contours and its application for melody extraction. In 13th Int. Soc. for Music Info. Retrieval Conf., pages 187–192, Porto, Portugal, Oct. 2012.

And while I definitely won't go as far as saying that it's an established term in the community, its use is slowly expanding (in part due to [4]).

The definition provided in these papers for salience function is very broad, and can basically be summarized as: a time-frequency representation of pitch salience. That is, a salience function is not tied to any specific approach/implementation (e.g. subharmonic summation), and examples of salience functions that are not based on SHS include Goto's function, which is obtained via tone model matching using EM, and Durrieu's function, which is obtained from a source/filter model (Durrieu also uses the term salience function in his papers):

[1] M. Goto. A real-time music-scene-description system: predominant-f0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43:311–329, 2004.
[2] J.-L. Durrieu, B. David, and G. Richard. A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE Journal on Selected Topics on Signal Processing, 5(6):1180–1191, Oct. 2011.

So if you asked me, the defining features of a salience function would be:

- It's a time-frequency representation.
- This representation has more "energy" at frequencies that correspond to a perceived pitch. The key here is the term pitch, i.e., a salience function (unlike a spectrogram, constant-Q transform, etc.) only highlights perceived pitches, not harmonics.
- A salience function doesn't have to highlight all pitches present in the signal; it can, but it could also be tuned to highlight specific pitches (e.g. a salience function for melody extraction or bass line extraction).

My 2¢.

bmcfee commented 8 years ago

I think this is getting to the spirit of my question: is there an underlying principle or computation that we can abstract and apply to many contexts (ie, not just pitch)?

The key component that seems most general is converting a time-freq matrix representation to a harmonic-time-freq tensor representation. (Those who see me in person are probably tired to death of hearing me rant about this, sorry :grin:.) The inputs here would be the TF representation, vector/matrix encoding the frequencies of each bin (ie as derived from fft_frequencies or ifgram or whatever), and the list of harmonics to extract, eg, range(-3, 4).

Given that output tensor, you can then filter / aggregate it however you like. For example, to keep only the frequencies with higher energy than their harmonics:

>>> S = np.abs(librosa.stft(y))
>>> S_harmonics = harmonics(S, librosa.fft_frequencies(sr=sr, n_fft=2048), range(-3, 4))
>>> S_harm_peaks = np.max(S_harmonics, axis=0)
>>> S_peak = S * (S >= S_harm_peaks)

the same could easily be applied to a CQT or a tempogram.

What this would buy you, over the snippet that @stefan-balke posted above, is that you can interpolate between bins for a more precise energy measurement, and not assume a specific frequency set. Having this bit abstracted out would also simplify things like perceptual loudness weighting, which could be implemented between harmonic extraction and aggregation by a simple broadcast-multiply op.
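
To be concrete, a bare-bones version of that harmonics helper could be as simple as the sketch below; the name, the signature, and the choice to encode subharmonics as fractional multipliers (0.5 instead of -2) are all up for debate:

import numpy as np
from scipy.interpolate import interp1d

def harmonics(S, freqs, h_multipliers):
    # Estimate S at h * f for every bin frequency f and multiplier h.
    # Returns an array of shape (len(h_multipliers), n_freqs, n_frames).
    interp = interp1d(freqs, S, axis=0, bounds_error=False, fill_value=0.0)
    return np.stack([interp(h * freqs) for h in h_multipliers])

Loudness weighting would then just be a broadcast multiply along the first axis before aggregating.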

justinsalamon commented 8 years ago

I think this is getting to the spirit of my question: is there an underlying principle or computation that we can abstract and apply to many contexts (ie, not just pitch)?

I'm not sure - since this process depends on how the phenomenon you wish to highlight as salient is perceived. We know that the perception of pitch is (in part) motivated by the presence of harmonic structure. Not sure how that would translate to other phenomena.

The inputs here would be the TF representation, vector/matrix encoding the frequencies of each bin (ie as derived from fft_frequencies or ifgram or whatever), and the list of harmonics to extract, eg, range(-3, 4).

What do you mean by the "list of harmonics to extract"? Harmonic summation takes an f0 candidate and sums the energy at all spectral bins corresponding to harmonics above the candidate (with potential weighting, and some approaches only sum peaks, others sum the max value over a bin range centered on the expected harmonic location, etc.). Sub-harmonic summation works backwards: every spectral peak contributes energy to all possible f0 candidates of which the peak could be a harmonic (so f0/2, f0/3, f0/4, f0/5, etc.).

bmcfee commented 8 years ago

We know that the perception of pitch is (in part) motivated by the presence of harmonic structure. Not sure how that would translate to other phenomena.

Imagine an onset autocorrelation curve: it also has harmonics at multiples/subdivisions of the tempo.

What do you mean by the "list of harmonics to extract"?

Exactly what it sounds like: which multiples of f0 do you care about?

Harmonic summation takes an f0 candidate and sums the energy at all spectral bins corresponding to harmonics above the candidate (with potential weighting, and some approaches only sum peaks, others sum the max value over a bin range centered on the expected harmonic location, etc.).

In general you don't know f0. I'm thinking of a function which would treat every tf bin as a potential f0 and apply the corresponding summation / weighting / whatever. The first step is just estimating the energy at the relevant harmonics; if you want to do some localmax, binning, smoothing, etc, that's a second step. Way down the line is actually picking out a specific f0 of interest.

justinsalamon commented 8 years ago

The first step is just estimating the energy at the relevant harmonics; if you want to do some localmax, binning, smoothing, etc, that's a second step. Way down the line is actually picking out a specific f0 of interest.

Yes, but there are different ways to estimate the energy at relevant harmonics. For instance, you can treat every tf bin as a potential f0 and sum only the energy at the precise bin locations of the harmonics. Or, you could estimate the harmonic bin locations but then take the local max for each harmonic bin over some bandwidth (that's what Klapuri did in that paper). Or you could only sum the energy if there's a spectral peak in the expected frequency bandwidth centered on the harmonic bin. All three are different flavors of harmonic summation.

But you could, alternatively, apply sub-harmonic summation - in this case you don't treat every tf bin as a potential f0: you first pick spectral peaks and then using only those peaks you treat each peak as a potential harmonic and "sum it into" all tf-bins of which the peak could be a harmonic (that's what happens in Melodia for instance), where rather than mapping only to the precise bins f0/2, f0/3, f0/4, etc., you actually place some kernel (e.g. a gaussian or hann window) centered on the bin location, to account for slight inharmonicities.

Basically what I'm saying is that regardless of the post-processing you apply, there are various variants of the summation strategy itself to choose from.
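
Just to make the peak-based flavor concrete, a completely schematic sketch (the function name, bin layout, kernel width, and weights are arbitrary; it assumes a log-frequency input with 120 bins per octave, i.e. 10-cent bins):

import numpy as np
import scipy.signal

def shs_from_peaks(S, n_harmonics=8, kernel_width=5, alpha=0.8,
                   bins_per_octave=120):
    # Peak-based sub-harmonic summation: each spectral peak contributes
    # (through a small window, to allow for inharmonicity) to every f0 bin
    # of which it could be a harmonic.
    kernel = np.hanning(2 * kernel_width + 1)
    sal = np.zeros_like(S)

    for t in range(S.shape[1]):
        peaks, _ = scipy.signal.find_peaks(S[:, t])
        for p in peaks:
            for h in range(1, n_harmonics + 1):
                # if the peak is the h-th harmonic of some f0, that f0
                # sits log2(h) octaves below the peak
                f0_bin = int(round(p - bins_per_octave * np.log2(h)))
                if f0_bin + kernel_width < 0:
                    continue
                lo = max(f0_bin - kernel_width, 0)
                hi = min(f0_bin + kernel_width + 1, S.shape[0])
                if hi <= lo:
                    continue
                k_lo = lo - (f0_bin - kernel_width)
                contrib = alpha ** (h - 1) * S[p, t]
                sal[lo:hi, t] += contrib * kernel[k_lo:k_lo + (hi - lo)]
    return sal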

bmcfee commented 8 years ago

@justinsalamon I don't think we're on the same page here.

If I understand correctly, you're describing ways of filtering, smoothing, and aggregating energy at different harmonics.

I'm talking about something different: a function that maps a spectrogram S and list of harmonics h to an array of spectrograms Sh where Sh[k] is just S estimated at the h[k] harmonic.

The point of this is to separate the logic for estimating energy at harmonically related frequencies from a specific time-frequency representation. That way, the same method can be applied equally well to linear (stft), logarithmic (cqt), inverse (tempogram), or non-uniform (ifgram) frequency spaces.

This gist demonstrates a first hack at this (the harmonics function). The applications of the method in the notebook need a lot of work, but hopefully this illustrates the separation of functionality that I'm aiming for here.

justinsalamon commented 8 years ago

@justinsalamon I don't think we're on the same page here.

Neither do I :) let's discuss in person on Tuesday, I have a hunch that'll be quicker. We can post a summary here for posterity once we've synced up.

bmcfee commented 8 years ago

Neither do I :) let's discuss in person on Tuesday, I have a hunch that'll be quicker.

Sure. In hindsight I realized that I've totally derailed this thread from salience into harmonic estimation ¯\_(ツ)_/¯

justinsalamon commented 8 years ago

Following our discussion today, this thought came up - the basic approach for matching harmonics is by creating a tensor where each channel represents a transformed version of the input representation, e.g in the case of a spectrogram it would be "compressed", a constant-Q would be shifted. For the latter case, couldn't this frequency "compression" result in a detrimental loss of resolution, especially in the low frequency range?

To illustrate: say you have an f0 candidate at 50 Hz (and harmonics at 100, 150, etc.). On a linear frequency axis, assuming your frequency resolution is sufficient, you are able to distinguish between these harmonic peaks. However, after compressing the frequency axis (such that the second harmonic in the second channel is now aligned with the first harmonic in the first channel, if I understand correctly) you might end up with a representation that doesn't have sufficient resolution in the low-frequency range to distinguish between the harmonics, especially in the presence of other pitched sources.

To make a long story short - I'm not sure the process of creating the tensor (from a spectrogram) is equivalent to a salience function that operates on a linear frequency scale of the input representation.

bmcfee commented 8 years ago

Following our discussion today, this thought came up - the basic approach for matching harmonics is by creating a tensor where each channel represents a transformed version of the input representation, e.g in the case of a spectrogram it would be "compressed", a constant-Q would be shifted. For the latter case, couldn't this frequency "compression" result in a detrimental loss of resolution, especially in the low frequency range?

The semantics of the harmonic tensor are similar to shifting/compressing the spectrogram, but they're not equivalent. All we're doing is making a representation where Sh[i, j, k] measures the i'th harmonic of the j'th frequency at time k. This measurement can be estimated in a variety of ways, but ultimately it all comes down to some form of interpolation. Nobody is claiming that the i'th harmonic plane has a linear frequency axis (or any other kind of scale, beyond what's implied by being harmonically related to the input scale).

To make a long story short - I'm not sure the process of creating the tensor (from a spectrogram) is equivalent to a salience function that operates on a linear frequency scale of the input representation.

No, it is not equivalent. But the frequency scale really shouldn't matter here; whether you want linear, logarithmic, or inverse frequencies, harmonic estimation works the same way on top of whatever scale you feed it.

bmcfee commented 8 years ago

Chatted with @stefan-balke about this yesterday, and looking back into it, my initial gist has a few subtle bugs in it.

Unless there are objections, I'll try to hammer these out and get the harmonic interpolator merged first. We can then look into building a salience map on top of it, among other things (tempo estimation, f0, etc).

bmcfee commented 8 years ago

#405 is merged now. Anyone want to take a crack at implementing a generic salience function?

rabitt commented 8 years ago

@bmcfee I'll give it a go