Sure, that could be useful.
How about we page in @justinsalamon and @rabitt for advice on this?
yeah sure why not :) There are some implementation details - you can do it all in matrix algebra but then it's not based on peak-picking, which generally gives a cleaner representation (the Essentia and Melodia implementations both use peak-picking). Sounds like a fun HAMR project I say.
@justinsalamon, I'm not necessarily talking about predominant melody extraction. I am more interested in having: a) some spectrogram reassignment in librosa (as an analysis alternative to cqt) and b) another midlevel representation based on subharmonic summation. MELODIA could be built upon these features, of course, but I don't know if it is smart to have another implementation, since Essentia is doing the same but much faster.
@stefan-balke sorry if I wasn't clear, I wasn't suggesting to implement melodia in librosa, just the salience function (which makes use of reassignment and subharmonic summation, but only for spectral peaks, which was the point I was trying to make)
Ah okay, gotcha. I created a small notebook for a reassigned spec based on the instantaneous frequency distribution which is already in librosa: http://nbviewer.jupyter.org/github/stefan-balke/mpa-exc/blob/master/02_fourier_transform/reassigned_spectrogram.ipynb
@justinsalamon did you use specific unit test scenarios? And also, did you use the IFD by Abe or a phase vocoder?
Coming back to this after spending some time hacking at rhythm features, where it seems like there are actually some commonalities. Specifically, Prockup (2015) uses subharmonics to amplify tempogram peaks.
@stefan-balke @justinsalamon is there a good reference for what exactly is meant by "salience function"?
I'm imagining a general-purpose (sub)harmonic enhancement function that can apply to stft, ifgram, cqt, or tempogram data. A crude first-pass implementation (for tempograms) is here. The algorithm is pretty simple: given the input representation and the frequencies of each bin (from fft_frequencies, cqt_frequencies, or ifgram output), interpolate the energy at each (sub)harmonic multiple of every bin.
The trickiest part here would be supporting ifgram output, since each column will have a different frequency domain, and the standard fast interpolators (interp1d or RectBivariateSpline) won't work.
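Something like a per-column 1-d interpolation might work for that case; a rough, untested sketch (names and behavior are just for illustration):

import numpy as np

def interp_harmonic(S, freqs, h):
    # Rough sketch: estimate the energy at harmonic multiple h of every bin.
    # S     : (n_bins, n_frames) magnitude matrix
    # freqs : (n_bins,) shared bin frequencies, or (n_bins, n_frames)
    #         per-frame frequencies as produced by an ifgram-style analysis
    # h     : harmonic multiple, e.g. 2 for one octave up, 0.5 for one down
    S_h = np.zeros_like(S)
    for t in range(S.shape[1]):
        f = freqs if freqs.ndim == 1 else freqs[:, t]
        # np.interp needs a monotonically increasing grid, which is exactly
        # what ifgram output does not guarantee, hence the per-column sort
        order = np.argsort(f)
        S_h[:, t] = np.interp(h * f, f[order], S[order, t], left=0, right=0)
    return S_h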
Thoughts?
Hey,
my point of reference for salience function is:
.. [1] Justin Salamon, Emilia Gomez "Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics", IEEE Transactions on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.
Another interesting reference could be Jonathan's dissertation (he's been a PhD since yesterday): https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-students/2016_Driedger_AudioDecomposition_PhD-Thesis.pdf (look at Appendix B.2).
I have a first implementation here:

import numpy as np
import scipy.ndimage as ndimage


def harmonic_summation(x_logspec):
    """Harmonic summation of a spectrogram.

    The function takes a log-frequency spectrogram and enhances it by
    summing up the harmonics, using the parameters suggested in [1]_.

    Parameters
    ----------
    x_logspec : ndarray
        Log-frequency spectrogram (preferably a reassigned spectrogram).
        Rows are frequencies, columns are time indices.

    Returns
    -------
    x_sal : ndarray
        Enhanced spectrogram.

    References
    ----------
    .. [1] Justin Salamon and Emilia Gomez, "Melody Extraction from Polyphonic
       Music Signals using Pitch Contour Characteristics", IEEE Transactions
       on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.
    """
    # freq_range_start = 55
    # freq_range_end = 1760
    n_harmonics = 20
    weighting_alpha = 0.8
    magnitude_threshold = 40  # in dB, relative to the spectrogram maximum

    # With a bin resolution of 10 cents (120 bins per octave), harmonic k sits
    # log2(k) * 1200 / 10 bins above the fundamental (120 bins for the octave, etc.)
    harmonic_offset = lambda harmonic: int(np.floor(np.log2(harmonic) * 1200 / 10))

    x_sal = np.zeros(x_logspec.shape)

    # Smooth the log-spectrogram with a Hann window along the frequency axis
    x_logspec = ndimage.convolve1d(x_logspec, np.hanning(10), axis=0, mode='constant')

    # Add the log-frequency spectrogram to frequency-shifted versions of itself
    for cur_harmonic in range(1, n_harmonics + 1):
        cur_offset = harmonic_offset(cur_harmonic)
        if cur_offset >= x_logspec.shape[0]:
            break

        # Shift the spectrogram down by the offset of the current harmonic
        cur_spec = np.zeros(x_logspec.shape)
        cur_spec[:x_logspec.shape[0] - cur_offset, :] = x_logspec[cur_offset:, :]

        # Magnitude in dB relative to the maximum
        cur_spec_dB = 20 * np.log10(np.finfo(float).eps + cur_spec / np.max(cur_spec))

        # Discard bins more than magnitude_threshold dB below the maximum
        cur_spec[cur_spec_dB < -magnitude_threshold] = 0

        # Add to the salience map with harmonic weighting
        x_sal += cur_spec * weighting_alpha ** (cur_harmonic - 1)

    return x_sal
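A hypothetical usage sketch (assuming a log-frequency magnitude spectrogram with 10 cents per bin, e.g. from a high-resolution CQT starting at 55 Hz):

>>> C = np.abs(librosa.cqt(y, sr=sr, fmin=55.0, bins_per_octave=120, n_bins=120 * 6))
>>> C_sal = harmonic_summation(C)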
So, I can't exactly follow your thoughts about the interpolation etc. In general, for my use case of "frequency salience" I would expect something like this:
But I bet @bmcfee has something more general in mind. Could you explain this a little further for me, please?
*) Actually, I'm working on getting the ifgram with the phase vocoder... trying to reuse your time-stretching code, so we would have two methods to get the ifgram.
So the term salience function (in the context of pitch analysis) appears in a bunch of places, one of the first papers I encountered (though not necessarily one of the first papers to use it) is:
[1] A. Klapuri. Multiple fundamental frequency estimation by summing harmonic amplitudes. In Proc. 7th International Conference on Music Information Retrieval, Victoria, Canada, October 2006.
Since then I've been using the term quite consistently, e.g.:
[1] J. Salamon. Melody Extraction from Polyphonic Music Signals. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
[2] J. Salamon and E. Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, Aug. 2012.
[3] J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In 14th Int. Conf. on Digital Audio Effects (DAFx-11), pages 73–80, Paris, France, Sep. 2011.
[4] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard. Melody extraction from polyphonic music signals: Approaches, applications and challenges. IEEE Signal Processing Magazine, 31(2):118–134, Mar. 2014.
[5] J. Salamon, S. Gulati, and X. Serra. A multipitch approach to tonic identification in indian classical music. In 13th Int. Soc. for Music Info. Retrieval Conf., pages 499–504, Porto, Portugal, Oct. 2012.
[6] J. Salamon, G. Peeters, and A. Röbel. Statistical characterisation of melodic pitch contours and its application for melody extraction. In 13th Int. Soc. for Music Info. Retrieval Conf., pages 187–192, Porto, Portugal, Oct. 2012.
And while I definitely won't go as far as saying that it's an established term in the community, its use is slowly expanding (in part due to [4]).
The definition provided in these papers for salience function is very broad, and can basically be summarized as: a time-frequency representation of pitch salience. That is, a salience function is not tied to any specific approach/implementation (e.g. subharmonic summation), and examples of salience functions that are not based on SHS include Goto's function, which is obtained via tone model matching using EM, and Durrieu's function, which is obtained from a source/filter model (Durrieu also uses the term salience function in his papers):
[1] M. Goto. A real-time music-scene-description system: predominant-f0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43:311–329, 2004.
[2] J.-L. Durrieu, B. David, and G. Richard. A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE Journal on Selected Topics on Signal Processing, 5(6):1180–1191, Oct. 2011.
So if you asked me, the defining features of a salience function would be:
- It's a time-frequency representation
- This representation has more "energy" at frequencies that correspond to a perceived pitch

The key here is the term pitch, i.e., a salience function (unlike a spectrogram, constant-Q transform, etc.) only highlights perceived pitches, not harmonics. A salience function doesn't have to highlight all pitches present in the signal: it can, but it could also be tuned to highlight specific pitches (e.g. a salience function for melody extraction or bass line extraction).

My 2¢.
I think this is getting to the spirit of my question: is there an underlying principle or computation that we can abstract and apply to many contexts (ie, not just pitch)?
The key component that seems most general is converting a time-freq matrix representation to a harmonic-time-freq tensor representation. (Those who see me in person are probably tired to death of hearing me rant about this, sorry :grin:.) The inputs here would be the TF representation, a vector/matrix encoding the frequencies of each bin (i.e. as derived from fft_frequencies or ifgram or whatever), and the list of harmonics to extract, e.g., range(-3, 4).
Given that output tensor, you can then filter / aggregate it however you like. For example, to keep only the frequencies with higher energy than their harmonics:
>>> S = np.abs(librosa.stft(y))
>>> S_harmonics = harmonics(S, librosa.fft_frequencies(sr=sr, n_fft=2048), range(-3, 4))
>>> S_harm_peaks = np.max(S_harmonics, axis=0)
>>> S_peak = S * (S >= S_harm_peaks)
The same could easily be applied to a CQT or a tempogram.
What this would buy you, over the snippet that @stefan-balke posted above, is that you can interpolate between bins for a more precise energy measurement, and not assume a specific frequency set. Having this bit abstracted out would also simplify things like perceptual loudness weighting, which could be implemented between harmonic extraction and aggregation by a simple broadcast-multiply op.
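For instance, assuming the harmonic axis is the leading axis of the tensor (as in the snippet above), a toy geometric weighting is just one broadcast:

>>> # Hypothetical weighting: attenuate higher-index harmonics geometrically
>>> weights = 0.8 ** np.arange(S_harmonics.shape[0])
>>> S_weighted = weights[:, np.newaxis, np.newaxis] * S_harmonics
>>> # Aggregate over the harmonic axis, e.g. by summation
>>> S_salience = S_weighted.sum(axis=0)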
I think this is getting to the spirit of my question: is there an underlying principle or computation that we can abstract and apply to many contexts (ie, not just pitch)?
I'm not sure - since this process depends on how the phenomenon you wish to highlight as salient is perceived. We know that the perception of pitch is (in part) motivated by the presence of harmonic structure. Not sure how that would translate to other phenomena.
The inputs here would be the TF representation, vector/matrix encoding the frequencies of each bin (ie as derived from fft_frequencies or ifgram or whatever), and the list of harmonics to extract, eg, range(-3, 4).
What do you mean by the "list of harmonics to extract"? Harmonic summation takes an f0 candidate and sums the energy at all spectral bins corresponding to harmonics above the candidate (with potential weighting, and some approaches only sum peaks, others sum the max value over a bin range centered on the expected harmonic location, etc.). Sub-harmonic summation works backwards: every spectral peak contributes energy to all possible f0 candidates of which the peak could be a harmonic (so f0/2, f0/3, f0/4, f0/5, etc.).
We know that the perception of pitch is (in part) motivated by the presence of harmonic structure. Not sure how that would translate to other phenomena.
Imagine an onset autocorrelation curve: it also has harmonics at multiples/subdivisions of the tempo.
What do you mean by the "list of harmonics to extract"?
Exactly what it sounds like: which multiples of f0 do you care about?
Harmonic summation takes an f0 candidate and sums the energy at all spectral bins corresponding to harmonics above the candidate (with potential weighting, and some approaches only sum peaks, others sum the max value over a bin range centered on the expected harmonic location, etc.).
In general you don't know f0. I'm thinking of a function which would treat every tf bin as a potential f0 and apply the corresponding summation / weighting / whatever. The first step is just estimating the energy at the relevant harmonics; if you want to do some localmax, binning, smoothing, etc, that's a second step. Way down the line is actually picking out a specific f0 of interest.
The first step is just estimating the energy at the relevant harmonics; if you want to do some localmax, binning, smoothing, etc, that's a second step. Way down the line is actually picking out a specific f0 of interest.
Yes, but there are different ways to estimate the energy at relevant harmonics. For instance, you can treat every tf bin as a potential f0 and sum only the energy at the precise bin locations of the harmonics. Or, you could estimate the harmonic bin locations but then take the local max for each harmonic bin over some bandwidth (that's what Klapuri did in that paper). Or you could only sum the energy if there's a spectral peak in the expected frequency bandwidth centered on the harmonic bin. All three are different flavors of harmonic summation.
But you could, alternatively, apply sub-harmonic summation - in this case you don't treat every tf bin as a potential f0: you first pick spectral peaks and then using only those peaks you treat each peak as a potential harmonic and "sum it into" all tf-bins of which the peak could be a harmonic (that's what happens in Melodia for instance), where rather than mapping only to the precise bins f0/2, f0/3, f0/4, etc., you actually place some kernel (e.g. a gaussian or hann window) centered on the bin location, to account for slight inharmonicities.
Basically what I'm saying is that regardless of the post-processing you apply, there are various variants of the summation strategy itself to choose from.
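To make the peak-based variant concrete, a rough (untested) sketch on a log-frequency spectrogram with 120 bins per octave could look like this; all names and parameter values are just for illustration:

import numpy as np
import scipy.signal

def subharmonic_summation(S, n_harmonics=10, alpha=0.8, kernel_width=5):
    # S: (n_bins, n_frames) log-frequency magnitude spectrogram, 120 bins/octave
    bins_per_octave = 120
    salience = np.zeros_like(S)
    kernel = np.hanning(kernel_width)
    for t in range(S.shape[1]):
        # 1) pick spectral peaks in this frame
        peaks, _ = scipy.signal.find_peaks(S[:, t])
        for p in peaks:
            # 2) each peak contributes to every f0 of which it could be a
            #    harmonic: on a log-frequency axis, harmonic h sits
            #    log2(h) * bins_per_octave bins above the fundamental
            for h in range(1, n_harmonics + 1):
                f0_bin = p - int(round(np.log2(h) * bins_per_octave))
                if f0_bin < 0:
                    break
                # 3) spread the contribution with a small kernel to allow for
                #    slight inharmonicity, weighted by alpha**(h - 1)
                lo = max(0, f0_bin - kernel_width // 2)
                hi = min(S.shape[0], f0_bin + kernel_width // 2 + 1)
                k_lo = lo - (f0_bin - kernel_width // 2)
                salience[lo:hi, t] += (alpha ** (h - 1)) * S[p, t] * kernel[k_lo:k_lo + hi - lo]
    return salience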
@justinsalamon I don't think we're on the same page here.
If I understand correctly, you're describing ways of filtering, smoothing, and aggregating energy at different harmonics.
I'm talking about something different: a function that maps a spectrogram S and a list of harmonics h to an array of spectrograms Sh, where Sh[k] is just S estimated at the h[k] harmonic.
The point of this is to separate the logic for estimating energy at harmonically related frequencies from a specific time-frequency representation. That way, the same method can be applied equally well to linear (stft), logarithmic (cqt), inverse (tempogram), or non-uniform (ifgram) frequency spaces.
This gist demonstrates a first hack at this (the harmonics function). The applications of the method in the notebook need a lot of work, but hopefully this illustrates the separation of functionality that I'm aiming for here.
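For example (a sketch reusing the hypothetical harmonics helper from the snippet above; the frequency helpers shown are just one assumed way to get bin frequencies), the same call pattern would cover a CQT or a tempogram:

>>> C = np.abs(librosa.cqt(y, sr=sr))
>>> C_harmonics = harmonics(C, librosa.cqt_frequencies(C.shape[0], fmin=librosa.note_to_hz('C1')), range(-3, 4))
>>> oenv = librosa.onset.onset_strength(y=y, sr=sr)
>>> T = librosa.feature.tempogram(onset_envelope=oenv, sr=sr)
>>> T_harmonics = harmonics(T, librosa.tempo_frequencies(T.shape[0], sr=sr), range(-3, 4))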
@justinsalamon I don't think we're on the same page here.
Neither do I :) let's discuss in person on Tuesday, I have a hunch that'll be quicker. We can post a summary here for posterity once we've synced up.
Neither do I :) let's discuss in person on Tuesday, I have a hunch that'll be quicker.
Sure. In hindsight I realized that I've totally derailed this thread from salience into harmonic estimation ¯\_(ツ)_/¯
Following our discussion today, this thought came up: the basic approach for matching harmonics is to create a tensor where each channel represents a transformed version of the input representation, e.g. in the case of a spectrogram it would be "compressed", a constant-Q would be shifted. For the latter case, couldn't this frequency "compression" result in a detrimental loss of resolution, especially in the low frequency range?
To illustrate: say you have an f0 candidate at 50 Hz (and harmonics at 100, 150, etc.). On a linear frequency axis, assuming your frequency resolution is sufficient, you are able to distinguish between these harmonic peaks. However, after compressing the frequency axis (such that the second harmonic in the second channel is now aligned with the first harmonic in the first channel, if I understand correctly) you might end up with a representation that doesn't have sufficient resolution in the low-frequency range to distinguish between the harmonics, especially in the presence of other pitched sources.
To make a long story short - I'm not sure the process of creating the tensor (from a spectrogram) is equivalent to a salience function that operates on a linear frequency scale of the input representation.
Following our discussion today, this thought came up: the basic approach for matching harmonics is to create a tensor where each channel represents a transformed version of the input representation, e.g. in the case of a spectrogram it would be "compressed", a constant-Q would be shifted. For the latter case, couldn't this frequency "compression" result in a detrimental loss of resolution, especially in the low frequency range?
The semantics of the harmonic tensor are similar to shifting/compressing the spectrogram, but they're not equivalent. All we're doing is making a representation where Sh[i, j, k] measures the i'th harmonic of the j'th frequency at time k. This measurement can be estimated in a variety of ways, but ultimately it all comes down to some form of interpolation. Nobody is claiming that the i'th harmonic plane has a linear frequency axis (or any other kind of scale, beyond what's implied by being harmonically related to the input scale).
To make a long story short - I'm not sure the process of creating the tensor (from a spectrogram) is equivalent to a salience function that operates on a linear frequency scale of the input representation.
No, it is not equivalent. But the frequency scale really shouldn't matter here; whether you want linear, logarithmic, or inverse frequencies, harmonic estimation works the same way.
Chatted with @stefan-balke about this yesterday, and looking back into it, my initial gist has a few subtle bugs in it.
Unless there are objections, I'll try to hammer these out and get the harmonic interpolator merged first. We can then look into building a salience map on top of it, among other things (tempo estimation, f0, etc).
@bmcfee I'll give it a go
Hey there,
what about adding a salience function to librosa à la Justin Salamon for analysis tasks? One way to do it is with a reassigned spectrogram based on instantaneous frequencies. Then add the subharmonics.
Testing material could come from Essentia.
Before collecting the stuff (I have everything somewhere), I wanted to ask if this is worth the work.
Best Stefan