librosa / librosa

Python library for audio and music analysis
https://librosa.org/
ISC License

Pitch detector #1102

Closed: Tronic closed this issue 3 years ago

Tronic commented 4 years ago

Hello,

I arrived at librosa while looking for libraries that could host my pitch detection algorithm. The algorithm is the third revision of the Performous vocal pitch detector, based on the FFT reassignment method for finding precise frequencies, which are then combined into tones with the most likely fundamental frequencies and their corresponding harmonics; this third revision I rewrote in Python/Numpy instead of C++ like the earlier ones. It can now detect one or a few simultaneous tones with very little noise and accuracy better than 1 Hz, in real time (low latency, moderate CPU usage).

The question is, are you willing to take such non-standard algorithms into this library? The FFT reassignment method could be one function and the combination of harmonics another. Neither is specific to singing vocals, so they could be used to detect any tonal signal. If that sounds good, I'll make a PR.

One consideration is that the algorithm needs to keep temporal state (overlapping FFTs and calculating phase differences). The current implementation is in class form so that this state can be preserved across incoming audio frames. If reduced to free function form like existing librosa algorithms, this cannot work in real time. If preferred, I could still restructure the algorithm into plain functions. For illustration, a minimal sketch of the kind of stateful wrapper meant here follows (the class and method names are invented, not the actual Performous code):
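import numpy as np

class PitchTrackerState:
    """Hypothetical wrapper carrying leftover samples between incoming audio blocks,
    so that overlapping FFT frames (and the phase differences computed between them)
    can span block boundaries."""

    def __init__(self, n_fft=1024, hop_length=256):
        self.n_fft = n_fft
        self.hop_length = hop_length
        self._carry = np.zeros(0, dtype=np.float32)

    def push(self, block):
        y = np.concatenate([self._carry, np.asarray(block, dtype=np.float32)])
        if len(y) < self.n_fft:
            self._carry = y
            return y[:0]  # not enough audio for a full frame yet
        n_frames = 1 + (len(y) - self.n_fft) // self.hop_length
        # Keep everything after the start of the next (unprocessed) frame for the next call
        self._carry = y[n_frames * self.hop_length:]
        return y  # the per-block FFT / phase-difference analysis would run on y here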

Tronic commented 4 years ago

I haven't done any formal tests (anyone care to instruct me how?), but here is a quick test against the yin implementation of #1063; pyin was very similar, so I didn't include it separately:

[Plot: pitch detection comparison]

The audio file contains singing (with vibrato clearly visible on its fundamental and harmonics) and quieter piano sounds that are completely steady.

bmcfee commented 4 years ago

Thanks for this, it looks interesting! A few responses / questions for you:

The algorithm is the third revision of the Performous vocal pitch detector, based on the FFT reassignment method for finding precise frequencies, which are then combined into tones with the most likely fundamental frequencies and their corresponding harmonics; this third revision I rewrote in Python/Numpy instead of C++ like the earlier ones.

Do you have a slightly more detailed technical writeup of this method that we can look at?

The question is, are you willing to take such non-standard algorithms into this library?

I think this depends on A) how much new code would need to be developed, B) how reusable said code would be, and C) how this method compares to other methods (e.g., yin/pyin).

My hunch, based on your description, is that we wouldn't actually need much new code for this. Librosa already has reassigned spectrograms, harmonic interpolation, and saliency aggregation (see the docs): could your method be implemented using those without modification (or with minimal extension)?
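For example, a rough sketch of those existing pieces (argument names follow the librosa releases from around this time and may differ in newer versions; the test signal and harmonic range here are arbitrary):

import numpy as np
import librosa

sr = 22050
y = librosa.chirp(220, 440, sr=sr, duration=2.0)  # synthetic test signal
S = np.abs(librosa.stft(y, n_fft=2048))
fft_freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
# Salience map: aggregate energy at integer harmonics of each bin
sal = librosa.salience(S, fft_freqs, [1, 2, 3, 4], fill_value=0)
# Reassigned spectrogram: per-bin instantaneous frequencies, times and magnitudes
freqs, times, mags = librosa.reassigned_spectrogram(y, sr=sr, n_fft=2048)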

If reduced to free function form like existing librosa algorithms, this cannot work in real time. If preferred, I could still restructure the algorithm into plain functions.

If we implement it in librosa, it will not be in a class form. We do have a pattern for stream processing that I think could accommodate this algorithm: take a look at the streaming pcen example in the gallery to see how that works.
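A rough sketch of that streaming pattern, loosely following the gallery example (the file name and block parameters are placeholders):

import librosa

sr = librosa.get_samplerate('audio.wav')
stream = librosa.stream('audio.wav', block_length=16, frame_length=1024, hop_length=256)
for block in stream:
    # center=False keeps frames aligned across blocks; any temporal state
    # (e.g. the previous frame's phases) would be carried between iterations
    Y = librosa.stft(block, n_fft=1024, hop_length=256, center=False)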

I haven't done any formal tests (anyone care to instruct me how?)

Probably the thing to do is to grab some pitch-tracking datasets from http://ismir.net/resources/datasets/ and evaluate the accuracy of your method with mir_eval.
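For example, a minimal sketch of an evaluation call with mir_eval.melody (the arrays below are synthetic placeholders; real reference frequencies would come from dataset annotations and estimates from the pitch tracker):

import numpy as np
import mir_eval

times = np.arange(0, 2.0, 0.01)
ref_freq = np.full_like(times, 220.0)
est_freq = ref_freq * 2 ** (np.random.uniform(-10, 10, times.size) / 1200)  # ~10 cent jitter
scores = mir_eval.melody.evaluate(times, ref_freq, times, est_freq)
print(scores['Raw Pitch Accuracy'])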

Tronic commented 4 years ago

First part, rewritten in librosa-ish style:

import librosa
import numpy as np
from math import tau

def freq_power_spectrum(
    y, n_fft=1024, hop_length=256, center=True, pad_mode='reflect', **stft_args
):
    if center:
        y = np.pad(y, (n_fft + hop_length) // 2, mode=pad_mode)
    # Normalisation, ought to be handled within stft
    y = 2**.5 / .612 / n_fft * y
    Y = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, center=False, **stft_args)
    power = abs(Y)**2
    # Use phase differences to calculate true frequencies
    dp = np.diff(np.angle(Y) / tau, axis=-1)
    rp = np.fft.rfftfreq(n_fft)[..., None].astype(np.float32)
    # Return dimensions (freq|power) x bin x frame
    return np.stack([
        (np.round(rp * hop_length - dp) + dp) / hop_length,
        0.5 * (power[..., 1:] + power[..., :-1]),
    ])

Output: https://www.youtube.com/watch?v=ILhgmqPI56A

Perhaps there are already tools for FFT normalisation? I didn't want power values jumping all over the place depending on FFT settings, so I added a normalisation factor here (valid for hann/hamm windows).

All output frames are in effect midpoints between stft frames, but center=True adds extra padding to compensate if needed (output power then approaches zero on outermost frames).

import matplotlib.pyplot as plt

y = librosa.chirp(100, 700, duration=1) + librosa.chirp(500, 500, duration=1)

freqs, powers = freq_power_spectrum(y, n_fft=512, hop_length=128, center=False)
print(f"Power preserved y={(y**2).mean():.2f}, Y={powers.sum(axis=0).mean():.2f}")

# Compare with normal FFT
for t in np.linspace(0, freqs.shape[-1] - 1, 10, dtype=int):
    plt.xlim(0, 1000)
    plt.ylim(0, 1)
    plt.plot(22050 * np.fft.rfftfreq(512), powers[:, t])
    plt.plot(22050 * freqs[:, t], powers[:, t], marker="o")
    plt.show()

The dimensions are somewhat clumsy to work with; perhaps they should be transposed? I put the freq/power dimension first to allow unpacking expressions; the other two dimensions come from stft.

This part runs extremely fast, 3 ms per 48000 samples.

I suppose this should work with streams, given block n_fft and hop_length values different from the ones provided to this function, but I haven't checked this yet or calculated what those values should be.

Tronic commented 4 years ago

Second part: combine the duplicate frequencies. After this, each frame has a different number of peaks, so the indexing changes and the return value is a plain Python list.

def group_by_freq(peaks):
    """Group similar frequencies into power-weighted mean freq and total power."""
    n_fft = 2 * (peaks.shape[1] - 1)
    timeseries = []
    for frame in peaks.transpose(2, 0, 1):
        # Sort by frequency
        frame = frame[..., np.argsort(frame[0])]
        # Combine similar peaks together by power-weighted average
        idx, = np.nonzero(np.diff(frame[0], prepend=0) > 1 / n_fft)
        p = np.add.reduceat(frame[1], idx)
        f = np.add.reduceat(frame[1] * frame[0], idx) / p
        frame = np.stack([f, p])
        # Sort by power (strongest first)
        frame = frame[..., np.argsort(-frame[1])]
        timeseries.append(frame)
    return timeseries

peaks = group_by_freq(freq_power_spectrum(y))

https://youtu.be/5m7TUVk6HTc

All summing of components must be done by signal power, not by magnitude (this is also the reason why the first stage immediately converts to power). A sine wave at amplitude 1 has 0.5 power (you need a square wave or a āˆš2-amplitude sine for unit power).
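A quick numeric check of that claim (illustrative only):

import numpy as np

t = np.arange(48000) / 48000
print((np.sin(2 * np.pi * 440 * t) ** 2).mean())               # ~0.5 for a unit-amplitude sine
print(((2 ** 0.5 * np.sin(2 * np.pi * 440 * t)) ** 2).mean())   # ~1.0 for a sqrt(2)-amplitude sine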

As seen in the video, the frequency grouping reproduces the signals' true frequencies at 0.5 power very precisely, as long as the frequencies don't meet. The total power is 1.0, as expected.

When two frequencies fall in the same FFT bin, the total power varies between zero (opposite phase) and quadrupled (2.0, same phase), and that is the main reason for the violent action seen when the sweep crosses 700 Hz. This example uses a very small FFT size (512) to make the problem more apparent.
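A quick illustration of that interference effect (illustrative only):

import numpy as np

t = np.arange(48000) / 48000
a = np.sin(2 * np.pi * 440 * t)
# 0.5 power each; in phase the sum has 2.0 power (quadrupled), in opposite phase it cancels to 0
print((a ** 2).mean(), ((a + a) ** 2).mean(), ((a - a) ** 2).mean())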

Tronic commented 4 years ago

The first part, freq_power_spectrum, produces essentially identical results to librosa's reassigned_spectrogram, although the algorithm I derived is quite different. My implementation is simpler, doesn't produce NaNs, runs faster, and also calculates correct power values, but it does not reassign times.

The third part is combining harmonics; while it can be done independently of time, looking at the surrounding frames does make it better. Handling streaming in this scenario gets trickier. The current approach is quite the opposite of interp_harmonics, but I'll also look into that.

I'll get back to this maybe next week, and will also have a look at what to do with the first two parts to better fit within existing functions. As mentioned above, FFT power normalisation is a concern.

Looks like this will become a PR or two.

Tronic commented 4 years ago

Probably the thing to do is to grab some pitch-tracking datasets from http://ismir.net/resources/datasets/ and evaluate the accuracy of your method with mir_eval.

That's a lot of datasets, and the few that I checked so far did not come with useful labelling. Do you have any specific ones that could be used directly with mir_eval.melody?

bshall commented 4 years ago

Hi @Tronic. For the PYIN implementation I'm working on I used MDB-stem-synth. It'd be interesting if you used the same dataset so we could have some direct comparisons. There are also some benchmarks in the CREPE paper which might be good targets.

Tronic commented 4 years ago

Looks like I'll be quite busy for the rest of the month, so I haven't been able to make a PR yet, but here's an update.

I've simplified the algorithm into a few functions that can be used individually, of which only the first one operates over time. This reduces the accuracy slightly but makes it easier to understand.

bestpitch is used very similarly to yin, and multipitch can be used to extract multiple tones and their combined powers for a more detailed analysis.

import librosa
import numpy as np
from math import tau

def freq_power_spectrum(y, n_fft=1024, hop_length=256, center=True, pad_mode='reflect', **stft_args):
    if center:
        y = np.pad(y, (n_fft + hop_length) // 2, mode=pad_mode)
    norm = 2**.5 / .612 / n_fft  # This ought to be handled within stft
    Y = librosa.stft(y * norm, n_fft=n_fft, hop_length=hop_length, center=False, **stft_args)
    power = abs(Y)**2
    # Use phase differences to calculate true frequencies
    dp = np.diff(np.angle(Y) / tau, axis=-1)
    rp = np.fft.rfftfreq(n_fft)[..., None].astype(np.float32)
    # Return dimensions (freq|power) x bin x frame
    return np.stack([
        (np.round(rp * hop_length - dp) + dp) / hop_length,
        0.5 * (power[..., 1:] + power[..., :-1]),
    ])

def group_by_freq(frame, tol=4096**-1, min_power=1e-7):
    """Group similar frequencies into power-weighted mean freq and total power."""
    frame = frame[..., (frame[0] > 0) & (frame[1] > min_power)]
    # Sort by frequency
    frame = frame[..., np.argsort(frame[0])]
    # Combine similar peaks together by power-weighted average
    idx, = np.nonzero(np.diff(frame[0], prepend=0) > tol)
    p = np.add.reduceat(frame[1], idx)
    f = np.add.reduceat(frame[1] * frame[0], idx) / p
    frame = np.stack([f, p])
    # Sort by power (strongest first)
    frame = frame[..., np.argsort(-frame[1])]
    return frame

def match_harmonics(freqs, powers=None, max_tones=4, min_power=1e-5):
    """Combine harmonics into tones."""
    tones = []
    if freqs.shape:
        if powers is None:
            powers = np.ones(freqs.shape)
        m = np.divide.outer(freqs, freqs)
        harm = m.round()
        harm[abs(m - harm) > 0.1] = 0
        m = powers[:, None] * (harm > 0)
        # Construct tones using peaks matched by the winning hypothesis
        while len(tones) < max_tones and m.any():
            winner = np.argmax(m.sum(axis=0))
            matched, = np.where(m[:, winner])
            tone = np.average(
                freqs[matched] / harm[matched, winner],
                weights=powers[matched],
                returned=True
            )
            if tone[1] < min_power:
                break
            tones.append(tone)
            # Clear the winning hypothesis and any matched peaks
            m[:, winner] = 0
            m[matched, :] = 0
    return np.array(tones, dtype=np.float32).reshape(-1, 2)  # Reshape in case of empty

def multipitch(y, n_fft=1024, hop_length=256, center=True, max_tones=4, min_power=1e-5):
    """Multitonal f0 detection."""
    spectr = freq_power_spectrum(y, n_fft, hop_length, center)
    return [
        match_harmonics(
            *group_by_freq(frame, tol=1.0 / n_fft)[:, :32],
            max_tones=max_tones,
            min_power=min_power,
        )
        for frame in spectr.transpose(2, 0, 1)
    ]

def bestpitch(y, n_fft=1024, hop_length=256, center=True, sr=1.0, min_power=1e-5):
    """Estimate the dominating fundamental frequency for y over time."""
    ret = multipitch(
        y,
        n_fft=n_fft,
        hop_length=hop_length,
        center=center,
        max_tones=1,
        min_power=min_power,
    )
    return np.array([
        sr * r[0, 0] if r.size and r[0, 1] > min_power else 0.0
        for r in ret
    ])
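
A usage sketch for the functions above (the chirp input and parameters are arbitrary; multipitch returns frequencies in normalized units, while bestpitch scales by sr):

sr = 22050
y = librosa.chirp(110, 440, sr=sr, duration=2.0)
f0 = bestpitch(y, sr=sr)   # one f0 estimate per frame, in Hz (0.0 where nothing was found)
tones = multipitch(y)      # per frame: array of (normalized frequency, power) rows
print(f0[:5])
print(tones[0])            # strongest tones of the first frame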

lostanlen commented 4 years ago

Thank you @Tronic for sharing your code, and especially for making it modular and functional like the above. The results you shared on singing voice look quite impressive. The fact that you were able to implement this in under 100 LOCs of pure Python, without any external dependency, is also great.

That being said, it's too early to merge this into librosa, considering that the algorithm hasn't been published yet. Beyond its role as a library of audio processing routines, librosa has a "mission" of being pedagogical and easily accessible to newcomers to the field. At the moment, the algorithm isn't exactly clear from just looking at the code: instructions such as

frame = frame[..., np.argsort(frame[0])]

do not have an immediate mathematical equivalent.

I would encourage you to write a peer-reviewed article about this algorithm, with a benchmark against pYIN as well as state-of-the-art pitch estimation (CREPE and SPICE come to mind). Perhaps you can team up with @bshall ?

What I would expect to see in the paper is that your algorithm is (1) significantly better than pYIN and (2) significantly faster than CREPE on CPU (that part should be easy ... šŸ˜Š). By significantly better than pYIN, I mean that the raw pitch accuracy (RPA) of your algorithm is significantly above the RPA of pYIN in a paired t-test on MDB-stem-synth (p<0.005); see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html. Furthermore, I would love to see a chart of RPA vs f0 for various instruments. pYIN tends to be excellent in the mid frequencies but terrible in the low and high frequencies. It's important to document whether your algorithm is mostly aimed at vocal pitch or whether it works well on other instruments too.
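For example, a sketch of that paired test on per-track RPA scores (the arrays below are placeholders, not real results):

import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-track RPA values, one entry per track of the evaluation set
rpa_new = np.array([0.92, 0.88, 0.95, 0.90])
rpa_pyin = np.array([0.90, 0.85, 0.93, 0.91])
t_stat, p_value = ttest_rel(rpa_new, rpa_pyin)
print(t_stat, p_value)  # "significantly better" would mean p < 0.005 with a positive mean difference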

On a separate note, it would be good if you could optionally return voicing confidence.

Also, pardon me for asking, but is there a name for your algorithm? Do you want to call it "Performous" or invent a name of your own? I could see arguments in either direction.

Tronic commented 4 years ago

I am not interested in publishing a paper. The index trickery in the line you mentioned is there because, AFAIK, Numpy does not otherwise provide a useful "table sort"; the other complexities come from following librosa's ordering of dimensions and from making the algorithm fast (rather than using Python loops or repeated operations on multiple 1-d arrays).
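For illustration, what that "table sort" line does is reorder the columns of a 2 x N (freq, power) array so both rows follow ascending frequency (toy values):

import numpy as np

frame = np.array([[300.0, 100.0, 200.0],   # frequencies
                  [0.2,   0.5,   0.3]])    # corresponding powers
frame = frame[..., np.argsort(frame[0])]   # sort columns by the frequency row
print(frame)                               # columns now ordered 100, 200, 300 with matching powers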

If your aim is primarily pedagogical and academic, then it would appear I have come to the wrong place, and I will need to look at pyAudioAnalysis or some other module to host it. Thanks for your comments anyway.

Tronic commented 4 years ago

I'm not entirely sure what you mean by voicing confidence, but the returned power value is a measure that may be used as an indication of how good the fit is.

There are many ways this could be improved and extended. Rather than adding statistics to the current implementation, I'd suggest looking into separating different tones' qualities by analysis over time: singing and a steady piano note might look identical in any single frame (hitting the same frequencies), but over time the singing frequency varies (vibrato or otherwise) and the envelope curves are clearly different. With such analysis, far more useful confidence statistics could also be extracted. I believe that state-of-the-art commercial applications (Celemony Melodyne) employ this sort of analysis.
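As a toy sketch of that kind of temporal analysis only (assuming a per-frame f0 track in Hz, e.g. from bestpitch above; the window length is arbitrary):

import numpy as np

def f0_variation_cents(f0, win=20):
    """Short-time standard deviation of the f0 track, in cents."""
    cents = 1200 * np.log2(np.maximum(f0, 1e-6))
    return np.array([cents[i:i + win].std() for i in range(max(1, len(cents) - win))])

# Large values suggest vibrato or glides; values near zero suggest a steady (piano-like) tone.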

Overall, I specifically want to hand this algorithm over because I cannot invest much more time into its development.

Naming: if someone wants to pick it up and give a name to a specific implementation (especially in the harmonics combining there is plenty of room for changes, even without temporal analysis), and maybe even author a paper, feel free to pick any name you want.

bluenote10 commented 4 years ago

@Tronic Just chiming in to thank you for sharing your algorithm, much appreciated. I'm currently experimenting with it, and so far it yields very good results in my use case. Maybe once #1063 is done I can contribute a more quantitative comparison that may help to answer some of the open questions.

Requiring peer-reviewed publications is, in my opinion, a bit limiting. A high-quality algorithm that works well in practice and a theoretically novel research paper are two very different animals. There are many highly effective algorithms that are too simple to write a paper about, and theoretically nice papers that don't work so well in practice.

But I guess the maintenance aspect is something to consider.

stefan-balke commented 4 years ago

IMHO, I would not say there is no chance of including it in librosa. The library shouldn't be exclusive to approaches from the scientific community.

However, it is important to see the strengths and limitations of this approach before putting it into a library; in my experience, there is nothing uncommon about that. Follow Vincent's advice: run the algorithm on some common test dataset, evaluate it using mir_eval, and report the results. If these are promising, it's worth putting more effort into this approach.

bmcfee commented 3 years ago

It's been over a year with no activity on this, so I think we should close it out. I'll summarize the discussion here, but first, thanks @Tronic for offering to open-source this method and putting in the work to adhere to style guidelines!

It's true that we don't have a strong requirement that all methods come from published papers. However, in this case we have neither a robust baseline evaluation of the method nor a thorough document explaining how it works. I understand that @Tronic is not interested in publishing articles, but without any reference material we have little basis for documenting the code or advising users as to when the method may or may not be appropriate or preferable to other methods. That, coupled with the author's stated intent to hand over maintenance of this code, makes it somewhat unappealing to accept as a contribution.

My general advice here would be to release the implementation as a stand-alone package. If at some point it ends up gaining significant traction, we can revisit the decision to integrate it into the library.