rakuri255 / UltraSinger

AI-based tool that extracts vocals, lyrics, and pitch from music to auto-generate UltraStar Deluxe, MIDI, and note files. It automatically taps notes, adds the text, pitches the vocals, and creates karaoke files.
MIT License

Optimise pitch #73

Open rakuri255 opened 1 year ago

rakuri255 commented 1 year ago

What we need in any case is a short test file where the notes are very clear.

Then the whole chain has to be checked, because every editing step could change the sound.

Also, activate --plot True; this makes it easier to analyse.

Originally posted by @rakuri255 in https://github.com/rakuri255/UltraSinger/issues/40#issuecomment-1596826359

BWagener commented 1 year ago

I have a couple of ideas regarding this topic.

The step which chooses a pitch for a syllable / word currently picks the most common note within the word's start-end time.

This works well enough for most syllables. If the syllable has one or more shifts in pitch, however, the result is still a single note.

E.g., consider the single-syllable word "test" being sung continuously: first at the pitch A2 for 2s, then shifting up to B2 and sustaining that pitch for 2s, and lastly shifting back to A2 for another 2 seconds.

The result with the current implementation is A2 for 6s with the text "test "

Whereas a more accurate result would be:

A2 for 2s with the text "test"
B2 for 2s with the text "~"
A2 for 2s with the text "~ "
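To make the problem concrete, here is a minimal sketch (not UltraSinger's actual code) of picking the most common note for a word from per-frame pitch names; the 10 ms hop size is an assumption for illustration:

```python
from collections import Counter

def most_common_note(frame_notes: list[str]) -> str:
    """Return the note that occurs most often within a word's frames."""
    return Counter(frame_notes).most_common(1)[0][0]

# 2 s of A2, 2 s of B2, 2 s of A2 at an assumed 10 ms hop size:
frames = ["A2"] * 200 + ["B2"] * 200 + ["A2"] * 200
print(most_common_note(frames))  # -> "A2" for the whole 6 s word "test"
```

The pitch shift to B2 is simply outvoted, so the entire word collapses to a single A2 note.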

There are a couple of approaches to achieve this. I have a very rudimentary implementation as a proof of concept on this branch: feature/pitch-clustering

This approach simply splits every syllable into segments of a fixed size and lets the current implementation pick the most common pitch for each segment. Finally, all adjacent segments where the pitch didn't actually change are merged back together.
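A rough re-sketch of that split-and-merge idea (not the actual code on the feature/pitch-clustering branch; the segment size and hop are assumptions):

```python
from collections import Counter

HOP_S = 0.01          # assumed frame hop in seconds
SEGMENT_FRAMES = 50   # assumed fixed segment size (0.5 s)

def split_and_merge(frame_notes, start_s):
    """Split a word's frames into fixed segments, pick the most common
    note per segment, then merge adjacent segments with the same note."""
    segments = []
    for i in range(0, len(frame_notes), SEGMENT_FRAMES):
        chunk = frame_notes[i:i + SEGMENT_FRAMES]
        note = Counter(chunk).most_common(1)[0][0]
        seg_start = start_s + i * HOP_S
        seg_end = seg_start + len(chunk) * HOP_S
        if segments and segments[-1][2] == note:
            # same pitch as the previous segment: extend it
            segments[-1] = (segments[-1][0], seg_end, note)
        else:
            segments.append((seg_start, seg_end, note))
    return segments  # [(start_s, end_s, note), ...]

frames = ["A2"] * 200 + ["B2"] * 200 + ["A2"] * 200
print(split_and_merge(frames, start_s=0.0))
# -> [(0.0, 2.0, 'A2'), (2.0, 4.0, 'B2'), (4.0, 6.0, 'A2')]
```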

There are surely better ways to track pitch changes within a syllable.

A perfect implementation would take into account the frequency, confidence, time and volume of every datapoint (I'm sure there are more dimensions than I can think of right now). The task then is to find and parametrize a clustering algorithm which produces the best results.
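To make the clustering idea concrete, here is a hypothetical sketch that treats every pitch frame as a weighted, multi-dimensional datapoint. scikit-learn's DBSCAN is just one candidate algorithm; the weights, eps and min_samples are exactly the kind of parameters the grading framework below would tune. None of this is existing UltraSinger code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_frames(times, freqs_hz, confidences, volumes,
                   w_time=1.0, w_freq=1.0, w_conf=0.3, w_vol=0.3, eps=0.5):
    # Convert frequency to MIDI note numbers so pitch distance is
    # measured in semitones rather than raw Hz.
    midi = 69 + 12 * np.log2(np.asarray(freqs_hz) / 440.0)
    features = np.column_stack([
        w_time * np.asarray(times),
        w_freq * midi,
        w_conf * np.asarray(confidences),
        w_vol * np.asarray(volumes),
    ])
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(features)
    return labels  # one cluster label per frame, -1 = noise
```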

To be able to assess the result of a clustering algorithm (and its specific parametrization), I suggest developing a grading framework: we input high-quality, manually transcribed songs and compare the output of the clustering algorithm to the notes and their timings in the manual transcription.

This could be done at scale with many "known good" songs. For every song we cache its frequency and confidence data (determined using crepe) in addition to all other potentially relevant dimensions for clustering; this saves a lot of time when testing many different algorithm/parameter pairs.
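A grading function along those lines could start as small as the sketch below. The segment format and the scoring rule (time overlap with matching pitch, divided by total reference time) are assumptions for illustration, not an existing metric of the project.

```python
def grade(predicted, reference):
    """Score predicted note segments against a manual transcription.
    Segments are (start_s, end_s, midi_pitch) tuples."""
    matched = 0.0
    total = sum(end - start for start, end, _ in reference)
    for r_start, r_end, r_pitch in reference:
        for p_start, p_end, p_pitch in predicted:
            if p_pitch == r_pitch:
                overlap = min(r_end, p_end) - max(r_start, p_start)
                if overlap > 0:
                    matched += overlap
    return matched / total if total else 0.0

reference = [(0.0, 2.0, 45), (2.0, 4.0, 47), (4.0, 6.0, 45)]  # A2, B2, A2
predicted = [(0.0, 6.0, 45)]                                   # one long A2
print(f"{grade(predicted, reference):.0%}")  # -> 67%
```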

With a framework like that in place the final result of a test could look like:

| Algorithm | Parameter A (e.g., priority given to frequency proximity) | Parameter B (e.g., priority given to time proximity) | Parameter C | Score (determined by comparing overlap/disjunction to the "known good") |
|---|---|---|---|---|
| Algorithm A | 1 | 0.5 | x | 60% |
| Algorithm A | 0.8 | 0.5 | x | 65% |
| Algorithm A | 0.7 | 0.3 | x | 75% |
| Algorithm B | 1 | 0.5 | x | 80% |
| Algorithm B | 0.8 | 0.5 | x | 83% |
| Algorithm B | 0.7 | 0.3 | x | 90% |

Is this even worth it?

Considering the original example of the "test" syllable at the top, the rudimentary implementation addresses it well enough. So why make the effort to optimize edge cases where a long syllable with pitch shift might not be perfectly tracked with a simple implementation?

My reason is the following idea:

Currently the pipeline of the program looks as follows:

Split vocals (+ denoise) -> Speech2Text -> Hyphenation -> Pitch detection (+ determine most common note for each word)

And I think it already produces impressive results!

Keeping in mind that the most time-consuming part of manual transcription is perfectly timing and pitching each line of a US txt file, wouldn't it be good to have the automation of this part of the work perfected? What if the approach of finding appropriate pitches for each transcribed syllable were inverted? Meaning: first we produce a list of notes, each with their start and end times, and then we find the appropriate syllables for these times from the transcription data!

My theory is that a good clustering of the frequency data crepe produces (plus the other discussed dimensions) should deliver much more accurate note timings than using the transcribed words' timings as a baseline. This is partly because the word timings are not perfect; I often see a word being cut short compared to what the singer actually sings.
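A sketch of the inverted direction, under the assumption that clustering has already yielded note segments: each detected note gets the transcribed syllable with the largest time overlap, and further notes inside the same syllable become "~" continuations. Purely illustrative, not project code.

```python
def assign_syllables(notes, syllables):
    """notes: [(start, end, pitch)], syllables: [(start, end, text)]."""
    lines = []
    last_text = None
    for n_start, n_end, pitch in notes:
        overlaps = [(min(n_end, s_end) - max(n_start, s_start), text)
                    for s_start, s_end, text in syllables]
        best_overlap, text = max(overlaps)
        if best_overlap <= 0:
            continue                      # note outside any transcribed word
        if text == last_text:
            text_out = "~"                # pitch change within the same syllable
        else:
            text_out = text
            last_text = text
        lines.append((n_start, n_end, pitch, text_out))
    return lines

notes = [(0.0, 2.0, 45), (2.0, 4.0, 47), (4.0, 6.0, 45)]
syllables = [(0.0, 6.0, "test")]
print(assign_syllables(notes, syllables))
# -> [(0.0, 2.0, 45, 'test'), (2.0, 4.0, 47, '~'), (4.0, 6.0, 45, '~')]
```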

Another way to think about it: after vocal separation, any part that remains in the audio with a reasonable minimum volume, a reasonably stable pitch and reasonably high confidence should be a part we'd want to sing! And in my opinion, the most accurate way to identify these parts is a clustering algorithm with parameters optimized for our specific domain.
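As a simpler complement to full clustering, that filtering idea could be sketched with plain thresholds (confidence and volume only here; a pitch-stability check could be layered on top). The threshold values are arbitrary assumptions.

```python
import numpy as np

def singable_regions(times, confidences, volumes,
                     min_conf=0.6, min_vol=0.05, max_gap_s=0.1):
    """Keep frames above the thresholds and group contiguous survivors,
    tolerating gaps shorter than max_gap_s."""
    keep = (np.asarray(confidences) >= min_conf) & (np.asarray(volumes) >= min_vol)
    regions, start = [], None
    for t, k in zip(times, keep):
        if k and start is None:
            start, last = t, t            # open a new region
        elif k:
            last = t                      # extend the current region
        elif start is not None and t - last > max_gap_s:
            regions.append((start, last)) # gap too long: close the region
            start = None
    if start is not None:
        regions.append((start, last))
    return regions  # [(start_s, end_s), ...] candidate singing parts
```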

Cheers!

rakuri255 commented 1 year ago

Wow, a really good idea and description! Thanks for your effort! Yes, I agree with you completely; it has always been my goal that the output is as accurate as possible, and your idea is exactly the right way to get there.

I had been thinking about a unit test that always checks certain edge cases, e.g. that different pitches in a long sung "A" are recognized, which exactly reflects your description. If we have this unit test, we can try different algorithms and optimize against it.

The same goes for long pauses within a word, which are currently represented as a single sung note. E.g., if we take "test" and sing it as "Te~~........eee......eeest", where "..." is a pause, everything is again taken as a single note, just as you described.

Writing the unit test would not be a problem, but as you mentioned, getting quality sound material is. And above all, it should be Creative Commons.
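One way around the licensing problem could be purely synthetic fixtures: sine tones need no Creative Commons source and make the expected notes unambiguous. A hedged sketch, assuming numpy and soundfile are acceptable test dependencies and that WAV fixtures are a workable input for the pipeline:

```python
import numpy as np
import soundfile as sf  # assumed test dependency

SR = 16000

def tone(freq_hz, seconds):
    """A plain sine tone at the given frequency."""
    t = np.arange(int(SR * seconds)) / SR
    return 0.5 * np.sin(2 * np.pi * freq_hz * t)

def silence(seconds):
    return np.zeros(int(SR * seconds))

# Edge case 1: a long note that shifts pitch mid-way.
# Expected result: A2 (110 Hz) for 2 s, B2 (~123.47 Hz) for 2 s, A2 for 2 s.
pitch_shift_case = np.concatenate([tone(110.0, 2), tone(123.47, 2), tone(110.0, 2)])

# Edge case 2: the "Te~~...eee...eeest" case, i.e. notes separated by rests.
pause_case = np.concatenate([tone(110.0, 1), silence(0.5), tone(110.0, 1),
                             silence(0.5), tone(110.0, 1)])

sf.write("pitch_shift_case.wav", pitch_shift_case, SR)
sf.write("pause_case.wav", pause_case, SR)
```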

I'm going to take a look at your branch. From what I have read, it will be a very good first step in terms of output quality!