urinieto / harmonixset

The Harmonix Set: Beats, Downbeats, and Structural Annotations for Pop Music
MIT License

How to align the purchased audio with onsets data. #5

Closed blackpaintedman closed 3 years ago

blackpaintedman commented 4 years ago

Hi there! I'm kinda new to MIR and just getting my hands on the music structure analysis task. I was scouting around and this dataset seems reliable.

The paper from ISMIR 2019 also shows pretty promising results, so I'm wondering if they can be reproduced.

I "purchased" the audio, followed #1 and ./notebook/Audio Alignment.ipynb, and managed to get an onset DTW result (code below). But I'm confused about how to align the full audio with just 30 seconds of the original onsets data. Would you kindly demonstrate that?

BTW, what do you think of the method proposed in the ISMIR 2019 paper?

Much appreciated!

import os

import jams
import librosa
import numpy as np

def jams_to_ons(file_id):
    """Load the 30-second onset annotation from a JAMS file."""
    jfilepath = os.path.join(JAMS_PATH, file_id + ".jams")
    j = jams.load(jfilepath)
    # Annotation index 2 holds the onset track (as in the alignment notebook)
    ts = [obs.time for obs in j.annotations[2].data]
    return ts

def compute_alignment(file_id, align_thres=0.9, is_plot=False):
    """Main function to do the alignment between two songs of the same id.
    """
    # Load the purchased mp3
    purc_path = os.path.join(PURC_MP3_PATH, file_id + ".mp3")
    purc_x, _ = librosa.load(purc_path, sr=SR)

    # Compute onsets
    orig_ons_30s = jams_to_ons(file_id)
    onset_frames = librosa.onset.onset_detect(y=purc_x, sr=SR)
    purc_ons = librosa.frames_to_time(onset_frames, sr=SR)
    purc_ons = [float(round(x, 3)) for x in purc_ons]

    # Apply subsequence DTW; librosa expects feature arrays of shape (d, N)
    D, wp = librosa.sequence.dtw(X=np.array(orig_ons_30s)[np.newaxis, :],
                                 Y=np.array(purc_ons)[np.newaxis, :],
                                 metric='euclidean', subseq=True)

    # alignment_score and reconstruct_signal are helpers defined elsewhere
    score = alignment_score(wp)
    print(score)

    # Return reconstructed signal and score
    return reconstruct_signal(orig_ons_30s, purc_ons, wp), score
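For intuition, here is a toy, pure-NumPy sketch of the subsequence-DTW idea that `librosa.sequence.dtw(..., subseq=True)` implements: the query may start and end anywhere in the reference, and the cheapest matching stretch is recovered by backtracking. The function name and the exact-match demo are illustrative only, not part of the notebook.

```python
import numpy as np

def subseq_dtw(query, ref):
    """Subsequence DTW on 1-D sequences: align `query` against the
    best-matching stretch of `ref` (free start and end in `ref`).
    Returns (start, end, cost) of the matched stretch."""
    n, m = len(query), len(ref)
    D = np.full((n, m), np.inf)
    # Free start: the query may begin anywhere in the reference.
    D[0, :] = np.abs(query[0] - ref)
    for i in range(1, n):
        for j in range(m):
            best_prev = D[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = abs(query[i] - ref[j]) + best_prev
    # Free end: pick the cheapest endpoint in the last row, then backtrack.
    end = int(np.argmin(D[-1]))
    i, j = n - 1, end
    while i > 0:
        moves = [(D[i - 1, j], i - 1, j)]
        if j > 0:
            moves += [(D[i, j - 1], i, j - 1), (D[i - 1, j - 1], i - 1, j - 1)]
        _, i, j = min(moves)
    return j, end, float(D[-1, end])

# Demo: a short ramp embedded exactly inside a longer noisy reference.
ref = np.concatenate([np.random.default_rng(1).random(40),
                      np.linspace(0, 1, 20),
                      np.random.default_rng(2).random(40)])
query = np.linspace(0, 1, 20)
start, end, cost = subseq_dtw(query, ref)
print(start, end, cost)  # → 40 59 0.0 (exact match found)
```

In practice, DTW over raw onset timestamps (as in the snippet above) is fragile; running it over onset-strength envelopes or chroma features tends to be more robust.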
urinieto commented 4 years ago

This is a very good question, and unfortunately I don't have a good answer for it. The original audio from Harmonix is often time-stretched and/or has a slightly different segment structure than the audio available for purchase (which should be the same as what's available on YouTube).

One could pursue some sort of reverse-engineering strategy given the annotations and the purchased audio. But this would be a tedious task, and I'm unsure how it could be optimized.

If you (or anybody else!) has some free cycles to work on this, I'd be happy to team up, since I think the research community would highly appreciate it.

Thanks!

urinieto commented 4 years ago

Oh, and I never replied to your question: to align the first 30 seconds, you could run the same onset detection function on your purchased audio and align it using the point of maximum correlation between the two onset curves. Does this make sense?
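The maximum-correlation idea above can be sketched with a sliding normalized correlation over two onset envelopes. This is a minimal pure-NumPy illustration with synthetic envelopes; in real use you would compute them with `librosa.onset.onset_strength`, and the function name `best_offset` is hypothetical.

```python
import numpy as np

def best_offset(ref_env, full_env):
    """Slide the short reference envelope over the full-length envelope
    and return the frame offset with maximum normalized correlation."""
    ref = (ref_env - ref_env.mean()) / (ref_env.std() + 1e-8)
    n = len(ref)
    best, best_score = 0, -np.inf
    for off in range(len(full_env) - n + 1):
        win = full_env[off:off + n]
        win = (win - win.mean()) / (win.std() + 1e-8)
        score = float(np.dot(ref, win))
        if score > best_score:
            best, best_score = off, score
    return best

# Synthetic demo: the 300-frame reference is the full envelope from frame 50.
full = np.random.default_rng(0).random(1000)
ref = full[50:350]
print(best_offset(ref, full))  # → 50
```

Once the best offset is found, the 30-second annotation window can be shifted by that many frames (converted to seconds via the hop length) to line up with the purchased audio.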

blackpaintedman commented 4 years ago

Ah, thanks for the response. Yeah, I'm going to do some data cleansing and see if I can get some useful segmentation data. Interestingly, a lot of the audio from YouTube includes the music video's intro and is much longer than the original track. I'll keep this updated if I make significant progress.

The annotation work is indeed tedious. I guess a solution is also needed for the copyright limits on the audio. I'll definitely let you know if I find one!

urinieto commented 3 years ago

Good news! We just released the mel-scale spectrograms for the full dataset. You can find them here: https://www.dropbox.com/s/zxnqlx0hxz0lsyc/Harmonix_melspecs.tgz?dl=0

I'm aware that this does not explicitly answer your question, but we hope these data are enough for you to work with the dataset. Let us know if you have any questions/comments!
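A minimal sketch of working with one of these spectrograms, assuming each track ships as a NumPy `.npy` array of shape `(n_mels, n_frames)`. That layout is an assumption (check the archive's README); the demo writes and reads a synthetic array in place of a real file.

```python
import os
import tempfile

import numpy as np

# Stand-in for one downloaded file; the filename and shape are assumptions.
tmp = os.path.join(tempfile.mkdtemp(), "0001_track.npy")
np.save(tmp, np.random.default_rng(0).random((80, 1200)))

mel = np.load(tmp)            # assumed shape: (n_mels, n_frames)
log_mel = np.log1p(mel)       # simple log compression, common for modeling
print(mel.shape)              # → (80, 1200)
```

From here the log-mel matrix can feed a structure-analysis model directly, sidestepping the audio-alignment problem discussed above.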