urinieto / harmonixset

The Harmonix Set: Beats, Downbeats, and Structural Annotations for Pop Music

Annotations don't align with youtube downloaded tracks #9

Closed · alexvoina closed 3 years ago

alexvoina commented 3 years ago

Hi,

I'd like to use this dataset to train one of my networks. I managed to download all the audio using download_youtube_mp3s.py, but the annotations in dataset/beats_and_downbeats seem to be off. I'm using Sonic Visualiser to overlay the annotations on top of the audio; I've randomly chosen tracks from the dataset, and none of them appears to be correctly annotated (not even the ones marked with a 1.0 alignment score). I can't make use of Harmonix_melspecs.tgz because my preprocessing chain is different.

I understand this problem might be caused by the use of different codecs or compression formats, and the Audio Alignment notebook is there to compensate for that, but it looks to me like the alignment code depends on some local files that aren't available to the public.

I'd like to be able to experiment with this dataset. For that I would need either new beat/downbeat annotations that match the audio I've just downloaded from YouTube, or a Dropbox link to the files for which the annotations in dataset/beats_and_downbeats are correct. Having both in the same place would be great.

Thanks

urinieto commented 3 years ago

Unfortunately, there might be some differences between some of the YouTube versions and the ones used to make the actual annotations, and I can't distribute the original audio files due to copyright issues.

Nevertheless, I am happy to provide additional spectral features computed from the original audio files. Please let me know what type of features you need (with their specific parameters) and I will create a new link if that works for you.

alexvoina commented 3 years ago

I guess this works, but it doesn't give me much room for experimentation. Currently I'm using the preprocessing chain from madmom's RNNDownBeatProcessor, namely a multi-resolution (1024, 2048, 4096) filtered log spectrogram with the spectral fluxes stacked on top.

downbeats.py

    import numpy as np

    from madmom.audio.signal import SignalProcessor, FramedSignalProcessor
    from madmom.audio.spectrogram import (
        FilteredSpectrogramProcessor, LogarithmicSpectrogramProcessor,
        SpectrogramDifferenceProcessor)
    from madmom.audio.stft import ShortTimeFourierTransformProcessor
    from madmom.processors import ParallelProcessor, SequentialProcessor

    # define pre-processing chain
    sig = SignalProcessor(num_channels=1, sample_rate=44100)
    # process the multi-resolution spec & diff in parallel
    multi = ParallelProcessor([])
    frame_sizes = [1024, 2048, 4096]
    band_counts = [3, 6, 12]
    for frame_size, num_bands in zip(frame_sizes, band_counts):
        frames = FramedSignalProcessor(frame_size=frame_size, fps=100)
        stft = ShortTimeFourierTransformProcessor()  # caching FFT window
        filt = FilteredSpectrogramProcessor(
            num_bands=num_bands, fmin=30, fmax=17000, norm_filters=True)
        spec = LogarithmicSpectrogramProcessor(mul=1, add=1)
        diff = SpectrogramDifferenceProcessor(
            diff_ratio=0.5, positive_diffs=True, stack_diffs=np.hstack)
        # process each frame size with spec and diff sequentially
        multi.append(SequentialProcessor((frames, stft, filt, spec, diff)))
    # stack the features and process everything sequentially
    pre_processor = SequentialProcessor((sig, multi, np.hstack))

It should yield a 1D vector of 314 values for each time frame :D
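For reference, the whole chain can be run directly on a file path to sanity-check the output shape (a quick sketch; audio.mp3 is just a placeholder):

    # madmom processors are callable; passing a path runs the full chain
    feats = pre_processor('audio.mp3')  # placeholder file
    print(feats.shape)  # expected: (n_frames, 314), one frame every 10 ms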

Thanks a lot for your time and for replying so quickly

urinieto commented 3 years ago

Great, thanks for sharing. I created a little script to compute these features, I'm running it right now: https://github.com/urinieto/harmonixset/blob/master/src/compute_madmom_audio_features.py

Once I have all the features computed and uploaded, I'll post the link here.

alexvoina commented 3 years ago

Thanks Oriol! Looking forward to getting my hands on that data 😄

urinieto commented 3 years ago

Phew, this took some time to upload. Here you have them: https://www.dropbox.com/s/ayd15svnibqvcjw/Harmonix_set_madmom_features_20210715.tgz?dl=0 :)

alexvoina commented 3 years ago

It took some time to download too! Safari just couldn't take it :)) I'll do some alignment checks and let you know if everything looks good. Thanks again for your time!

alexvoina commented 3 years ago

Hi Oriol,

I'm trying to verify the validity of the annotations & features, and I've noticed on some randomly picked files that some of them seem to be a bit ahead.

Here I'm plotting the feature matrix for 0348_blackout-seq.npy. The pictures zoom in on the bottom part (y-axis): the 1024 log-filtered spectrogram, which should be the most "time accurate". Is this kind of "error" expected within your dataset?

[image: 348_blackout]

[image: 348_blackout_zoom]

Here's the code I'm using to print:

    import numpy as np
    import matplotlib.pyplot as plt

    # read the annotation file (tab-separated, beat time in the first column)
    file = open(PATH_TO_ANNOTATION_FILE)
    lines = file.readlines()
    file.close()

    beat_indexes = []
    for l in lines:
        parts = l.split('\t')
        beatpos = float(parts[0])  # taking just the time instants
        # divide beatpos by 10 ms (the frame period) to find the beat's frame index
        beat_indexes.append(np.round(beatpos / 0.01))

    beat_indexes = np.array(beat_indexes, dtype='int')
    print(beat_indexes)

    # mark the beat frames in the feature matrix with the maximum value
    features = np.load(PATH_TO_FEATURES_FILE)
    max_value = np.max(features)
    features[beat_indexes, :] = max_value

    plt.imshow(features.T, aspect='auto', origin='lower')
    plt.show()

*Note that I'm rounding to decide which frame each beat position (given in seconds) corresponds to; since frames are 10 ms apart, that may introduce an error of at most 5 ms.

urinieto commented 3 years ago

Hey Alex, that's an interesting find! That is not expected (i.e., the annotations shouldn't have any type of (negative) offset). If you have the time, I'd encourage you to create a PR with the corrections for a future Harmonix Set version.

alexvoina commented 3 years ago

I will need to take a bit more time to look at more files and conclude, but I'm happy to open a PR when ready. Btw, before computing the madmom features, did you check that the annotations in dataset/beats_and_downbeats were indeed properly aligned with your waveforms?

For instance, I would be curious to see where the beat markers land on this section of your "0348_blackout.mp3". In case you are not aware of it, a very easy way to test this is with Sonic Visualiser: drag the audio file into the window, then drag the annotations text file in as-is, and it will parse the rows and columns. You can then choose to ignore the segment or beat-numbering columns, or use them as labels.

[image: Screenshot 2021-07-22 at 09 14 30]

MCMcCallum commented 3 years ago

Hi Alex,

This could be due to decoder offset. Depending on the decoder you are using to load the samples from the compressed audio files, a different amount of zero padding can be inserted at, or truncated from, the beginning of the file. This is evident when running certain versions of librosa on, say, Linux vs. macOS: on macOS it will use the CoreAudio decoder, whereas on Linux it will use the FFmpeg decoder. You actually see the same difference in offsets when loading the same audio file via WebAudio in different browsers (e.g., Chrome vs. Safari vs. Firefox), depending on the codec and bitrate used. All results in the dataset were produced using FFmpeg on Linux IIRC. Perhaps the features were produced on a different platform?
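If you want to quantify it, cross-correlating two decodes of the same track should reveal the constant lag. A minimal sketch, assuming you already have the same track decoded by two different decoders and saved as wavs (the file names are placeholders):

    import numpy as np
    import librosa
    from scipy.signal import correlate

    # the same track, decoded by two different decoders (placeholder paths)
    a, sr = librosa.load('decode_ffmpeg.wav', sr=44100, mono=True)
    b, _ = librosa.load('decode_coreaudio.wav', sr=44100, mono=True)

    # cross-correlate the first 30 seconds to estimate the global lag
    n = min(len(a), len(b), 30 * sr)
    xcorr = correlate(a[:n], b[:n], mode='full', method='fft')
    lag = int(np.argmax(xcorr)) - (n - 1)  # > 0 means 'a' is delayed w.r.t. 'b'
    print('offset: %d samples (%.1f ms)' % (lag, 1000.0 * lag / sr))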


urinieto commented 3 years ago

Yeah, I produced these features using macOS (with the madmom version specified in the info.json file inside the tgz above). It would be OK if all the offsets were consistent (that's an easy fix, and likely due to the different decoders, as Matt mentioned), but if I understand correctly, this offset only happens in some of the annotations?
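If it really were a constant offset, the fix would be as simple as shifting the first column of each annotation file, something like this (a sketch; the offset value is just a placeholder, not a measured one):

    # shift all annotated times by a known, constant offset
    OFFSET_S = 0.026  # hypothetical value; would come from measuring the decoder lag
    with open('0348_blackout.txt') as f:
        rows = [line.rstrip('\n').split('\t') for line in f if line.strip()]
    for row in rows:
        row[0] = '%.6f' % (float(row[0]) + OFFSET_S)
    with open('0348_blackout_shifted.txt', 'w') as f:
        f.write('\n'.join('\t'.join(row) for row in rows) + '\n')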

alexvoina commented 3 years ago

Hi Matt, Oriol

Sorry for the late reply.

Nope, the offset might actually be present in all the files, and it is probably because of the decoder (thanks @MCMcCallum); I've encountered this issue in the past. I randomly chose some songs with a clean beat line so that the test would be more obvious.

If the annotations in dataset/beats_and_downbeats were made using FFmpeg on Linux, then @urinieto should see the offset too when loading the mp3 and laying the annotations on top.

From my experience with the decoding padding/truncation issue, the offset did not have a constant value: sometimes I'd see 1058 samples, sometimes 48, so I'm afraid applying a correction algorithm would be error-prone. Perhaps @MCMcCallum can help with a link to the features computed on a Linux machine?
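For what it's worth, here's roughly what such a per-file estimate might look like: cross-correlating an onset-strength envelope of the downloaded audio with an impulse train built from the annotated beats. This is an untested sketch, the paths are placeholders, and the beat-spaced side peaks are exactly why I think a blind correction is risky:

    import numpy as np
    import librosa

    FPS = 100  # annotation/feature frame rate (10 ms frames)

    # onset-strength envelope of the downloaded audio at 100 fps
    y, sr = librosa.load('0348_blackout.mp3', sr=44100)
    env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=sr // FPS)

    # impulse train from the annotated beat times (first column of the file)
    beats = np.loadtxt('0348_blackout.txt', usecols=(0,))
    train = np.zeros_like(env)
    idx = np.round(beats * FPS).astype(int)
    train[idx[idx < len(train)]] = 1.0

    # only search small lags; the beat-spaced side peaks are what would make
    # a blind correction error-prone in the first place
    max_lag = FPS // 2  # +/- 0.5 s
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(np.roll(train, k), env) for k in lags]
    print('estimated offset: %d ms' % (10 * lags[int(np.argmax(scores))]))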

Thanks!