Implementing Barwise TF matrices (Barwise-aligned features).

urinieto / msaf

Music Structure Analysis Framework

MIT License

501 stars 79 forks source link

Implementing Barwise TF matrices (Barwise-aligned features). #154

Open ax-le opened 11 months ago

ax-le commented 11 months ago

Hi! This issue aims at discussing about implementing the Barwise TF matrix in the MSAF toolbox. The Barwise TF matrix is a feature representation sampled on bars, with a fixed number of frames per bar. It was introduced in [1], is detailed in [2, Chap. 2.4.2], but, more importantly, was shown to improve segmentation results on the traditional algorithm of Foote in [3, Sec. 2.3.3]. In that regard, I believe that this representation would be a great addition for MSAF. Still, I opened this issue because implementing such a representation will not be straightforward, and would certainly require major modifications in MSAF. In particular, it should be discussed whether this representation must be computed every time (as it is the case now for beat-synced features*), or if the computation must be optional and specified by a parameter. The most relevant people for this discussion here should probably be @urinieto and @carlthome ? Have a nice day! Best, Axel.

*Edit: I may be wrong on that point, maybe I confused "default" settings with "every time"

References

[1] Marmoret, A., Cohen, J. E., & Bimbot, F. (2022, June). Barwise Compression Schemes for Audio-Based Music Structure Analysis. In Sound and Music Computing 2022. Full text: https://arxiv.org/pdf/2202.04981.pdf. [2] Marmoret, A. (2022). Unsupervised Machine Learning Paradigms for the Representation of Music Similarity and Structure (Doctoral dissertation, Université Rennes 1). Full text: https://hal.science/tel-03937846/document. [3] Marmoret, A., Cohen, J. E., & Bimbot, F. (2023). Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm. Transactions of the International Society for Music Information Retrieval (TISMIR), 6(1), 167-185. DOI: 10.5334/tismir.167. Full text: https://hal.science/hal-04323556/file/tismir-6-1-167.pdf.

urinieto commented 11 months ago

Happy to see that the bar-synced features are getting traction! Currently, the easiest way to implement this on MSAF is to add a completely new set of features here: https://pythonhosted.org/msaf/features.html#

Basically, we would need:

est_barsync
ann_barsync

This way, the user could pass timestamp of the annotated downbeats (ann) and MSAF could also provide an algorithm to estimate them (est).

That being said, when I wrote this part of MSAF (around 9 years ago!), computing features was an expensive process. Nowadays, features are typically computed on the fly (since it's so cheap), which saves a lot of disk space (MSAF's feature json files can be quite big).

So... what I would suggest is to remove the temporary storage of features and just compute them on the fly (ie, get rid of those temporary JSON files). This way, we don't need to include these new types of features on the JSON files, and backwards compatibility in future MSAF releases would be much easier. And playing around with different custom features should be much easier.

This would be a major refactor of MSAF, but I think it would be totally worth it.

What do you folks think? @carlthome and/or @ax-le are you up for the challenge? :D

ax-le commented 10 months ago

Hi @urinieto and Happy New Year!

Firstly, while it would be interesting to implement barwise synchronized features, the features coined "Barwise TF matrix" mentioned in my first message necessitate a bit more work because they consist of computing a fixed number of samples (defined by a parameter) per bar. In other terms, while barwise synchronized features contain 1 sample per bar, Barwise TF matrices contain n samples per bar (typically 96). This would need additional modifications, to cope with bar discrepancies during the song (right now, this is handled via oversampling the spectrogram and selecting regularly spaced samples in each bar, it may be debated).

Secondly, I could be down for refactoring code, but not right now! ;) Maybe it can wait as this does not seem urgent, but maybe someone would be available sooner (@carlthome ?).

urinieto commented 10 months ago

Oh gotcha, I understand. Yeah, I think this begs even further for a potential refactoring of the way MSAF takes care of the features. I would actually love to do that.. but likely not gonna happen after ISMIR 2024 organization 😅 (unless @carlthome is up for the challenge!)

ax-le commented 1 month ago

Hi @urinieto!

I hope that ISMIR 2024 is on a good track! Good luck with the last work in organizing it :)

Quick update on this issue: I talked with my good friend @Yoz0, who is interested in implementing the "Barwise TF matrix representation" on the toolbox. In the discussion, we noticed that there may be two different things to implement actually:

Bar estimation and bar-synced features: This would extend beat-synced features to bar-synced features by adding some code/toolbox for estimating bars. This might need a bit of discussion because the toolbox that I used in my Ph.D. thesis (madmom) cannot be correctly installed with pip right now. But, apart from that, it would probably not be that complicated.
Time-frequency vectors: This part consists of changing how the musical/frequency content is represented in a time interval. Right now, the default feature is the beat-synced feature, consisting of 1 sample per beat. A possible extension could be to represent each beat with n equally spaced samples. This is what I call the "Beatwise-TF" matrix in [1, Sec. 4.4]. This is agnostic of beats, bars, or any time discretization: it consists of considering the time-frequency information inside a time interval as a vector, of size n * f, with f the frequency dimension, and n the number of samples to be kept.

Barwise TF Matrix, which gives promising results in [1] would require both these developments.

In my opinion, implementing 2. is the most important and complicated part of the problem. I would be really interested to study to which extent Beatwise TF features improve the segmentation performance of the existing algorithm compared with beat-synced features, my intuition is that it will (also given results found in [1, Sec. 4.4]). However, this study has not been done yet. It could even result in a workshop paper (imo). Still, it would probably require a strong refactoring of the code as it is. @Yoz0 would be interested in implementing 2. when he finds time, would it be ok for you?

Have a nice day! Best, ax-le.

[1] Marmoret, A., Cohen, J. E., & Bimbot, F. (2023). Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm. Transactions of the International Society for Music Information Retrieval (TISMIR), 6(1), 167-185. DOI: 10.5334/tismir.167. Full text: https://hal.science/hal-04323556/file/tismir-6-1-167.pdf.

urinieto commented 1 month ago

Hi @ax-le , good to hear that you and @Yoz0 have been discussing this further!

I may not have the full context (haven't read the full publication yet), but for number 2 would it make sense to have just variable number of frames in between beats (for tempo varying tracks)? This way, we could simply generate the standard STFT/melspectrogram without having to deal with beats at all (i.e., instead of beat-sync you can use frame-sync).

Does this make sense? Happy to chat more about this!

ax-le commented 1 month ago

Hi @urinieto!

Well, you definitely could, but that's not the rationale behind this feature. These features are actually agnostic of tempo. Basically, the idea behind the TF feature (resp. barwise and beatwise) is to have exactly one feature per beat (resp. bar), with each feature contains the same number of frames (which is a parameter). Hence, if the tempo increases, you still have the same number of frames in each feature, but the gap between consecutive frames is smaller in absolute time.

Alternatively, one TF feature can be seen as an extreme case of the Serra et al. Structural Features [1], where each structural features contains exactly one beat (hence $m$ is fixed as a parameter, but $\tau$ dynamically adapts to fit exactly one beat (resp. bar), i.e. $\tau = 1/m * beat_length$). A key difference though is that there is no overlap between consecutive features in a TF representation. So each feature corresponds exactly to one beat and only one.

The advantage here is that each TF feature represents approximately the content of one beat (resp. bar). So, when you compare the similarity between features in a self-similarity matrix, you compare beats (resp. bars), as opposed to just the first frame of the beat (as in the beat-synced feature).

Moreover, in our study, we considered being agnostic of tempo to not be an issue, since we focused on pop music, where the tempo remains relatively steady inside a song. Small variations are typically inconsistencies caused by musicians being humans and not robots (and sometimes not playing to a click). In this context, we view this as a positive rather than a negative aspect.

Technically, we obtain this representation by oversampling the original spectrogram and then selecting a subset of n frames, equally spaced within each beat (resp. bar).

I hope this clarifies the TF representation!

Feel free if you still have questions! :)

Best, ax-le.

[1] Serra, J., Müller, M., Grosche, P., & Arcos, J. L. (2014). Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, 16(5), 1229-1240.

urinieto commented 1 month ago

Hi @ax-le ,

Thanks for the explanation, I think I get it now! Here's some ideas as to how to implement this:

Add TF as a new feature type: https://msaf.readthedocs.io/en/latest/features.html#adding-new-features-to-msaf
Add a parameter in the compute_beat_sync_features function, so that you can choose the number of frames per beat to be returned. So, by default fpb would be 1, so that this would be equal to the usual beat synchronous features.

Does this make sense?

ax-le commented 1 month ago

Hi @urinieto, I'm not sure to get your first point, because the TF can actually be any feature (pcp, cqt, stft, ...). It is some kind of "meta-feature", I would say? Conversely, the second point seems ok to me. What do you think, @Yoz0? We could start with it and see what troubles we encounter. Thanks a lot for your guidance and support @urinieto! Best, ax-le.

urinieto commented 1 month ago

oh right, that makes sense! Please disregard point one 🙈

In that case, we should probably add two new metafeatures, est_tf_features and ann_tf_features, here. The former would be computed from estimated beats, and the latter from annotated beats.

Maybe there's a better way of doing this, though.

Hope this helps!

Yoz0 commented 3 weeks ago

Hello !

I started working on this here. I added two new metafeatures est_beatsync_features_mfpb and ann_beatsync_features_mfpb. Wtih a frames_per_beat set to 3 for no particular reason.

I'm unsure what is the proper way to allow the user to use the new metafeatures and the number of frames per beat. My guess is that it should be a new parameter in select_features, but I'm not 100% sure.

urinieto commented 3 weeks ago

Hi, I checked the fork and it's looking good. I also think that we may want to add a new parameter to select_features, unless we can think of a more elegant way of doing this. If we're to add a new parameter, we can add it at the end, as an optional argument (something like mfpb_features=False).

A little comment: I'd rather use something more self-explanatory than mfpb to identify these "multiple frames per beat" meta-features. How about multibeat?

Finally, please make sure you add unit tests here and that all tests pass (old and new): https://github.com/Yoz0/msaf/blob/tf-matrix/tests/test_features.py

Thanks!