spotify / basic-pitch

A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
https://basicpitch.io
Apache License 2.0

Support for drums/percussion? #30

Open tripathiarpan20 opened 2 years ago

tripathiarpan20 commented 2 years ago

Hi! Thanks for this amazing open-source work, I'm really enjoying using it. :)

I noticed that Basic Pitch works great on tracks with a single mono- or polyphonic instrument for most instrument types; however, it is unable to encode drums at all.

I understand that MIDI encoding for drums/percussion instruments is somewhat different from that of pitched instruments, but are there any future plans to add support for percussion instruments?
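
For reference, this is roughly how I'm testing it (a minimal sketch; `drum_loop.wav` is a placeholder for any percussion-only clip):

```python
# Minimal sketch: run Basic Pitch on a drum-only clip and count the notes.
from basic_pitch.inference import predict

# predict() returns the raw model outputs, a pretty_midi.PrettyMIDI object,
# and a list of note events.
model_output, midi_data, note_events = predict("drum_loop.wav")

# For percussive input this comes back (nearly) empty, since the model is
# trained to track pitched notes rather than drum hits.
print(f"{len(note_events)} note events transcribed")
```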

jugoodma commented 2 years ago

@tripathiarpan20 -- I found your comment interesting, so I took a short dive into the literature.

There's a niche, and interesting, sub-sub-field of Music Information Retrieval (MIR) called Automatic Drum Transcription (ADT). Here's a literature review of ADT. The authors of that review describe different "drum transcription tasks" -- drum-only transcription (DTD) and drum-plus-accompaniment transcription (DTM) seem particularly relevant.

If you want to "solve" drum encoding, you could look at some of the methods in the more recently referenced papers in the mentioned literature review and give them a try! Ref 80 appeared to score well, but might not work for drum kits with more than a kick, snare, and hi-hat. The authors of ref 80 also have a GitHub repo and a demo site linked!

For another approach, you might find https://github.com/magenta/mt3 interesting/useful. Unfortunately, the related paper doesn't focus too heavily on drums, so you might find the mt3 model doesn't work that well for drum transcription.

Finally, perhaps we could make use of Facebook's demucs. This model is seemingly SOTA for demixing audio tracks, so we can use it to separate out the drums stem of a track. This turns a DTM task into a DTD task quite effectively (and thus, in my opinion, makes solving ADT easier). Unfortunately, this somewhat disregards the call-to-action in the NMP/basic-pitch paper -- to encourage low-resource models in future research. Maybe we can trim down the demucs model? Regardless, perhaps we could then train the NMP model on a drum-specific dataset, like E-GMD. We could then compose the architectures like so:

```
                demucs                   NMP(E-GMD)
original track -------> drum-only track -----------> drum-only MIDI
```
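
In code, that composition might look something like this (an untested sketch assuming the demucs CLI and the current Basic Pitch Python API; the stock Basic Pitch model stands in for the hypothetical E-GMD-retrained one, and the separated-stem path depends on the demucs model/version):

```python
# Sketch: demucs isolates the drum stem, then Basic Pitch transcribes it.
import subprocess
from pathlib import Path

from basic_pitch.inference import predict

track = "original_track.mp3"  # placeholder input

# Stage 1: demixing. --two-stems=drums writes drums.wav and no_drums.wav.
subprocess.run(["demucs", "--two-stems=drums", track], check=True)
# The output folder name ("htdemucs" here) depends on the demucs version.
drum_stem = Path("separated/htdemucs") / Path(track).stem / "drums.wav"

# Stage 2: transcription of the drum-only stem. A drum-retrained NMP model
# would replace the stock one here.
model_output, midi_data, note_events = predict(str(drum_stem))
midi_data.write("drum_only.mid")  # midi_data is a pretty_midi.PrettyMIDI
```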

I'll give this a try and post the results. Luckily, since NMP is so light, it probably trains much faster than huge models. And who knows, maybe demucs isn't even needed. Or maybe this entire approach won't work! It's all part of the scientific method 😄

rabitt commented 2 years ago

> are there any future plans to add support for percussion instruments?

@tripathiarpan20 no plans at the moment, but will let you know if that changes. @jugoodma's comment is great, and points to some open-source drum transcription options. Here are two more open-source systems I'm aware of:

1. "Increasing Drum Transcription Vocabulary Using Data Synthesis" by Cartwright et al. [paper] [code]
2. "Towards Multi-Instrument Drum Transcription" by Vogl et al. [paper] [code]

tripathiarpan20 commented 2 years ago

Hi @jugoodma and @rabitt, thank you for the amazing feedback!

To be frank, I am not familiar with how the instrument class is predicted in the NMP pipeline, but if retraining Basic Pitch's architecture on a drum dataset for DTD works, along with devising suitable posteriorgram post-processing, I believe it would make the domain of instruments covered by this project truly whole (afaik).
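
For illustration, that post-processing could be as simple as peak picking over a drum onset posteriorgram (a hypothetical sketch; the array layout and the class-to-drum mapping are my assumptions, not Basic Pitch's actual outputs):

```python
# Hypothetical sketch: turn a drum onset posteriorgram (frames x classes)
# into timestamped General MIDI drum hits via simple peak picking.
import numpy as np

GM_DRUM_NOTES = {0: 36, 1: 38, 2: 42}  # kick, snare, closed hi-hat


def pick_drum_onsets(posteriorgram: np.ndarray, frame_rate: float,
                     threshold: float = 0.5) -> list:
    """Return sorted (time_in_seconds, midi_note) pairs."""
    hits = []
    for cls, midi_note in GM_DRUM_NOTES.items():
        activation = posteriorgram[:, cls]
        for t in range(1, len(activation) - 1):
            # an onset is a frame that clears the threshold and is a local peak
            if (activation[t] >= threshold
                    and activation[t] >= activation[t - 1]
                    and activation[t] > activation[t + 1]):
                hits.append((t / frame_rate, midi_note))
    return sorted(hits)
```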

Good luck with the process and keep us updated :D. The DTD task seems to be the relevant one in the context of Basic Pitch (which deals with polyphonic recordings of a single instrument class). demucs shouldn't be required, given its high inference time and the availability of the E-GMD dataset, plus conversion of MIDI to drum audio tracks with suitable soundfonts and label-preserving data augmentation.
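
For instance, rendering E-GMD-style MIDI through a soundfont to produce paired (audio, label) training examples might look like this (a sketch assuming pretty_midi with pyfluidsynth installed; file paths are placeholders):

```python
# Sketch: synthesize a drum-only training clip from MIDI with a soundfont.
import pretty_midi
import soundfile as sf

midi = pretty_midi.PrettyMIDI("egmd_example.mid")  # placeholder MIDI file
audio = midi.fluidsynth(fs=44100, sf2_path="drum_kit.sf2")  # render waveform
sf.write("egmd_example.wav", audio, 44100)  # paired with the MIDI as labels
```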

Elsewhere, I also tried demucs on Psychosocial (Slipknot) and ran basic-pitch on the demixed drum track, which is how I eventually came to raise this issue. Although demucs has amazing performance, its inference times are relatively high (typically minutes per track).

Meanwhile, perhaps Spotify could develop a lightweight demixing model in the future, one that might benefit from end-to-end deep learning and from using the CQT for preprocessing (rather than the Mel spectrograms used in past demixing methods)? It might be a bit of a stretch, as my understanding of spectrograms, past demixing models, and NMP has missing pieces. I would especially like to hear @rabitt's thoughts on the feasibility of such a lightweight demixing model, and whether there would be any benefit to formulating it as an end-to-end (demixing + transcription) task.
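
To make the CQT idea concrete, the front end I have in mind is roughly the following (a sketch assuming librosa; the file path and CQT parameters are placeholders):

```python
# Sketch of a CQT front end for a lightweight demixing model.
import librosa
import numpy as np

y, sr = librosa.load("original_track.wav", sr=22050)  # placeholder path
# 84 bins at 12 bins/octave = 7 octaves on a log-frequency axis
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))
log_cqt = librosa.amplitude_to_db(cqt)  # (n_bins, n_frames) model input
```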

Any feedback from anyone else is welcome too!

sslupsky commented 6 months ago

@jugoodma Did you get around to attempting retraining as described above?

```
                demucs                   NMP(E-GMD)
original track -------> drum-only track -----------> drum-only MIDI
```