Hi @jarredou,
we published the source code, and the paper is now on arXiv! It is just a preprint, though; we have submitted the paper to Pattern Recognition Letters. You can take a look!
Best
Thanks! Great paper already! I don't see any training code; have you planned to release it as well (maybe when the final paper is published)?
PS: I've made a quick Google Colab adaptation for inference: https://github.com/jarredou/larsnet-colab
Hi @jarredou,
We developed LarsNet using an early version of StemGMD. Therefore, it will take us some time to refactor the code and have it working with the off-the-shelf version available on Zenodo.
We plan to release the training code soon. I am sure it will be available by the time the article is published. In the meantime, we added a section to the README.
Thanks for the colab, it's a great idea!
The inference speed is really mind-blowing, even on CPU. That's amazing, congrats on that!
About the quality: do you think the baseline models would perform better with more training epochs? Because 22 epochs seems quite low from the outside, and some projects like drumsep (Demucs-based, with a smaller, private dataset but more sound diversity) are getting quite good results, probably with more training epochs per model. What do you think?
Hi @jarredou,
We process 110k clips per epoch; with a batch size of 24, this corresponds to just above 4500 batches. This means that each U-Net model is trained for about 100k steps, which is pretty standard. After 100k steps, the validation loss had already stopped decreasing, so I reckon we'd need more than increasing the number of epochs to improve the output quality.
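For reference, a quick back-of-the-envelope check of those numbers (the 22-epoch figure comes from earlier in the thread):

```python
# Rough sanity check of the training budget quoted above
clips_per_epoch = 110_000
batch_size = 24
epochs = 22  # figure mentioned earlier in the thread

batches_per_epoch = clips_per_epoch // batch_size  # 4583 batches per epoch
total_steps = batches_per_epoch * epochs           # 100826 steps, i.e. ~100k
print(batches_per_epoch, total_steps)
```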
We already have a few ideas for a v2: most importantly, adding synthetic drums to the dataset, but also improving robustness to stereo imaging, which we noticed can sometimes cause problems. Which artifacts are you most concerned with?
I'll try and take a look at drumsep in the next few days. (We were not aware there was already a drums demixing model out there, thanks for the heads-up!)
I can't speak for everybody, but for most of my own use cases I prefer separations with occasional bleed but a full-sounding target stem over separations with no bleed but missing content in the target stem (with an "underwater"-like sound on some parts). Occasional bleed is easier to remove with automatic/manual post-processing; missing content is much harder to handle. But I know that other people prefer it the other way around.
Using Demucs like drumsep did is a good idea, because until recently (see next message) Demucs was the best open-source architecture for separating drums from a full mixture, better than KUIELab's TFC-TDF-Net when trained on the same dataset.
(The original download link for the drumsep model is dead, but it was later shared in this issue: https://github.com/inagoy/drumsep/issues/3#issuecomment-1807007531. It was a student project, so there is no related publication.)
Side note: lucidrains has open-sourced an implementation of SAMI-ByteDance's work (the current SOTA in music source separation, by quite a big margin): https://github.com/lucidrains/BS-RoFormer/
You may also find this work interesting; it aims at enhancing source-separated audio with a GAN: https://github.com/interactiveaudiolab/MSG
Sure thing! Demucs is arguably a better architecture than our Spleeter-like model. Nevertheless, at this point, we mainly wanted to showcase StemGMD by releasing a baseline for future research. This is why we decided to start from a simpler architecture. We will try better architectures as we go forward!
For what concerns bleed vs. hard separation artifacts, you may want to play around with the α-Wiener filter. We noticed that choosing α < 1 may sometimes lead to more natural-sounding stems while allowing for more bleed. You can try specifying the option -w with a value of 0.5: this nonlinearly modifies the masks by applying square-root compression. Namely, you could run
$ python separate -i path/to/input/folder -o path/to/output/folder -w 0.5
The best α really depends on the input track, but it's worth trying different values as it may produce more appealing results.
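To make the idea concrete, here is a minimal NumPy sketch of α-Wiener-style mask compression; the function and variable names are mine for illustration and are not taken from the LarsNet codebase:

```python
import numpy as np

def alpha_wiener_masks(stem_mags, alpha=0.5, eps=1e-8):
    # Illustrative sketch (not the actual LarsNet implementation).
    # stem_mags: array of shape (n_stems, freq, time) holding the estimated
    # magnitude spectrogram of each drum stem.
    # alpha = 0.5 applies square-root compression, flattening the masks:
    # stems tend to sound fuller and more natural, at the cost of more bleed.
    # alpha > 1 sharpens the masks instead, giving harder separation.
    compressed = np.power(np.maximum(stem_mags, 0.0), alpha)
    return compressed / (compressed.sum(axis=0, keepdims=True) + eps)

# Hypothetical usage: apply the masks to the mixture STFT (complex, shape (freq, time))
# masks = alpha_wiener_masks(stem_mags, alpha=0.5)
# stem_stfts = masks * mixture_stft  # broadcasts over the stem axis
```

In this sketch, passing -w 0.5 on the command line would correspond to alpha = 0.5.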
Hi @jarredou,
We plan to release the source code and pretrained models after the publication of our paper "Toward Deep Drum Source Separation," which is currently under peer-review. Unfortunately, we do not have a set date yet.
Regardless, the full dataset is now freely available on Zenodo.