HTDemucs (Hybrid Transformer Demucs)

pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch

https://pytorch.org/audio

BSD 2-Clause "Simplified" License

HTDemucs (Hybrid Transformer Demucs) #2936

Open turian opened 1 year ago

turian commented 1 year ago

🚀 The feature

HTDemucs (Hybrid Transformer Demucs) model

Motivation, pitch

torchaudio currently supports HDemucs (Hybrid Demucs). Facebook has just released the code for HTDemucs (Hybrid Transformer Demucs), which is state of the art and far superior to HDemucs.

Alternatives

Use the facebook htdemucs code instead of torchaudio.

Additional context

htdemucs is now the default in FB's demucs repo and PyPI package.

carolineechen commented 1 year ago

Hi @turian, thanks for the feature request! TorchAudio generally aims to add models that have demonstrated more sustained success, which often means not adding new SoTA models when they initially come out. This is also dependent on user interest though, so we're curious what you're interested in using HTDemucs in torchaudio for?

And since we already have the HDemucs architecture, adding HTDemucs may not be too difficult, so we'll definitely keep this in mind and track it in the upcoming months and see if the team (or an external user) has the bandwidth to add it!

hmartiro commented 1 year ago

+1 for adding HTDemucs!

turian commented 3 weeks ago

@carolineechen HTDemucs has been SOTA for a while, apart from RoFormer, which is the new hotness. And RoFormer is only slightly better, takes much longer to train, and subjectively isn't as good for music-producer listeners because its spectrograms are too clean.

turian commented 3 weeks ago

@carolineechen Another benefit of HTDemucs versus HDemucs is that it is a much more flexible model to work with.

Section 3 of Rouard et al. 2022:

"Unlike the original Hybrid Demucs which required careful tuning of the model parameters (STFT window and hop length, stride, paddding [sic], etc.) to align the time and spectral representation, the cross-domain Transformer Encoder can work with heterogeneous data shape, making it a more flexible architecture."
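
To make the alignment constraint in that quote concrete, here is a rough sketch of the arithmetic involved. It is not taken from the demucs or torchaudio code. The helper names are hypothetical, and the numbers used (STFT hop length 1024, time-branch stride 4, five strided layers) are assumptions based on the commonly cited Hybrid Demucs defaults, with padding details simplified. The point is only that the time branch and the spectral branch must end up with the same number of frames before they can be summed, which ties the hop length to the stride and depth.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def time_branch_frames(num_samples: int, stride: int, depth: int) -> int:
    """Frames left in the time branch after `depth` strided layers.

    Simplifying assumption: each layer pads its input up to a multiple of
    `stride`, so every layer divides the length by `stride`, rounding up.
    """
    length = num_samples
    for _ in range(depth):
        length = ceil_div(length, stride)
    return length


def stft_frames(num_samples: int, hop_length: int) -> int:
    """Frames in the spectral branch, assuming the signal is padded to a
    whole number of hops."""
    return ceil_div(num_samples, hop_length)


def branches_align(num_samples: int, hop_length: int, stride: int, depth: int) -> bool:
    """True when both branches produce the same number of frames."""
    return time_branch_frames(num_samples, stride, depth) == stft_frames(num_samples, hop_length)


if __name__ == "__main__":
    n = 441_000  # ten seconds at 44.1 kHz
    # Assumed defaults: stride 4 over 5 layers gives a total stride of 1024.
    print(branches_align(n, hop_length=1024, stride=4, depth=5))
    # Changing only the hop length breaks the alignment.
    print(branches_align(n, hop_length=512, stride=4, depth=5))
```

Under these assumptions the two branches only line up when the hop length equals stride ** depth, which is the kind of hand tuning the quoted passage says the cross-domain Transformer Encoder in HTDemucs avoids.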