turian opened this issue 1 year ago
Hi @turian, thanks for the feature request! TorchAudio generally aims to add models that have demonstrated more sustained success, which often means not adding new SoTA models when they initially come out. This is also dependent on user interest though, so we're curious what you're interested in using HTDemucs in torchaudio for?
And since we already have the HDemucs architecture, adding HTDemucs may not be too difficult, so we'll definitely keep this in mind and track it in the upcoming months and see if the team (or an external user) has the bandwidth to add it!
+1 for adding HTDemucs!
@carolineechen it has been SOTA for a while now, aside from RoFormer, which is the new hotness. And RoFormer is only slightly better, takes much longer to train, and subjectively isn't as good to music-producer listeners because its spectrograms are too clean.
@carolineechen Another benefit of HTDemucs versus HDemucs is that it is a much more flexible model to work with.
Section 3 of Rouard et al. 2022:
"Unlike the original Hybrid Demucs which required careful tuning of the model parameters (STFT window and hop length, stride, paddding [sic], etc.) to align the time and spectral representation, the cross-domain Transformer Encoder can work with heterogeneous data shape, making it a more flexible architecture."
🚀 The feature
HTDemucs (Hybrid Transformer Demucs) model
Motivation, pitch
torchaudio currently supports HDemucs (Hybrid Demucs). Facebook has just released the code for HTDemucs (Hybrid Transformer Demucs), which is state of the art and far superior to HDemucs.
Alternatives
Use Facebook's htdemucs code directly instead of torchaudio.
Additional context
htdemucs is now the default in FB's demucs repo and pypi package.