Closed turian closed 1 year ago
@turian yea sure, i can loosen the restrictions and allow the discriminator to be passed in
@turian ok, you can pass it into the SoundStream as `stft_discriminator`, once you get the authors of that network to return feature maps
@lucidrains cool but shouldn't that be called efficient_at_discriminator?
Sure. If you like, I can share the wrapper that I wrote.
VERY IMPORTANT:
```python
# This might be slow to backprop through
if sample_rate == 32000:
    self.resampler = torch.nn.Identity()
else:
    self.resampler = torchaudio.transforms.Resample(
        orig_freq=sample_rate, new_freq=32000
    )
```
This matters since EfficientAT is trained on 32 kHz (32,000 Hz) audio. torchaudio's Resample passes gradients back, btw.
> @lucidrains cool but shouldn't that be called efficient_at_discriminator?
the idea is you can instantiate with any discriminator and just pass it in
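A hypothetical sketch of what such a pluggable discriminator might look like — the interface here (logits plus a list of intermediate feature maps) is an assumption for illustration, not the actual audiolm-pytorch API:

```python
import torch
from torch import nn

class WrappedDiscriminator(nn.Module):
    """Toy discriminator exposing (logits, feature_maps)."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(1, 16, kernel_size=15, stride=4, padding=7),
            nn.Conv1d(16, 32, kernel_size=15, stride=4, padding=7),
        ])
        self.to_logits = nn.Conv1d(32, 1, kernel_size=3, padding=1)

    def forward(self, x):
        feature_maps = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            feature_maps.append(x)  # collected for feature matching
        return self.to_logits(x), feature_maps

disc = WrappedDiscriminator()
logits, fmaps = disc(torch.randn(2, 1, 16000))  # (batch, channel, samples)
```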
Just a speculative idea, that I've been playing around with internally:
EfficientAT, particularly the "mn40_as_ext" model, is a very high-performing pretrained audio embedding model. It's an EfficientNet CNN vision model distilled from PaSST, which achieved the highest scores on FSD50K (general audio) class prediction in the HEAR benchmark. This group's code is also very easy to use, and they are very responsive on GitHub.
The bleeding-edge idea I propose as an option is to use EfficientAT (with default model "mn40_as_ext") as a discriminator, i.e. with only one class prediction. The only real change needed is this:
With that, you can also retrieve the feature maps for learning feature matching in the generator.
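For reference, a minimal sketch of a feature-matching loss over those maps — an L1 penalty per layer; the function name is my own, not from any library:

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(real_fmaps, fake_fmaps):
    # average L1 distance between discriminator activations for real
    # vs. generated audio; real maps are detached so only the
    # generator receives this gradient
    return sum(
        F.l1_loss(fake, real.detach())
        for real, fake in zip(real_fmaps, fake_fmaps)
    ) / len(real_fmaps)

maps = [torch.randn(2, 16, 100), torch.randn(2, 32, 25)]
zero = feature_matching_loss(maps, maps)  # identical maps -> zero loss
```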
In fact, my biggest reservation about this approach is that the loss drops so. damn. fast. because it's pretrained, making it hard for the generator to catch up unless the discriminator has a very, very slow learning rate, or you use TTUR or similar.
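One hedged way to express that mitigation: separate optimizers with imbalanced learning rates, in the spirit of TTUR. The models and exact rates below are placeholders:

```python
import torch
from torch import nn

generator = nn.Linear(8, 8)      # placeholder for the real generator
discriminator = nn.Linear(8, 1)  # placeholder for the pretrained discriminator

# the pretrained discriminator gets a much smaller step size
# so the generator has a chance to catch up
opt_g = torch.optim.Adam(generator.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-5)
```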