Closed bryanhpchiang closed 1 year ago
if it is too much trouble with the difference in sampling freqs or w/e, we can just stick with the soundstream + encodec route
i think mel specs will fall out of favor anyways
Yup! Will port over the original HiFi-GAN repository. The issue is that making the proposed test work for Voicebox will require retraining HiFi-GAN with an entirely new configuration (new sampling rate, segment size, etc.)
Might wanna take a look at vocos. https://github.com/charactr-platform/vocos It supports reconstruction from Melspectrogram and encodec.
oh nice, yea, Manmay also mentioned this in the discussion; will take a look
@lexkoro @manmay-nakhashi hey, vocos looks good! i think the maintainer knows what he's doing
could you audiophiles explain why one would decode using vocos instead of just using the trained encodec decoder?
@lucidrains
@manmay-nakhashi thank you! btw, put up a sponsors button. i'd like to give more than just a few shallow appreciation bulletpoints :smile:
ok decided to go with the Encodec / Vocos pair for starters
```python
import torch
from audiolm_pytorch.data import SoundDataset, get_dataloader
from voicebox_pytorch.voicebox_pytorch import EncodecVoco

# dataset resampled to 24kHz, the sampling rate Encodec expects
ds = SoundDataset(
    '/path/to/LibriSpeech',
    target_sample_hz = 24000
)

dl = get_dataloader(ds, batch_size = 4, pad_to_longest = False)

audio, = next(iter(dl))

# round trip: raw audio -> encodec latents -> vocos reconstruction
model = EncodecVoco()
encoded = model.encode(audio)
recon_audio = model.decode(encoded)
```
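To make "the reconstruction is close to the input" concrete, here is a minimal, framework-agnostic sketch of a waveform error metric one could apply to `audio` vs `recon_audio` after flattening them to plain lists. This is illustrative only; the function name and tolerance are assumptions, not part of the repository.

```python
import math

def rmse(reference, reconstruction):
    # root mean square error between two equal-length waveforms
    assert len(reference) == len(reconstruction)
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    )

# identical signals reconstruct perfectly
clean = [math.sin(t / 5) for t in range(1000)]
assert rmse(clean, clean) == 0.0

# a constant 0.01 offset yields an RMSE of exactly 0.01
noisy = [s + 0.01 for s in clean]
```

In practice one would compare against a perceptual metric as well, since a small RMSE does not guarantee good perceived audio quality.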
going with mel spec + vocos for now
nice start!
so i was thinking, to support both mel spec w/ hifigan, as well as raw audio with soundstream / encodec, we should just make two modules, each with `.encode` and `.decode` methods

then we have a simple test that just does

module.decode(module.encode(raw_wave)) ~= raw_wave
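The proposed round-trip test could be sketched like this, with a toy stand-in codec (the class and helper names below are hypothetical, not from the repo) standing in for EncodecVoco or a mel + HiFi-GAN module:

```python
import math

# hypothetical stand-in for a real codec whose encode/decode
# pair should approximately invert each other
class ToyCodec:
    def encode(self, wave):
        # fake "latent": coarsely quantized samples
        return [round(s, 2) for s in wave]

    def decode(self, latents):
        return list(latents)

def round_trip_close(module, wave, tolerance = 0.05):
    # the proposed test: decode(encode(wave)) should be ~= wave
    recon = module.decode(module.encode(wave))
    return all(abs(a - b) < tolerance for a, b in zip(wave, recon))

wave = [math.sin(t / 10) for t in range(100)]
assert round_trip_close(ToyCodec(), wave)
```

Any module exposing that `.encode` / `.decode` interface, whether mel-based or RVQ-based, could then be dropped into the same test.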
do you think you want to give the mel spec + hifi gan route a try?
i can get the soundstream + encodec / residual VQ stuff done by next week's end