lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
MIT License
589 stars, 49 forks

log mel func in torch #2

Closed bryanhpchiang closed 1 year ago

lucidrains commented 1 year ago

nice start!

so i was thinking, to support both mel spec w/ hifigan, as well as raw audio with soundstream / encodec, we should just make two modules, with .encode and .decode methods

then we have a simple test that just does

module.decode(module.encode(raw_wave)) ~= raw_wave
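
A minimal sketch of that round-trip check, using only the standard library. The names `AudioEncoderDecoder`, `MuLawCodec` and `roundtrip_close` are hypothetical and not part of this repository. The 8-bit mu-law codec is only a toy stand-in for a real mel spec + vocoder or Encodec module, chosen because it is lossy, so the comparison has to be approximate, as in the `~=` above.

```python
import math
from abc import ABC, abstractmethod


class AudioEncoderDecoder(ABC):
    """Hypothetical shared interface: anything with .encode and .decode."""

    @abstractmethod
    def encode(self, raw_wave):
        ...

    @abstractmethod
    def decode(self, encoded):
        ...


class MuLawCodec(AudioEncoderDecoder):
    """Toy lossy codec (8-bit mu-law), standing in for a real vocoder or codec."""

    def __init__(self, mu=255):
        self.mu = mu

    def encode(self, raw_wave):
        mu = self.mu
        codes = []
        for x in raw_wave:
            # compress a sample in [-1, 1] to [-1, 1], then quantize to 0..mu
            y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
            codes.append(int(round((y + 1) / 2 * mu)))
        return codes

    def decode(self, encoded):
        mu = self.mu
        wave = []
        for q in encoded:
            # undo the quantization, then expand back to a sample in [-1, 1]
            y = 2 * q / mu - 1
            wave.append(math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y))
        return wave


def roundtrip_close(module, raw_wave, atol):
    """True if module.decode(module.encode(raw_wave)) matches raw_wave within atol."""
    recon = module.decode(module.encode(raw_wave))
    return len(recon) == len(raw_wave) and all(
        abs(a - b) <= atol for a, b in zip(raw_wave, recon)
    )


# 10 ms of a 440 Hz sine at 24 kHz, amplitude 0.5
raw_wave = [0.5 * math.sin(2 * math.pi * 440 * t / 24000) for t in range(240)]
print(roundtrip_close(MuLawCodec(), raw_wave, atol=0.02))
```

For a neural vocoder the waveform will generally not match sample by sample, so the tolerance (or a spectral distance instead of a per-sample one) would need to be chosen per module.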

do you think you want to give the mel spec + hifi gan route a try?

i can get the soundstream + encodec / residual VQ stuff done by next week's end

lucidrains commented 1 year ago

if it is too much trouble with the difference in sampling freqs or w/e, we can just stick with the soundstream + encodec route

i think mel specs will fall out of favor anyways

lucidrains commented 1 year ago

[Image: Screen Shot 2023-08-06 at 10 17 08 AM]

bryanhpchiang commented 1 year ago

Yup! Will port over the original HiFi-GAN repository. The issue is that making the proposed test work for Voicebox will require retraining HiFi-GAN with an entirely new configuration (new sampling rate, segment size, etc.)
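
To illustrate why the configuration cannot simply be edited in place, here is a small sketch of the consistency constraints in a HiFi-GAN-style config: the generator's upsampling factors must multiply to the mel hop size, and training segments should be a whole number of frames. The helper `check_vocoder_config` is hypothetical. The `hifigan_v1` values are the defaults of the original repository as best recalled, and the 16 kHz target values are assumptions for illustration only, not a confirmed Voicebox configuration.

```python
import math


def check_vocoder_config(cfg):
    """Return a list of problems with a HiFi-GAN-style config (empty if consistent)."""
    problems = []
    total_upsample = math.prod(cfg["upsample_rates"])
    if total_upsample != cfg["hop_size"]:
        # each mel frame must be upsampled to exactly hop_size audio samples
        problems.append(
            f"upsample product {total_upsample} != hop_size {cfg['hop_size']}"
        )
    if cfg["segment_size"] % cfg["hop_size"] != 0:
        # training segments should cover a whole number of mel frames
        problems.append(
            f"segment_size {cfg['segment_size']} is not a multiple of hop_size"
        )
    return problems


# defaults of the original HiFi-GAN V1 config, as recalled (assumption)
hifigan_v1 = dict(sampling_rate=22050, hop_size=256, segment_size=8192,
                  upsample_rates=[8, 8, 2, 2])

# naive edit: change only the sampling rate and hop size
naive = dict(hifigan_v1, sampling_rate=16000, hop_size=160)

# hypothetical consistent target at 16 kHz and 100 frames per second
target = dict(sampling_rate=16000, hop_size=160, segment_size=8000,
              upsample_rates=[5, 4, 4, 2])

for name, cfg in [("hifigan_v1", hifigan_v1), ("naive", naive), ("target", target)]:
    frames_per_second = cfg["sampling_rate"] / cfg["hop_size"]
    print(name, round(frames_per_second, 2), check_vocoder_config(cfg))
```

Because the upsampling factors define the generator architecture, changing them means the pretrained weights no longer fit, which is why a new configuration implies retraining.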

lexkoro commented 1 year ago

Might wanna take a look at vocos. https://github.com/charactr-platform/vocos It supports reconstruction from Melspectrogram and encodec.

lucidrains commented 1 year ago

oh nice, yea, Manmay also mentioned this in the discussion; will take a look

lucidrains commented 1 year ago

@lexkoro @manmay-nakhashi hey, vocos looks good! i think the maintainer knows what he's doing

could you audiophiles explain why one would decode using vocos instead of just using the trained encodec decoder?

manmay-nakhashi commented 1 year ago

@lucidrains

  1. they are claiming better UTMOS and better Lstft than the encodec decoder.
  2. the same framework can work with both spectrograms and codecs.

lucidrains commented 1 year ago

@manmay-nakhashi thank you! btw, put up a sponsors button. i'd like to give more than just a few shallow appreciation bulletpoints :smile:

lucidrains commented 1 year ago

ok decided to go with Encodec / Vocos pair for starters

import torch
from audiolm_pytorch.data import SoundDataset, get_dataloader
from voicebox_pytorch.voicebox_pytorch import EncodecVoco

ds = SoundDataset(
    '/path/to/LibriSpeech',
    target_sample_hz = 24000
)

dl = get_dataloader(ds, batch_size = 4, pad_to_longest = False)
audio, = next(iter(dl))

model = EncodecVoco()

encoded = model.encode(audio)
recon_audio = model.decode(encoded)

lucidrains commented 1 year ago

going with mel spec + vocos for now