v-iashin / SpecVQGAN

Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
https://v-iashin.github.io/SpecVQGAN
MIT License

Reconstruct mel spectrogram from librosa #11

Closed clairerity closed 2 years ago

clairerity commented 2 years ago

Hello! First of all, thanks for this wonderful repo. I would just like to ask how to reconstruct the mel spectrogram I generated with librosa. I can do this via VQGAN using this code:

def reconstruct_with_vqgan(x, model):
    # encode the input into the quantized latent, then decode it back
    z, _, [_, _, indices] = model.encode(x)
    xrec = model.decode(z)
    return xrec

Here, xrec is the reconstructed image (from VQGAN).

I also add a preprocessing step before reconstructing, using this code (the same one from DALL-E's VQ-VAE):

def preprocess(img):
    # img is a PIL image; target_image_size, PIL, torch, and the
    # torchvision transforms (T, TF) are assumed to be defined/imported
    s = min(img.size)

    if s < target_image_size:
        raise ValueError(f'min dim for image {s} < {target_image_size}')

    # resize so that the shorter side equals target_image_size
    r = target_image_size / s
    s = (round(r * img.size[1]), round(r * img.size[0]))
    img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
    # img = TF.center_crop(img, output_size=2 * [target_image_size])
    img = torch.unsqueeze(T.ToTensor()(img), 0)  # add a batch dimension
    return img

In the end, I just call these two functions to reconstruct the image:

img = PIL.Image.open(image).convert("RGB") # input is the mel spectrogram in image form
x_vqgan = preprocess(img)
x_vqgan = x_vqgan.to(DEVICE)

x2 = reconstruct_with_vqgan(x_vqgan, model32x32) # model32x32 is the VQGAN model
x2 = custom_to_pil(x2[0]) # final reconstructed image 

I was wondering how I could use your model to reconstruct in a similar way. I checked the demo and saw that it extracts the audio from a video; I am wondering how I can directly reconstruct a mel spectrogram generated with librosa.

Thank you very much in advance :D

v-iashin commented 2 years ago

Hi, thanks a lot. I am glad you like it 🙂

If I understand you correctly, you would like to reconstruct a Mel spectrogram that you obtained from a wav file using librosa.

However, the demo (in this cell; is the output of that cell what you want?) also extracts the mel spectrogram from the raw audio using librosa:

https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/feature_extraction/demo_utils.py#L348-L353

This calls get_spectrogram(), whose implementation is here: https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/feature_extraction/extract_mel_spectrogram.py#L166-L187

And here are the transforms you need to apply in order to convert the sound samples to a Mel spectrogram: https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/feature_extraction/extract_mel_spectrogram.py#L141-L151

Just make sure your mel spectrogram is extracted with the same parameters and that you apply the same transforms (log, scaling, etc.; see TRANSFORMS(x)).
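
Something like the following sketch might work. I have not run it; the mel extraction parameters (sample rate, n_fft, hop size, number of mel bins) and the final [-1, 1] scaling are assumptions here, so double-check them against extract_mel_spectrogram.py and the dataset code:

import librosa
import torch

# TRANSFORMS is the Compose linked above (log, scaling, clipping, ...)
from feature_extraction.extract_mel_spectrogram import TRANSFORMS

def wav_to_model_input(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # NOTE: sr / n_fft / hop_length / n_mels are placeholders; use the exact
    # values from extract_mel_spectrogram.py so the codebook sees familiar inputs
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel = TRANSFORMS(mel)               # same transforms as in the repo
    x = torch.from_numpy(mel).float()
    x = 2 * x - 1                       # assumption: model expects specs in [-1, 1]
    return x[None, None]                # (batch, channel, mel_bins, time)

def reconstruct_with_specvqgan(x, model):
    # the codebook model exposes encode/decode, just like the VQGAN you used
    quant_z, _, info = model.encode(x)
    xrec = model.decode(quant_z)
    return xrec

Since librosa.feature.melspectrogram might not match the repo's own mel extraction exactly (power vs amplitude, normalisation), it is probably safer to call get_spectrogram() from extract_mel_spectrogram.py directly and only handle the tensor conversion yourself.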

v-iashin commented 2 years ago

Also, check whether the Neural Audio Codec Colab demo makes it any clearer.
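
If you only need the codebook in a script, loading it roughly follows the taming-transformers pattern; the paths and the import below are placeholders/assumptions, and the Colab shows the exact calls:

import torch
from omegaconf import OmegaConf
# assumption: instantiate_from_config is importable from train.py, as in taming-transformers
from train import instantiate_from_config

config = OmegaConf.load('<path_to_model_dir>/configs/<config>.yaml')  # placeholder path
model = instantiate_from_config(config.model)
ckpt = torch.load('<path_to_model_dir>/checkpoints/<ckpt>.ckpt', map_location='cpu')
model.load_state_dict(ckpt['state_dict'], strict=False)  # strict=False skips loss/discriminator keys if present
model.eval()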

clairerity commented 2 years ago

Hello, thank you very much for these! Will check them out! :D