jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License
1.97k stars 507 forks

Hardcoded num_mels to 80? #166

Open bzp83 opened 5 months ago

bzp83 commented 5 months ago

https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L81

Hi, why is 80 hardcoded here? Should it match num_mels?

Thanks

harsh40c commented 5 months ago

Hey bro, I tried this repo's code and ran into the same error. I used librosa instead of tacotron2 to generate the mel spectrograms, and mine have a shape of (128×387). Since 80 is hardcoded as shown above, and changing it there doesn't fix the error because many other places would need to change too, I set n_mels to 80 when generating the mel spectrograms with librosa. That solves this error, but now I'm getting a cuDNN error because the CUDA and cuDNN versions they used are incompatible with my GPU (an RTX 3090). With a newer PyTorch built for CUDA 11.1 and the matching cuDNN version, I get a kernel error (something like "no available kernel"), and with the old version I get a CUDNN_EXECUTION_FAILED error. If you have any solution for this, please tell me. As for your question: as I said, set n_mels to 80 when generating the spectrograms to solve the issue.

bzp83 commented 5 months ago

Yes... and to confuse me even more, vits changes the hifi-gan code slightly and uses "initial_channel" (https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L249) instead of the hardcoded 80... I'm having a hard time figuring it out.
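
To make the channel question concrete, here is a minimal sketch (not the repo's actual Generator) of why a 128-bin mel spectrogram fails against a first convolution whose input channels are fixed at 80. The values 128 and 387 come from the shape mentioned earlier in this thread; upsample_initial_channel = 512 is an assumed value for illustration, and weight_norm is left out for brevity.

```python
import torch
from torch.nn import Conv1d

num_mels = 128                  # mel bins in the spectrogram, as in the (128×387) case above
upsample_initial_channel = 512  # assumed value, for illustration only

# Input layout expected by Conv1d: (batch, channels, frames).
mel = torch.randn(1, num_mels, 387)

# First layer with the input channel count fixed at 80.
hardcoded = Conv1d(80, upsample_initial_channel, 7, 1, padding=3)
# Same layer, but sized from the number of mel bins.
configurable = Conv1d(num_mels, upsample_initial_channel, 7, 1, padding=3)

try:
    hardcoded(mel)
    hardcoded_ok = True
except RuntimeError as err:
    # Conv1d rejects inputs whose channel count differs from in_channels.
    hardcoded_ok = False
    print("hardcoded 80 failed:", err)

out = configurable(mel)
out_shape = tuple(out.shape)
print("configurable output shape:", out_shape)
```

With kernel size 7, stride 1, and padding 3, the number of frames is unchanged, so only the channel dimension differs between input and output. Whatever feature is fed to the generator (mel bins here), the first layer's input channel count has to match it.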

Anyway, yes I solved the problem and it works great on my rtx4090:

1 - update your requirements.txt to the list below; with no version pins, this will install the latest version of each package:

numpy
librosa
scipy
tensorboard
soundfile
matplotlib

2 - install the latest pytorch, e.g. for 2.3.1 with cuda 12.1: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

3 - update your mel_spectrogram method in meldataset.py to:

def mel_spectrogram(
    y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
):
    if torch.min(y) < -1.0:
        print("min value is ", torch.min(y))
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))

    global mel_basis, hann_window
    dtype_device = str(y.dtype) + "_" + str(y.device)
    fmax_dtype_device = str(fmax) + "_" + dtype_device
    wnsize_dtype_device = str(win_size) + "_" + dtype_device
    if fmax_dtype_device not in mel_basis:
        mel = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
        )
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).type_as(y)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).type_as(y)

    y = torch.nn.functional.pad(
        y.unsqueeze(1),
        (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
        mode="reflect",
    )
    y = y.squeeze(1)

    spec = torch.view_as_real(
        torch.stft(
            y,
            n_fft,
            hop_length=hop_size,
            win_length=win_size,
            window=hann_window[wnsize_dtype_device],
            center=center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
    )

    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)

    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
    spec = spectral_normalize_torch(spec)

    return spec

that should do it!
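
As a sanity check on the padding in the function above, here is a small sketch of the frame-count arithmetic. It assumes center=False, so the STFT yields 1 + (length - n_fft) // hop_size frames after the manual reflect padding. The helper name num_frames is hypothetical, and the 22.05 kHz values (8192, 1024, 256) are illustrative assumptions rather than something stated in this thread.

```python
def num_frames(num_samples, n_fft, hop_size):
    """Frames from an STFT with center=False after reflect-padding
    both ends by (n_fft - hop_size) / 2, as mel_spectrogram does."""
    pad = int((n_fft - hop_size) / 2)
    padded = num_samples + 2 * pad
    return (padded - n_fft) // hop_size + 1


# Values from the 44.1 kHz config in this thread.
frames_44k = num_frames(16384, n_fft=2048, hop_size=512)
# Illustrative 22.05 kHz-style values (assumed).
frames_22k = num_frames(8192, n_fft=1024, hop_size=256)

print(frames_44k, frames_22k)
```

In both cases the result equals num_samples // hop_size, so the padding keeps one mel frame per hop of audio.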

bzp83 commented 5 months ago

btw... I managed to train a model with 128 mels at 44100 Hz using the config below. I also had to change that hardcoded 80 to 128, or just do self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)), so I suspect it is indeed num_mels... but as I said, vits uses initial_channel, which seems to be 192 in all the configs, while num_mels is 80 😵

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 8,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999875,
    "seed": 1234,
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_kernel_sizes": [
      16,
      16,
      4,
      4,
      4
    ],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "segment_size": 16384,
    "num_mels": 128,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 512,
    "win_size": 2048,
    "sampling_rate": 44100,
    "fmin": 0,
    "fmax": 22050,
    "fmax_for_loss": null,
    "num_workers": 16,
    "dist_config": {
      "dist_backend": "nccl",
      "dist_url": "tcp://localhost:54321",
      "world_size": 1
    }
}
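
When changing the sampling rate or mel settings like this, a few values in the config have to stay consistent with each other. The sketch below is a hypothetical helper (check_config is not part of the repo) that checks a subset of the config above. The main check is that the upsample_rates multiply to hop_size, since the generator turns each mel frame into that many audio samples; the other checks are reasonable sanity checks rather than documented requirements.

```python
import json
from math import prod

# Subset of the 44.1 kHz config quoted above; only the keys the checks need.
CONFIG_TEXT = """
{
    "upsample_rates": [8, 8, 2, 2, 2],
    "segment_size": 16384,
    "num_mels": 128,
    "n_fft": 2048,
    "hop_size": 512,
    "win_size": 2048,
    "sampling_rate": 44100,
    "fmin": 0,
    "fmax": 22050
}
"""


def check_config(h):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    total_upsampling = prod(h["upsample_rates"])
    if total_upsampling != h["hop_size"]:
        problems.append(
            f"upsample_rates multiply to {total_upsampling}, "
            f"but hop_size is {h['hop_size']}"
        )
    if h["segment_size"] % h["hop_size"] != 0:
        problems.append("segment_size is not a multiple of hop_size")
    if h["win_size"] > h["n_fft"]:
        problems.append("win_size is larger than n_fft")
    if h["fmax"] is not None and h["fmax"] > h["sampling_rate"] / 2:
        problems.append("fmax is above the Nyquist frequency")
    return problems


config = json.loads(CONFIG_TEXT)
problems = check_config(config)
print(problems or "config looks consistent")
```

For the config above, 8 × 8 × 2 × 2 × 2 = 512 matches hop_size, so the list of problems is empty.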

harsh40c commented 5 months ago

Hey man, thanks for the solution, it worked. It's just consuming too much GPU memory, and since other training runs are going on our server, I'll start training once the GPU is free. Hopefully it will train properly then. Anyway, thanks a bunch.