bzp83 opened 5 months ago
Hey, I tried this repo's code and ran into the same error. I used librosa instead of Tacotron 2 for mel spectrogram generation, and my spectrograms have a shape of (128×387). Since the channel count is hardcoded to 80 (as shown above), and changing it there alone doesn't fix the error because several other places would need to change too, I instead set n_mels to 80 when generating the mel spectrograms with librosa. That fixed this error, but now I'm getting a cuDNN error, since the CUDA and cuDNN versions the repo uses are incompatible with my GPU (an RTX 3090). With a newer PyTorch built for CUDA 11.1 and the matching cuDNN, I get a "no available kernel" error; with the old versions I get CUDNN_EXECUTION_FAILED. If you have any solution for this, please let me know. As for your query: as I said, set n_mels to 80 when generating the spectrograms to solve the issue.
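For example, a rough sketch of generating 80-band log mels with librosa (the filename and the STFT settings here are placeholders, not values taken from this repo):

```python
import librosa
import numpy as np

# "sample.wav" is a placeholder path; sr/n_fft/hop/win are illustrative values.
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0, fmax=8000,
)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression, similar to hifi-gan's
print(log_mel.shape)  # -> (80, n_frames)
```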
yes... and to confuse me even more, VITS changes the HiFi-GAN code slightly and uses "initial_channel" (https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L249) instead of the hardcoded 80... I'm having a hard time figuring it out.
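For reference, the linked VITS line reads roughly like this (quoted from the link, where initial_channel is a constructor argument of VITS's Generator; double-check against the source):

```python
self.conv_pre = Conv1d(initial_channel, upsample_initial_channel, 7, 1, padding=3)
```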
Anyway, yes, I solved the problem and it works great on my RTX 4090:
1 - Update your requirements.txt to the list below; this will install the latest versions of these packages:

```
numpy
librosa
scipy
tensorboard
soundfile
matplotlib
```
2 - Install the latest PyTorch; e.g., for 2.3.1 with CUDA 12.1:

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
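You can sanity-check that the install actually sees your GPU (a generic check, nothing repo-specific):

```python
import torch

print(torch.__version__)               # e.g. "2.3.1+cu121"
print(torch.cuda.is_available())       # should print True
print(torch.backends.cudnn.version())  # the cuDNN build PyTorch ships with
print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 4090"
```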
3 - Update the mel_spectrogram function in meldataset.py to:
```python
def mel_spectrogram(
    y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
):
    # Warn if the waveform is not normalized to [-1, 1].
    if torch.min(y) < -1.0:
        print("min value is ", torch.min(y))
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))

    # Cache the mel filterbank and Hann window per dtype/device
    # (keyed additionally on fmax and win_size).
    global mel_basis, hann_window
    dtype_device = str(y.dtype) + "_" + str(y.device)
    fmax_dtype_device = str(fmax) + "_" + dtype_device
    wnsize_dtype_device = str(win_size) + "_" + dtype_device
    if fmax_dtype_device not in mel_basis:
        mel = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
        )
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).type_as(y)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).type_as(y)

    # Reflect-pad so frames line up the same way as in the original code.
    y = torch.nn.functional.pad(
        y.unsqueeze(1),
        (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
        mode="reflect",
    )
    y = y.squeeze(1)

    # Newer PyTorch requires return_complex=True; view_as_real restores the
    # old (..., 2) real/imaginary layout the rest of the math expects.
    spec = torch.view_as_real(
        torch.stft(
            y,
            n_fft,
            hop_length=hop_size,
            win_length=win_size,
            window=hann_window[wnsize_dtype_device],
            center=center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
    )
    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)  # magnitude spectrogram
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)  # project onto mel bands
    spec = spectral_normalize_torch(spec)  # dynamic range compression
    return spec
```
that should do it!
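As a quick smoke test of the patched function (a hypothetical check, run inside meldataset.py's module scope where mel_basis = {} and hann_window = {} are defined; the numbers match the 44.1 kHz / 128-mel config below):

```python
import torch

# One second of fake audio in [-1, 1]; real training obviously uses actual wavs.
y = torch.rand(1, 44100) * 2.0 - 1.0
mel = mel_spectrogram(y, 2048, 128, 44100, 512, 2048, 0, 22050)
print(mel.shape)  # -> torch.Size([1, 128, n_frames])
```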
btw... I managed to train a model with 128 mels at 44100 Hz using the config below. I also had to change that hardcoded 80 to 128, or equivalently do `self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))`,
so I suspect it is indeed num_mels... but as I said, VITS uses initial_channel, which seems to always be 192 in the configs, while num_mels is 80 😵
```json
{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 8,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999875,
    "seed": 1234,
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4, 4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "segment_size": 16384,
    "num_mels": 128,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 512,
    "win_size": 2048,
    "sampling_rate": 44100,
    "fmin": 0,
    "fmax": 22050,
    "fmax_for_loss": null,
    "num_workers": 16,
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
```
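One generic sanity check for a custom config like this (not something from the repo, just the constraint HiFi-GAN's generator imposes): the product of upsample_rates must equal hop_size, so each mel frame is upsampled to exactly one hop of audio samples:

```python
import json
import math

# "config.json" is a placeholder path for the config above.
with open("config.json") as f:
    h = json.load(f)

# 8 * 8 * 2 * 2 * 2 == 512 == hop_size
assert math.prod(h["upsample_rates"]) == h["hop_size"]
```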
Hey man, thanks for the solution, it worked! It's just consuming a lot of GPU memory, and since other trainings are running on our server machine, I'll start this training once a GPU is free. Hopefully it will train properly then. Anyway, thanks a bunch.
https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L81
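For context, the referenced line is (as of that commit):

```python
self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
```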
Hi, why is 80 hardcoded here? Should it match num_mels?
Thanks