cuFFT error: CUFFT_INTERNAL_ERROR from training script

johnPertoft commented 3 months ago

I'm trying to run the provided training script but I'm running into the error in the title. It happens in a call to `torch.stft(...)` in `melo/mel_processing.py:mel_spectrogram_torch(...)`. All tensors going into this function seem to be on the correct device. Have you seen this error before, and if so, are there any known fixes?
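To separate the failing call from the rest of the training pipeline, a standalone check along these lines may help. It is only a sketch: it is not the MeloTTS code, the function name `stft_smoke_test` is made up for illustration, and the random input and Hann window are assumptions. The sizes are taken from the `filter_length`, `hop_length`, `win_length`, and `segment_size` values in the config below.

```python
from typing import Optional, Tuple


def stft_smoke_test(device: Optional[str] = None) -> Tuple[str, Tuple[int, ...]]:
    """Run a single torch.stft call and return (device, output shape).

    Hypothetical helper for isolating the cuFFT call; not part of MeloTTS.
    """
    # Imported here so a missing torch install fails inside the check itself.
    import torch

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Sizes copied from the posted config.json.
    n_fft, hop_length, win_length = 2048, 512, 2048
    segment_size = 16384

    # Assumption: random audio is enough to exercise the FFT backend.
    y = torch.randn(1, segment_size, device=device)
    window = torch.hann_window(win_length, device=device)

    # On a CUDA device this call goes through cuFFT.
    spec = torch.stft(
        y,
        n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
        return_complex=True,
    )
    return device, tuple(spec.shape)


if __name__ == "__main__":
    print(stft_smoke_test())
```

If this fails on the GPU with the same cuFFT error but succeeds with `stft_smoke_test("cpu")`, the problem is likely in the torch/CUDA build rather than in the training data or config.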

System:

Ubuntu
Nvidia L40s GPU (545.29.06 driver)

Verified that torch can talk to gpu:

```
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.version.cuda)"
1.13.1+cu117
True
11.7
```
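The thread does not confirm a root cause. One plausible reading is that the L40S, like the RTX 4090 mentioned later, is an Ada-generation GPU, and CUDA added support for that architecture in 11.8, so a `cu117` build may lack working cuFFT support for it. The sketch below only parses the CUDA tag out of a torch version string and compares it with that assumed minimum. The function names and the `ASSUMED_MIN_CUDA_FOR_ADA` constant are illustrative, not part of torch.

```python
import re
from typing import Optional, Tuple

# Assumption (not confirmed in this thread): Ada GPUs need a CUDA 11.8+ build.
ASSUMED_MIN_CUDA_FOR_ADA = (11, 8)


def cuda_version_from_torch_tag(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a build tag such as '1.13.1+cu117'."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        # No +cuXYZ suffix: CPU-only wheel or a build without a CUDA tag.
        return None
    digits = match.group(1)
    # The last digit is the minor version: cu117 -> (11, 7), cu121 -> (12, 1).
    return int(digits[:-1]), int(digits[-1])


def meets_assumed_minimum(torch_version: str) -> bool:
    """True if the tagged CUDA version is at least the assumed minimum."""
    cuda = cuda_version_from_torch_tag(torch_version)
    return cuda is not None and cuda >= ASSUMED_MIN_CUDA_FOR_ADA


if __name__ == "__main__":
    for version in ("1.13.1+cu117", "2.0.1+cu118"):
        print(version, meets_assumed_minimum(version))
```

The two versions in the loop are the ones reported in this thread: the failing setup and the one that works on an RTX 4090.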

config.json:

```json
{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10,
    "learning_rate": 0.0003,
    "betas": [0.8, 0.99],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "../../data/June-Restrained/train.list",
    "validation_files": "../../data/June-Restrained/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {"June": 0}
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16, 16, 8, 2, 2],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 8,
  "num_tones": 16,
  "symbols": [
    "_", "\"", "(", ")", "*", "/", ":", "AA", "E", "EE", "En", "N", "OO", "Q", "V", "[", "\\", "]", "^",
    "a", "a:", "aa", "ae", "ah", "ai", "an", "ang", "ao", "aw", "ay", "b", "by", "c", "ch", "d", "dh", "dy",
    "e", "e:", "eh", "ei", "en", "eng", "er", "ey", "f", "g", "gy", "h", "hh", "hy",
    "i", "i0", "i:", "ia", "ian", "iang", "iao", "ie", "ih", "in", "ing", "iong", "ir", "iu", "iy",
    "j", "jh", "k", "ky", "l", "m", "my", "n", "ng", "ny", "o", "o:", "ong", "ou", "ow", "oy",
    "p", "py", "q", "r", "ry", "s", "sh", "t", "th", "ts", "ty",
    "u", "u:", "ua", "uai", "uan", "uang", "uh", "ui", "un", "uo", "uw", "v", "van", "ve", "vn",
    "w", "x", "y", "z", "zh", "zy", "~",
    "æ", "ç", "ð", "ø", "ŋ", "œ", "ɐ", "ɑ", "ɒ", "ɔ", "ɕ", "ə", "ɛ", "ɜ", "ɡ", "ɣ", "ɥ", "ɦ", "ɪ", "ɫ", "ɬ", "ɭ", "ɯ", "ɲ", "ɵ", "ɸ", "ɹ", "ɾ",
    "ʁ", "ʃ", "ʊ", "ʌ", "ʎ", "ʏ", "ʑ", "ʒ", "ʝ", "ʲ", "ˈ", "ˌ", "ː", "̃", "̩", "β", "θ",
    "ᄀ", "ᄁ", "ᄂ", "ᄃ", "ᄄ", "ᄅ", "ᄆ", "ᄇ", "ᄈ", "ᄉ", "ᄊ", "ᄋ", "ᄌ", "ᄍ", "ᄎ", "ᄏ", "ᄐ", "ᄑ", "ᄒ",
    "ᅡ", "ᅢ", "ᅣ", "ᅤ", "ᅥ", "ᅦ", "ᅧ", "ᅨ", "ᅩ", "ᅪ", "ᅫ", "ᅬ", "ᅭ", "ᅮ", "ᅯ", "ᅰ", "ᅱ", "ᅲ", "ᅳ", "ᅴ", "ᅵ",
    "ᆨ", "ᆫ", "ᆮ", "ᆯ", "ᆷ", "ᆸ", "ᆼ", "ㄸ",
    "!", "?", "…", ",", ".", "'", "-", "¿", "¡", "SP", "UNK"
  ]
}
```

Pip freeze:

```
absl-py==2.1.0
aiofiles==23.2.1
altair==5.2.0
annotated-types==0.6.0
anyascii==0.3.2
anyio==4.3.0
asttokens==2.4.1
attrs==23.2.0
audioread==3.0.1
Babel==2.14.0
boto3==1.34.67
botocore==1.34.67
cached-path==1.6.2
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cn2an==0.5.22
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
dateparser==1.1.8
decorator==5.1.1
Deprecated==1.2.14
Distance==0.1.3
distro==1.9.0
docopt==0.6.2
eng-to-ipa==0.0.2
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.110.0
fastcore==1.5.29
ffmpy==0.3.2
filelock==3.13.1
fonttools==4.50.0
fsspec==2024.3.1
fugashi==1.3.0
g2p-en==2.1.0
g2pkk==0.1.2
google-api-core==2.17.1
google-auth==2.29.0
google-cloud-core==2.4.1
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
gradio-client==0.13.0
gradio==4.22.0
grpcio==1.62.1
gruut-ipa==0.13.0
gruut-lang-de==2.0.0
gruut-lang-en==2.0.0
gruut-lang-es==2.0.0
gruut-lang-fr==2.0.2
gruut==2.2.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.21.4
idna==3.6
importlib-metadata==7.1.0
importlib-resources==6.4.0
inflect==7.0.0
ipython==8.18.1
jaconv==0.3.4
jamo==0.4.1
jedi==0.19.1
jieba==0.42.1
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
jsonlines==1.2.0
jsonschema-specifications==2023.12.1
jsonschema==4.21.1
kiwisolver==1.4.5
langid==1.1.6
librosa==0.9.1
llvmlite==0.42.0
loguru==0.7.2
lovely-numpy==0.2.11
lovely-tensors==0.1.15
markdown-it-py==3.0.0
Markdown==3.6
MarkupSafe==2.1.5
matplotlib-inline==0.1.6
matplotlib==3.8.3
mdurl==0.1.2
mecab-python3==1.0.5
melotts @ file:///workspaces/tts-finetuning/MeloTTS
networkx==2.8.8
nltk==3.8.1
num2words==0.5.12
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
openai==1.14.2
orjson==3.9.15
packaging==24.0
pandas==2.2.1
parso==0.8.3
pexpect==4.9.0
pillow==10.2.0
pip==24.0
plac==1.4.3
platformdirs==4.2.0
pooch==1.8.1
proces==0.1.7
prompt-toolkit==3.0.43
protobuf==4.25.3
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1-modules==0.3.0
pyasn1==0.5.1
pycparser==2.21
pydantic-core==2.16.3
pydantic==2.6.4
pydub==0.25.1
Pygments==2.17.2
pykakasi==2.2.1
pyparsing==3.1.2
pypinyin==0.50.0
python-crfsuite==0.9.10
python-dateutil==2.9.0.post0
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
resampy==0.4.3
rich==13.7.1
rpds-py==0.18.0
rsa==4.9
ruff==0.3.3
s3transfer==0.10.1
scikit-learn==1.4.1.post1
scipy==1.12.0
semantic-version==2.10.0
setuptools==69.2.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
soundfile==0.12.1
stack-data==0.6.3
starlette==0.36.3
tensorboard-data-server==0.7.2
tensorboard==2.16.2
threadpoolctl==3.4.0
tokenizers==0.13.3
tomlkit==0.12.0
toolz==0.12.1
torch==1.13.1
torchaudio==0.13.1
tqdm==4.66.2
traitlets==5.14.2
transformers==4.27.4
txtsplit==1.0.0
typer==0.9.0
typing-extensions==4.10.0
tzdata==2024.1
tzlocal==5.2
Unidecode==1.3.7
unidic-lite==1.0.8
unidic==1.1.0
urllib3==1.26.18
uvicorn==0.29.0
wasabi==0.10.1
wcwidth==0.2.13
websockets==11.0.3
Werkzeug==3.0.1
wheel==0.43.0
wrapt==1.16.0
zipp==3.18.1
```

AngelGuevara7 commented 3 months ago

In my case, the training script works with torch 2.0.1+cu118, NVIDIA driver 530, and an RTX 4090. Have you tried changing the torch version?

johnPertoft commented 3 months ago

I used the version pinned by the authors but I will try upgrading then. Thanks!

johnPertoft commented 2 months ago

Just an update on this: it does seem to work fine with the latest PyTorch version. It would be good if the authors could weigh in on whether the `torch<2.0` pin listed in the requirements is intended, and whether there are any implications of using a 2.x version of torch.

@Zengyi-Qin

myshell-ai / MeloTTS

cuFFT error: CUFFT_INTERNAL_ERROR from training script #80