Can't fine-tune a model on my dataset in Google Colab

yukiarimo commented 6 months ago

🐛 Describe the bug

I ran the following code:

git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download
cd melo
python preprocess_text.py --metadata all.list
bash train.sh config.json 1

My all.list example:

wavs/29.wav|EN-default|EN|Well, she looks exactly like the one I read about in the book, except she isn't violent at all. Hahaha.
wavs/15.wav|EN-default|EN|It's kind of a rare monster, it's incredibly ferocious!
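
Each line follows the `path|speaker|language|text` layout. Below is a minimal sketch for sanity-checking that layout before running preprocess_text.py. `Entry` and `parse_metadata_line` are hypothetical helper names used only for illustration; they are not part of MeloTTS.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of all.list (hypothetical helper type, not part of MeloTTS)."""
    wav_path: str
    speaker: str
    language: str
    text: str


def parse_metadata_line(line: str) -> Entry:
    """Split a 'path|speaker|language|text' line into its four fields.

    maxsplit=3 keeps any '|' characters that appear inside the text itself.
    """
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    return Entry(*parts)


if __name__ == "__main__":
    sample = "wavs/15.wav|EN-default|EN|It's kind of a rare monster, it's incredibly ferocious!"
    print(parse_metadata_line(sample))
```

This only checks the field count; it does not verify that the audio files exist or are readable.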

Log after running train.sh:

...
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
2024-04-28 17:58:31.390776: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-28 17:58:31.390837: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-28 17:58:31.392462: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-28 17:58:32.734241: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-28 17:58:33.501 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████| 96/96 [00:00<00:00, 21500.06it/s]
2024-04-28 17:58:33.507 | INFO     | data_utils:_filter:84 - min: 1870; max: 1871
2024-04-28 17:58:33.507 | INFO     | data_utils:_filter:85 - skipped: 9, total: 96
Bucket warning  
buckets: []
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-04-28 17:58:33.508 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████| 4/4 [00:00<00:00, 20763.88it/s]
2024-04-28 17:58:33.509 | INFO     | data_utils:_filter:84 - min: 1870; max: 1871
2024-04-28 17:58:33.509 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([10, 192]), torch.Size([8, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
0it [00:00, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
0it [00:00, ?it/s]
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
...

How to fix this? Am I doing something wrong?

Versions

Collecting environment information...

Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
BogoMIPS: 4000.35
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchdata==0.7.1
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.17.1
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.2.0
[conda] Could not collect

28065467 commented 6 months ago

Same problem, but on Ubuntu.

farconada commented 5 months ago

Same problem inside the Docker image.

s-tweed commented 5 months ago

It was a few days ago, but I believe I had the same problem on Windows 11 with Python 3.10. I was able to run inference with the pretrained weights, but after spending hours tweaking the training script I was only able to do 1 iteration in torchrun, and basically nothing happened. Would love to be able to fine-tune...

olgakuak commented 5 months ago

Same issue on Ubuntu with Python 3.9.

yukiarimo commented 4 months ago

@s-tweed Any updates?

rgxb2807 commented 2 months ago

Also hitting a similar issue using the Docker image.

Dolyfin commented 1 month ago

I had this issue. I believe resampling from 48000 Hz to 44100 Hz fixed it for me. Possibly check whether your dataset is at the correct sample rate.
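
A minimal sketch for spotting clips at the wrong rate, using only the Python standard library. It assumes the clips are uncompressed PCM WAV files in a single directory and that 44100 Hz is the rate your training config expects (that is the rate mentioned above; check your own config.json). `find_rate_mismatches` is a hypothetical helper name, not a MeloTTS function.

```python
import wave
from pathlib import Path


def wav_sample_rate(path) -> int:
    """Return the sample rate stored in a PCM WAV file's header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getframerate()


def find_rate_mismatches(wav_dir, expected_rate: int = 44100) -> dict:
    """Map file name -> actual rate for every *.wav whose rate differs.

    expected_rate=44100 is an assumption; use the value from your config.
    """
    mismatches = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        rate = wav_sample_rate(path)
        if rate != expected_rate:
            mismatches[path.name] = rate
    return mismatches


if __name__ == "__main__":
    # "wavs" matches the directory used in the all.list example above.
    for name, rate in find_rate_mismatches("wavs").items():
        print(f"{name}: {rate} Hz")
```

This only reports mismatches; the resampling itself still has to be done with an audio tool of your choice.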

julia-imlauer commented 1 month ago

You also might want to check how many channels the files have. I had an issue with stereo files; after converting them to mono, the training started without errors.
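
A rough sketch of a stereo-to-mono conversion using only the standard library, assuming 16-bit PCM WAV input. It averages the left and right samples, which is one common way to downmix and not necessarily what any particular audio tool does. `stereo_to_mono` is a hypothetical helper name.

```python
import array
import sys
import wave


def stereo_to_mono(src: str, dst: str) -> int:
    """Write a mono copy of src to dst; return the source channel count.

    Assumes 16-bit PCM. Mono input is copied through unchanged.
    """
    with wave.open(src, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())

    if channels == 2:
        samples = array.array("h")
        samples.frombytes(frames)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        # Samples are interleaved L, R, L, R, ...; average each pair.
        mono = array.array(
            "h",
            ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)),
        )
        if sys.byteorder == "big":
            mono.byteswap()
        frames = mono.tobytes()
    elif channels != 1:
        raise ValueError(f"unsupported channel count: {channels}")

    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(frames)
    return channels
```

Writing to a new path and keeping the originals makes it easy to compare the two versions before retraining.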

myshell-ai / MeloTTS

Can't fine-tune a model on my dataset in Google Colab #116
