No rendezvous handler - Githubissues

siaxace commented 1 year ago

Hello, I've been trying to fine-tune on a dataset using your train.py, but unfortunately I've run into an issue:

2023-03-01 12:01:35.122417: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2023-03-01 12:01:35.122562: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-01 12:01:40.269370: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2023-03-01 12:01:40.269500: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\train.py", line 195, in <module>
    mp.spawn(
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 157, in start_processes
    while not context.join():
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 118, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 19, in _wrap
    fn(i, *args)
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\train.py", line 38, in main
    setup(rank, world_size)
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\train.py", line 28, in setup
    dist.init_process_group("gloo", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=3600 * 5))
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
    rendezvous_iterator = rendezvous(
  File "C:\Users...\PycharmProjects\Speech2Text\ASR-wav2vec-finetune\venv\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

I'd appreciate your help in advance.

khanld commented 1 year ago

Hi, I have not run into this issue before. It may be due to one of the following:

No GPU device, or CUDA is not installed yet.
A bug in the environment; I recommend using Anaconda.

Hope this helps!

siaxace commented 1 year ago

Thanks for your reply, I'll look into it, but so far my suspicion is that the env:// rendezvous handler is not supported on Windows.

From site-packages\torch\distributed\rendezvous.py:

    ...
    if sys.platform != 'win32':
        register_rendezvous_handler("tcp", _tcp_rendezvous_handler)
        register_rendezvous_handler("env", _env_rendezvous_handler)
    ...

siaxace commented 1 year ago

So I somehow managed to fine-tune on my dataset (by training on the CPU only), but now I'm facing another issue. After monitoring the train_loss values, I realized that I had to reduce the number of 'epochs' (in config.toml, from the default of '20' to '4' in my case) to avoid overfitting. Now, looking at my files, nothing except logs gets saved in the saved/ directory, and I'm not sure where to look for my newly trained model. I should also mention that when I ran train.py for the first time (with 'epochs' = 20), some .zip files were created in saved/checkpoints/, but not anymore. Again, I appreciate your help.

khanld commented 1 year ago

Only the best and latest checkpoints are saved in the saved//checkpoints/.tar. I also keep the best model in the Hugging Face format in hunggingface-hub/pytorch_model.bin. You can look at the _save_checkpoint function in base/base_trainer to understand how it works. Could you try deleting the saved directory and retraining the model to see if the issue persists? I will take a look at the code because I have not rerun it for a long time; maybe there are some bugs ^^

siaxace commented 1 year ago

The way I understand it, the model, checkpoints, etc. are only saved after a validation run, and not necessarily when training finishes. So in my case, changing 'validation_interval' (in config.toml, from 500 to 226) did the trick. Thanks!
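To illustrate that reading, here is a small sketch of which training steps would trigger a validation pass, and therefore a checkpoint save, for a given interval. It is an assumption-laden model, not the trainer's actual logic: `checkpoint_steps` is a hypothetical helper, the step counter is assumed to run continuously over the whole training run, and the total of 400 steps is made up for the example. Only the interval values 500 and 226 come from this thread.

```python
def checkpoint_steps(total_steps, validation_interval):
    """List the steps at which validation (and thus saving) would happen.

    Assumes validation runs every `validation_interval` steps and that
    nothing is saved outside of validation.
    """
    if validation_interval <= 0:
        raise ValueError("validation_interval must be positive")
    return list(range(validation_interval, total_steps + 1, validation_interval))


# Hypothetical short run of 400 steps in total.
print(checkpoint_steps(400, 500))  # interval longer than the run: no saves
print(checkpoint_steps(400, 226))  # one validation pass, so one save
```

Under these assumptions, a run shorter than the validation interval finishes without ever saving, which matches the empty saved/checkpoints/ directory described above.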

Shaobo-Z commented 1 year ago

Only CPU, not GPU (since apparently there's an issue with Ubuntu running on GPUs)

Yeah, I managed to make it run on the CPU now. I removed "torch==1.7.1" from requirement.txt and installed PyTorch manually, for example PyTorch 2.0.1 with:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

After that, the code worked.

However, after the training, I got almost nothing in the "saved/" directory, an empty checkpoints folder, 2 logs in log_dir and 1 .toml. Are you getting the same result? Where should I find the model that I trained? (BTW, for this repo, is it the same as the Fine Tune?)

Also, I think something went wrong with my train_wer. I only get 1.0 for all 20 epochs. Is it the same for you?

1/1 [==========] - 16s 16s/step - train_loss: 30.0563 - train_lr: 1.3167e-06 - train_grad_norm: 20.8353 - train_wer: 1.0000

Shaobo-Z commented 1 year ago

BTW, are you using an English training dataset to fine-tune the model?

siaxace commented 1 year ago

No, but you can fine-tune your base model (whether English or not) with good datasets.

khanld / ASR-Wav2vec-Finetune

No rendezvous handler #5