mims-harvard / UniTS

A unified multi-task time series model.
https://zitniklab.hms.harvard.edu/projects/UniTS/
MIT License

RuntimeError: Distributed package doesn't have NCCL built in #9

Open Balu027 opened 3 months ago

Balu027 commented 3 months ago

Hello,

I tried to run UniTS_supervised with all default settings as an initial test, but I got the error below. It seems that Torch is missing something, but I didn't see NCCL mentioned anywhere; I just installed everything in requirements.txt. I tried to install NCCL, but it appears to be Linux-only. Do you have any idea how to solve this on Windows 10?

C:\Users\comp\UniTS>bash ./scripts/supervised_learning/UniTS_supervised.sh
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [comp]:4223 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [comp]:4223 (system error: 10049 - The requested address is not valid in its context.).
C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\gluonts\json.py:101: UserWarning: Using json-module for json-handling. Consider installing one of orjson, ujson to speed up serialization and deserialization.
  warnings.warn(
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [comp]:4223 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [comp]:4223 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "C:\Users\comp\UniTS\run.py", line 114, in <module>
    init_distributed_mode(args)
  File "C:\Users\comp\UniTS\utils\ddp.py", line 31, in init_distributed_mode
    dist.init_process_group(
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 85344) of binary: C:\Users\comp\AppData\Local\Programs\Python\Python310\python.exe
Traceback (most recent call last):
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\comp\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_11:31:28
  host      : comp
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 85344)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

gasvn commented 3 months ago

We have only tested our code on Linux. This looks like an NCCL environment problem.
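
For anyone hitting the same error on Windows: the traceback shows `dist.init_process_group` being asked for the NCCL backend, which PyTorch's Windows builds do not include. PyTorch also ships a Gloo backend that does work on Windows, so one possible workaround is to select the backend at runtime. The following is a minimal sketch, not the UniTS authors' fix; `pick_backend` and `detect_backend` are hypothetical helper names, and whether the rest of UniTS runs correctly on Windows with Gloo is untested here.

```python
import sys


def pick_backend(nccl_available: bool, cuda_available: bool,
                 platform: str = sys.platform) -> str:
    """Return "nccl" only where it can actually be used, otherwise "gloo".

    Hypothetical helper for illustration; not part of UniTS or PyTorch.
    """
    # Windows builds of PyTorch do not include NCCL, so always use Gloo there.
    if platform.startswith("win"):
        return "gloo"
    # NCCL also needs CUDA devices; fall back to Gloo on CPU-only machines.
    if nccl_available and cuda_available:
        return "nccl"
    return "gloo"


def detect_backend() -> str:
    """Query the installed PyTorch build and choose a backend."""
    # Imported lazily so pick_backend() can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    return pick_backend(dist.is_nccl_available(), torch.cuda.is_available())
```

The result of `detect_backend()` could then be passed as the `backend` argument of the `dist.init_process_group(...)` call in `utils/ddp.py` in place of a hard-coded NCCL backend, assuming that is where the backend is set.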