togethercomputer / OpenChatKit

Apache License 2.0

ProcessGroupGloo RuntimeError: Socket Timeout #95

Closed susery closed 1 year ago

susery commented 1 year ago

bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh

Traceback (most recent call last):
  File "/home/xtr/git_data/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/xtr/git_data/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/home/xtr/git_data/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/home/xtr/git_data/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=2*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/usr/local/python3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 994, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout


susery commented 1 year ago

This is my python lib list:

Package Version
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
alembic 1.10.3
altair 4.2.2
anyio 3.6.2
appdirs 1.4.4
async-timeout 4.0.2
attrs 22.2.0
banal 1.0.6
bitsandbytes 0.38.1
bottle 0.12.25
bz2file 0.98
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.3
cmake 3.25.0
contourpy 1.0.7
cpm-kernels 1.0.11
cupy-cuda12x 12.0.0
cycler 0.11.0
datasets 2.11.0
dill 0.3.6
docker-pycreds 0.4.0
entrypoints 0.4
faiss-gpu 1.7.2
fastapi 0.95.0
fastrlock 0.8.1
ffmpy 0.3.0
filelock 3.9.0
fonttools 4.39.3
frozenlist 1.3.3
fsspec 2023.4.0
gitdb 4.0.10
GitPython 3.1.31
gradio 3.25.0
gradio_client 0.1.0
greenlet 2.0.2
h11 0.14.0
httpcore 0.17.0
httpx 0.24.0
huggingface-hub 0.13.4
idna 3.4
Jinja2 3.1.2
jsonschema 4.17.3
kiwisolver 1.4.4
latex2mathml 3.75.2
linkify-it-py 2.0.0
lit 15.0.7
loguru 0.7.0
Mako 1.2.4
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdtex2html 1.2.0
mdurl 0.1.2
mpmath 1.2.1
multidict 6.0.4
multiprocess 0.70.14
netifaces 0.11.0
networkx 3.0
numpy 1.24.1
orjson 3.8.10
packaging 23.1
pandas 2.0.0
pathtools 0.1.2
Pillow 9.3.0
pip 23.0.1
prompt-toolkit 3.0.38
protobuf 4.22.3
psutil 5.9.4
pyarrow 11.0.0
pydantic 1.10.7
pydub 0.25.1
pyparsing 3.0.9
pyrsistent 0.19.3
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
regex 2023.3.23
requests 2.28.1
responses 0.18.0
rwkv 0.7.3
semantic-version 2.10.0
sentencepiece 0.1.98
sentry-sdk 1.19.1
setproctitle 1.3.2
setuptools 65.5.0
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
SQLAlchemy 1.4.47
SQLAlchemy-Utils 0.41.0
starlette 0.26.1
sympy 1.11.1
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.0+cu118
torchaudio 2.0.1+cu118
torchsummary 1.5.1
torchvision 0.15.1+cu118
tqdm 4.65.0
transformers 4.27.1
triton 2.0.0
typing_extensions 4.4.0
tzdata 2023.3
uc-micro-py 1.0.1
urllib3 1.26.13
uvicorn 0.21.1
wandb 0.14.2
wcwidth 0.2.6
websockets 11.0.1
Whoosh 2.7.4
xxhash 3.2.0
yarl 1.8.2
zstandard 0.20.0

susery commented 1 year ago

GPT-NeoXT-Chat-Base-20B is not executing on a single-GPU machine.
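
If the timeout happens because the script starts more ranks than the machine has GPUs (that is only a reading of the comment above, not something confirmed in this thread), a pre-flight check could fail fast instead of waiting out the 2-minute Gloo timeout. The sketch below is a rough illustration: `count_visible_gpus` and `preflight` are hypothetical helper names, not part of OpenChatKit, and the one-GPU-per-rank rule is an assumption.

```python
def count_visible_gpus() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if torch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def preflight(world_size: int, visible_gpus: int) -> None:
    """Raise early when there are fewer GPUs than ranks.

    Assumption: every rank needs its own GPU. This is a hypothetical check,
    not something the training script does itself.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if visible_gpus < world_size:
        raise RuntimeError(
            f"world_size={world_size} but only {visible_gpus} GPU(s) are visible; "
            "the missing ranks would never join the process group"
        )
```

It would presumably be called as `preflight(args.world_size, count_visible_gpus())` before `init_communicators(args)`, so a mismatch shows up immediately rather than as a `Socket Timeout`.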