deepspeed单机多卡lora 70b oom

bryant03 commented 2 months ago

Reminder

[X] I have read the README and searched the existing issues.

System Info

absl-py 1.4.0 accelerate 0.32.0 addict 2.4.0 aiofiles 22.1.0 aiohttp 3.8.4 aiosignal 1.3.1 aiosqlite 0.18.0 aliyun-python-sdk-core 2.15.0 aliyun-python-sdk-kms 2.16.2 altair 5.0.1 annotated-types 0.7.0 anyio 3.6.2 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 arrow 1.2.3 asttokens 2.2.1 async-timeout 4.0.2 attrs 22.2.0 Babel 2.12.1 backcall 0.2.0 backports.zoneinfo 0.2.1 beautifulsoup4 4.11.2 bitsandbytes 0.39.0 bleach 6.0.0 blinker 1.7.0 boto3 1.29.1 botocore 1.32.1 braceexpand 0.1.7 brotlipy 0.7.0 cachetools 5.3.0 certifi 2021.5.30 cffi 1.14.6 chardet 4.0.0 charset-normalizer 3.1.0 click 8.1.3 cmake 3.26.0 comm 0.1.2 conda 4.10.3 conda-package-handling 1.7.3 contourpy 1.0.7 cpm-kernels 1.0.11 crcmod 1.7 cryptography 3.4.7 cycler 0.11.0 datasets 2.18.0 debugpy 1.6.6 decorator 5.1.1 deepspeed 0.14.4 defusedxml 0.7.1 dill 0.3.6 distro 1.9.0 docstring-parser 0.15 einops 0.7.0 et-xmlfile 1.1.0 executing 1.2.0 fastapi 0.112.2 fastjsonschema 2.16.3 ffmpy 0.3.0 filelock 3.10.0 fire 0.6.0 fonttools 4.39.0 fqdn 1.5.1 frozenlist 1.3.3 fsspec 2023.6.0 gast 0.5.4 gitdb 4.0.11 GitPython 3.1.42 google-auth 2.16.2 google-auth-oauthlib 0.4.6 gradio 4.42.0 gradio_client 1.3.0 grpcio 1.51.3 h11 0.14.0 hjson 3.1.0 httpcore 0.17.2 httpx 0.24.1 huggingface-hub 0.24.6 idna 2.10 importlib-metadata 7.0.1 importlib-resources 5.12.0 ipykernel 6.21.3 ipython 8.11.0 ipython-genutils 0.2.0 ipywidgets 8.0.4 isoduration 20.11.0 jedi 0.18.2 jieba 0.42.1 Jinja2 3.1.2 jiter 0.5.0 jmespath 0.10.0 joblib 1.3.1 json5 0.9.11 jsonpointer 2.3 jsonschema 4.17.3 jupyter_client 8.0.3 jupyter_core 5.2.0 jupyter-events 0.6.3 jupyter_server 2.4.0 jupyter_server_fileid 0.8.0 jupyter_server_terminals 0.4.4 jupyter_server_ydoc 0.6.1 jupyter-ydoc 0.2.3 jupyterlab 3.6.1 jupyterlab-language-pack-zh-CN 3.6.post0 jupyterlab-pygments 0.2.2 jupyterlab_server 2.20.0 jupyterlab-widgets 3.0.5 kiwisolver 1.4.4 latex2mathml 3.76.0 linkify-it-py 2.0.2 lit 15.0.7 llamafactory 0.8.4.dev0 /root/autodl-tmp/LLaMA-Factory-0825/src llmtuner 0.7.1 loguru 0.7.0 Markdown 3.4.1 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.7.1 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.3 mdtex2html 1.2.0 mdurl 0.1.2 mistune 2.0.5 modelscope 1.12.0 mpi4py 4.0.0 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 nbclassic 0.5.3 nbclient 0.7.2 nbconvert 7.2.10 nbformat 5.7.3 nest-asyncio 1.5.6 networkx 3.0 ninja 1.11.1 nltk 3.8.1 notebook 6.5.3 notebook_shim 0.2.2 Nuitka 1.6.5 numpy 1.24.2 nvidia-cublas-cu11 11.10.3.66 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.5.0.96 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu11 10.9.0.58 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu11 10.2.10.91 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu11 11.7.4.91 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu11 2.14.3 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.20 nvidia-nvtx-cu11 11.7.91 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.42.0 openpyxl 3.1.2 ordered-set 4.1.0 orjson 3.9.1 oss2 2.18.4 packaging 23.0 pandas 2.0.2 pandocfilters 1.5.0 parso 0.8.3 peft 0.12.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 10.0.1 pip 24.2 pkgutil_resolve_name 1.3.10 platformdirs 4.2.0 prometheus-client 0.16.0 prompt-toolkit 3.0.38 protobuf 4.23.3 psutil 5.9.4 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 12.0.1 pyarrow-hotfix 0.6 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycosat 0.6.3 pycparser 2.20 pycryptodome 3.20.0 pydantic 2.8.2 pydantic_core 2.20.1 pydeck 0.8.1b0 pydub 0.25.1 Pygments 2.14.0 pynvml 11.5.3 pyOpenSSL 20.0.1 pyparsing 3.0.9 pyrsistent 0.19.3 PySocks 1.7.1 python-dateutil 2.8.2 python-json-logger 2.0.7 python-multipart 0.0.9 pytz 2022.7.1 PyYAML 6.0 pyzmq 25.0.1 regex 2023.6.3 requests 2.32.3 requests-oauthlib 1.3.1 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.0 rouge-chinese 1.0.3 rsa 4.9 ruamel-yaml-conda 0.15.100 ruff 0.6.2 s3transfer 0.7.0 safetensors 0.4.2 scikit-learn 1.3.2 scipy 1.10.1 seaborn 0.13.0 semantic-version 2.10.0 Send2Trash 1.8.0 sentencepiece 0.1.99 setuptools 74.0.0 shellingham 1.5.4 shtab 1.7.0 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.4 sse-starlette 1.6.1 stack-data 0.6.2 starlette 0.38.2 streamlit 1.31.1 streamlit-chat 0.1.1 supervisor 4.2.5 SwissArmyTransformer 0.4.8 sympy 1.11.1 tenacity 8.2.3 tensorboard 2.12.0 tensorboard-data-server 0.7.0 tensorboard-plugin-wit 1.8.1 tensorboardX 2.6.2.2 termcolor 2.4.0 terminado 0.17.1 threadpoolctl 3.5.0 tiktoken 0.5.1 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.0 torch 2.4.0 torchvision 0.14.1 tornado 6.2 tqdm 4.65.0 traitlets 5.9.0 transformers 4.43.4 transformers-stream-generator 0.0.4 triton 3.0.0 trl 0.9.6 typer 0.12.5 typing_extensions 4.12.2 tyro 0.7.3 tzdata 2023.3 tzlocal 5.2 uc-micro-py 1.0.2 uri-template 1.2.0 urllib3 2.2.2 uvicorn 0.22.0 validators 0.22.0 watchdog 4.0.0 wcwidth 0.2.6 webcolors 1.12 webdataset 0.2.77 webencodings 0.5.1 websocket-client 1.5.1 websockets 11.0.3 Werkzeug 2.2.3 wheel 0.44.0 widgetsnbextension 4.0.5 xxhash 3.2.0 y-py 0.5.9 yapf 0.40.2 yarl 1.9.2 ypy-websocket 0.8.2 zipp 3.15.0 zstandard 0.21.0

Reproduction

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3-70b.yaml

yaml文件内容如下：

model_name_or_path: /root/autodl-tmp/LLM-Research/Meta-Llama-3___1-70B-Instruct

stage: sft do_train: true finetuning_type: lora lora_target: all deepspeed: examples/deepspeed/ds_z3_config.json

dataset: suicide_train_data template: llama3 cutoff_len: 1024 max_samples: 1000 overwrite_cache: true preprocessing_num_workers: 16

output_dir: saves/llama3-70b/lora/sft logging_steps: 10 save_steps: 10 plot_loss: true overwrite_output_dir: true

per_device_train_batch_size: 1 gradient_accumulation_steps: 2 learning_rate: 1.0e-4 num_train_epochs: 3.0 lr_scheduler_type: cosine warmup_ratio: 0.1 bf16: true ddp_timeout: 180000000

examples/deepspeed/ds_z3_config.json文件内容如下 { "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "zero_allow_untested_optimizer": true, "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }

Expected behavior

依然显存溢出，按理说A800*2=160G，微调70b模型没问题，但过程如下： /root/miniconda3/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}") [2024-08-29 01:04:40,056] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4 [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible /root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. def forward(ctx, input, weight, bias=None): /root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. def backward(ctx, grad_output): Traceback (most recent call last): File "/root/miniconda3/bin/llamafactory-cli", line 8, in sys.exit(main()) File "/root/miniconda3/lib/python3.8/site-packages/llmtuner/cli.py", line 75, in main raise NotImplementedError("Unknown command: {}".format(command)) NotImplementedError: Unknown command: env root@autodl-container-1c3542b085-2c059f9b:~/autodl-tmp/LLaMA-Factory-0825# FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3-70b.yaml /root/miniconda3/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}") [2024-08-29 01:18:17,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4 [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible /root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. def forward(ctx, input, weight, bias=None): /root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. def backward(ctx, grad_output): 08/29/2024 01:18:20 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 2, distributed training: False, compute dtype: torch.bfloat16 [INFO|tokenization_utils_base.py:2287] 2024-08-29 01:18:20,537 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2287] 2024-08-29 01:18:20,537 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2287] 2024-08-29 01:18:20,537 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2287] 2024-08-29 01:18:20,537 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2533] 2024-08-29 01:18:20,801 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 08/29/2024 01:18:20 - INFO - llmtuner.data.template - Replace eos token: <|eot_id|> 08/29/2024 01:18:20 - INFO - llmtuner.data.template - Add pad token: <|eot_id|> 08/29/2024 01:18:20 - INFO - llmtuner.data.loader - Loading dataset suicide_train_data.json... Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████| 400/400 [00:00<00:00, 2656.59 examples/s] Running tokenizer on dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████| 400/400 [00:02<00:00, 142.72 examples/s] input_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 1999, 1463, 264, 27728, 1772, 1139, 3116, 18639, 5326, 5990, 25, 21070, 11, 2679, 367, 11, 7865, 11, 323, 4879, 627, 40, 617, 5675, 4395, 3062, 311, 757, 13, 2052, 358, 656, 374, 6678, 856, 4885, 3201, 11, 4395, 358, 2019, 44164, 1124, 13, 358, 649, 956, 656, 4205, 1314, 13, 358, 649, 956, 387, 264, 1695, 4333, 13, 358, 649, 956, 387, 264, 1695, 1716, 311, 856, 6699, 13, 358, 649, 956, 387, 264, 1695, 6800, 3124, 311, 856, 83777, 13, 358, 649, 956, 387, 264, 1695, 3187, 369, 856, 14992, 61007, 13, 358, 649, 956, 387, 264, 1695, 5575, 369, 856, 13639, 13, 358, 649, 956, 387, 1695, 520, 11039, 3953, 13, 358, 649, 956, 387, 1695, 520, 22019, 3953, 13, 358, 649, 956, 387, 1695, 520, 1989, 13, 358, 649, 956, 387, 1695, 520, 4205, 1606, 358, 2846, 9615, 29948, 11, 682, 358, 2846, 5505, 369, 374, 1694, 264, 1886, 438, 8334, 13, 358, 649, 956, 1524, 387, 264, 1695, 8334, 1606, 358, 2846, 1380, 13, 358, 649, 956, 656, 4205, 1314, 13, 358, 3077, 3940, 311, 1456, 323, 6917, 588, 7182, 11, 520, 3325, 430, 1749, 374, 29493, 323, 2834, 956, 387, 439, 26175, 439, 21363, 7182, 11, 358, 4265, 10932, 3515, 2680, 311, 264, 6166, 3582, 11, 520, 3325, 430, 1053, 387, 39392, 11, 6784, 1752, 11, 323, 810, 25114, 13, 358, 4265, 617, 264, 2731, 2035, 304, 23070, 11, 520, 3325, 358, 3358, 387, 3025, 311, 1518, 856, 18233, 1578, 13, 2052, 279, 1520, 358, 1436, 636, 3250, 956, 1524, 1390, 311, 3137, 311, 757, 14926, 13, 358, 2846, 264, 77741, 13, 358, 649, 956, 656, 4205, 1314, 358, 2846, 779, 29948, 11, 422, 358, 2567, 78501, 7182, 358, 3358, 520, 3325, 387, 5710, 555, 279, 842, 315, 279, 2046, 13, 358, 1120, 1390, 856, 4648, 311, 2586, 10819, 4708, 18433, 701, 4885, 323, 3070, 1053, 3194, 499, 6, 374, 539, 264, 1695, 2944, 311, 2567, 5496, 13, 128009, 128006, 78191, 128007, 271, 56966, 128009] inputs: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Classify a Reddit post into four suicide risk levels: indicator, ideation, behavior, and attempt. I have lost everything important to me. All I do is drive my friends away, everything I say hurts them. I can't do anything right. I can't be a good friend. I can't be a good child to my parents. I can't be a good grandchild to my grandma. I can't be a good example for my younger cousins. I can't be a good student for my teachers. I can't be good at fighting games. I can't be good at racing games. I can't be good at art. I can't be good at anything because I'm truly useless, all I'm useful for is being a errand boy. I can't even be a good boy because I'm trans. I can't do anything right. I've started to try and starve myself, at least that method is slower and won't be as painful as hanging myself, I'd prefer having access to a gun though, at least that would be quicker, painless, and more deadly. I'd have a better place in heaven, at least I'll be able to see my dad again. All the help I could get doesn't even want to talk to me anymore. I'm a nuisance. I can't do anything right I'm so useless, if I keep starving myself I'll at least be dead by the end of the week. I just want my death to come faster''Because your friends and family would miss you' is not a good reason to keep living.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

behavior<|eot_id|> label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 56966, 128009] labels: behavior<|eot_id|> [INFO|configuration_utils.py:731] 2024-08-29 01:18:25,299 >> loading configuration file /root/autodl-tmp/LLM-Research/Meta-Llama-3_1-70B-Instruct/config.json [INFO|configuration_utils.py:800] 2024-08-29 01:18:25,301 >> Model config LlamaConfig { "_name_orpath": "/root/autodl-tmp/LLM-Research/Meta-Llama-31-70B-Instruct", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 28672, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 64, "num_hidden_layers": 80, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 8.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 128256 }

[INFO|modeling_utils.py:3641] 2024-08-29 01:18:25,321 >> loading weights file /root/autodl-tmp/LLM-Research/Meta-Llama-3___1-70B-Instruct/model.safetensors.index.json [INFO|modeling_utils.py:1572] 2024-08-29 01:18:25,322 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [INFO|modeling_utils.py:3786] 2024-08-29 01:18:25,322 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [2024-08-29 01:18:25,323] [INFO] [comm.py:637:init_distributed] cdb=None [2024-08-29 01:18:25,323] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment... [2024-08-29 01:18:25,706] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.17.0.5, master_port=29500 [2024-08-29 01:18:25,706] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [INFO|configuration_utils.py:1038] 2024-08-29 01:18:25,717 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ] }

[2024-08-29 01:18:26,527] [INFO] [partition_parameters.py:345:exit] finished initializing model - num_params = 429, num_elems = 41.65B rank0: Traceback (most recent call last): rank0: File "/root/miniconda3/bin/llamafactory-cli", line 8, in

rank0: File "/root/miniconda3/lib/python3.8/site-packages/llmtuner/cli.py", line 65, in main

rank0: File "/root/miniconda3/lib/python3.8/site-packages/llmtuner/train/tuner.py", line 33, in run_exp rank0: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks) rank0: File "/root/miniconda3/lib/python3.8/site-packages/llmtuner/train/sft/workflow.py", line 34, in run_sft rank0: model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train) rank0: File "/root/miniconda3/lib/python3.8/site-packages/llmtuner/model/loader.py", line 135, in load_model rank0: model = AutoModelForCausalLM.from_pretrained(init_kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained rank0: return model_class.from_pretrained( rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3798, in from_pretrained rank0: model = cls(config, *model_args, *model_kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper rank0: f(module, args, kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in init rank0: self.model = LlamaModel(config) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper rank0: f(module, *args, kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 845, in init rank0: LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers): File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 845, in rank0: LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers): File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper rank0: f(module, *args, *kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 634, in init rank0: self.mlp = LlamaMLP(config) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper rank0: f(module, args, kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 230, in init rank0: self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 516, in wrapper

rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1081, in _post_init_method

rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1040, in _zero_init_param

rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1375, in partition rank0: self._partition(param_list, has_been_updated=has_been_updated) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1524, in _partition rank0: self._partition_param(param, has_been_updated=has_been_updated) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn rank0: ret_val = func(*args, *kwargs) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1589, in _partition_param rank0: partitioned_tensor = torch.empty(partition_size, dtype=param.dtype, device=device) rank0: File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 240, in wrapped_fn rank0: tensor: Tensor = fn(args, **kwargs) rank0: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 42.75 MiB is free. Process 666325 has 79.09 GiB memory in use. Of the allocated memory 77.58 GiB is allocated by PyTorch, and 724.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) rank0:[W829 01:18:27.645524036 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())