Closed: qingqiuhe closed this issue 1 year ago
The run exits with return code -7, but no error details are printed.
```
# pip list
Package Version
------------------------- -----------
absl-py 1.4.0
accelerate 0.21.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.1
anyio 3.7.1
appdirs 1.4.4
asttokens 2.0.5
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.2.0
backcall 0.2.0
beautifulsoup4 4.11.1
bitsandbytes 0.37.1
black 23.3.0
brotlipy 0.7.0
cachetools 5.3.1
certifi 2022.12.7
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.1.4
conda 23.1.0
conda-build 3.23.3
conda-content-trust 0.1.3
conda-package-handling 2.0.2
conda_package_streaming 0.7.0
contourpy 1.1.0
cryptography 39.0.1
cycler 0.11.0
datasets 2.14.1
decorator 5.1.1
deepspeed 0.9.5
dill 0.3.6
dnspython 2.3.0
exceptiongroup 1.1.1
executing 0.8.3
expecttest 0.1.4
fastapi 0.95.1
ffmpy 0.3.0
filelock 3.9.0
fire 0.5.0
flit_core 3.6.0
fonttools 4.40.0
frozenlist 1.3.3
fsspec 2023.6.0
glob2 0.7
gmpy2 2.1.2
google-auth 2.22.0
google-auth-oauthlib 1.0.0
gradio 3.39.0
gradio_client 0.3.0
grpcio 1.56.0
h11 0.14.0
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.1
huggingface-hub 0.16.4
hypothesis 6.70.0
idna 3.4
ipython 8.10.0
jedi 0.18.1
jieba 0.42.1
Jinja2 3.1.2
joblib 1.3.1
jsonschema 4.18.0
jsonschema-specifications 2023.6.1
kiwisolver 1.4.4
libarchive-c 2.9
linkify-it-py 2.0.2
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.1
matplotlib 3.7.2
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
networkx 3.0
ninja 1.11.1
nltk 3.8.1
numpy 1.23.5
oauthlib 3.2.2
orjson 3.9.2
packaging 23.1
pandas 1.5.3
parso 0.8.3
pathspec 0.11.1
peft 0.4.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 22.3.1
pkginfo 1.8.3
platformdirs 3.8.1
pluggy 1.0.0
prompt-toolkit 3.0.36
protobuf 4.23.4
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 12.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycosat 0.6.4
pycparser 2.21
pydantic 1.10.11
pydub 0.25.1
Pygments 2.11.2
pyOpenSSL 23.0.0
pyparsing 3.0.9
PySocks 1.7.1
python-dateutil 2.8.2
python-etcd 0.4.5
python-multipart 0.0.6
pytz 2022.7
PyYAML 6.0
referencing 0.29.1
regex 2023.6.3
requests 2.28.1
requests-oauthlib 1.3.1
responses 0.18.0
rouge-chinese 1.0.3
rpds-py 0.8.10
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
safetensors 0.3.1
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 65.6.3
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.3.2.post1
sse-starlette 1.6.1
stack-data 0.2.0
starlette 0.26.1
sympy 1.11.1
tensorboard 2.13.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
tokenize-rt 5.1.0
tokenizers 0.13.3
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.0.0
torchaudio 2.0.0
torchdata 0.6.0
torchelastic 0.2.2
torchtext 0.15.0
torchvision 0.15.0
tqdm 4.64.1
traitlets 5.7.1
transformers 4.30.1
triton 2.0.0
trl 0.4.7
types-dataclasses 0.6.6
typing_extensions 4.5.0
uc-micro-py 1.0.2
urllib3 1.26.14
uvicorn 0.22.0
wcwidth 0.2.5
websockets 11.0.3
Werkzeug 2.3.6
wheel 0.37.1
xxhash 3.2.0
yarl 1.9.2
zstandard 0.19.0
```
DeepSpeed Config:
```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "weight_decay": "auto",
      "betas": "auto",
      "eps": "auto",
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1000000000,
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": "auto",
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 20,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "dump_state": true
}
```
Run log:
```
root@22f6e415e3a4:/mnt/nfs207/mnt/disk2/LLaMA-Efficient-Tuning# deepspeed -i localhost:1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --model_name_or_path /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf \
    --do_train \
    --dataset alpaca_gpt4_en \
    --finetuning_type lora \
    --output_dir /tmp/output \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --fp16 \
    --prompt_template llama2 \
    --use_fast_tokenizer \
    --deepspeed ./zero_config.json
[2023-07-31 19:12:36,679] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[2023-07-31 19:12:40,172] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-31 19:12:40,172] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/train_bash.py --stage sft --model_name_or_path /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf --do_train --dataset alpaca_gpt4_en --finetuning_type lora --output_dir /tmp/output --overwrite_cache --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --logging_steps 1 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 1.0 --fp16 --prompt_template llama2 --use_fast_tokenizer --deepspeed ./zero_config.json
[2023-07-31 19:12:41,664] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[... the bitsandbytes BUG REPORT banner and CUDA SETUP warnings above are printed again by the launcher and once more by each of the 7 worker ranks; identical copies trimmed ...]

[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [1, 2, 3, 4, 5, 6, 7]}
[2023-07-31 19:12:43,035] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=7, node_rank=0
[2023-07-31 19:12:43,036] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6]})
[2023-07-31 19:12:43,036] [INFO] [launch.py:163:main] dist_world_size=7
[2023-07-31 19:12:43,036] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
[2023-07-31 19:12:46,898] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[... "Setting ds_accelerator to cuda" appears once per rank, 19:12:46,898 through 19:12:47,004 ...]

[2023-07-31 19:12:49,145] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,145] [INFO] [comm.py:594:init_distributed] cdb=None

[... the comm.py warning and "cdb=None" pair appears once per rank ...]

[2023-07-31 19:12:49,229] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True

[... the warning and "Process rank: N, device: cuda:N" lines repeat for ranks 1 through 6 ...]

07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=./zero_config.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/tmp/output/runs/Jul31_19-12-49_22f6e415e3a4,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
output_dir=/tmp/output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/tmp/output,
save_on_each_node=False,
save_safetensors=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)

[... the Seq2SeqTrainingArguments dump is printed once per rank; only local_rank (0 through 6) differs between copies ...]

07/31/2023 19:12:50 - INFO - llmtuner.dsets.loader - Loading dataset alpaca_gpt4_data_en.json...
/opt/conda/lib/python3.10/site-packages/datasets/load.py:2069: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
  warnings.warn(

[... the "Loading dataset" line and the FutureWarning repeat once per rank ...]

Using custom data configuration default-91f63e1f2acd5b18
Loading Dataset Infos from /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,319 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:667] 2023-07-31 19:12:51,385 >> loading configuration file /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf/config.json
[INFO|configuration_utils.py:725] 2023-07-31 19:12:51,386 >> Model config LlamaConfig {
  "_name_or_path": "/mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.30.1",
  "use_cache": true,
  "vocab_size": 32000
}

[INFO|modeling_utils.py:2575] 2023-07-31 19:12:51,405 >> loading weights file /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1173] 2023-07-31 19:12:51,406 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|modeling_utils.py:2669] 2023-07-31 19:12:51,406 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:577] 2023-07-31 19:12:51,410 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.30.1"
}

[2023-07-31 19:12:58,056] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1386
[2023-07-31 19:12:58,058] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1387
[2023-07-31 19:12:58,281] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1388
[2023-07-31 19:12:58,281] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1389
[2023-07-31 19:12:58,282] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1390
[2023-07-31 19:12:58,537] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1391
[2023-07-31 19:12:58,538] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1392
[2023-07-31 19:12:58,539] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'src/train_bash.py', '--local_rank=6', '--stage', 'sft', '--model_name_or_path', '/mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf', '--do_train', '--dataset', 'alpaca_gpt4_en', '--finetuning_type', 'lora', '--output_dir', '/tmp/output', '--overwrite_cache', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--save_steps', '1000', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--fp16', '--prompt_template', 'llama2', '--use_fast_tokenizer', '--deepspeed', './zero_config.json'] exits with return code = -7
```
GPU memory is full.
Sorry, I forgot to mention that this runs in a Docker container. After further debugging, the root cause turned out to be the container's shared memory, which is too small at the default setting (64 MB); specifying a larger shared memory size fixed it. A sketch of the fix is below.
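For anyone hitting the same thing: a return code of -7 means the worker process was killed by signal 7, which is SIGBUS on Linux, and in a container that usually points at an exhausted /dev/shm, since NCCL's shared-memory transport and PyTorch dataloader workers both allocate out of it. A minimal sketch of raising the limit when starting the container (the size and image name here are placeholders, not from this repo):

```bash
# The Docker default /dev/shm is only 64 MB, which multi-GPU NCCL
# traffic can exhaust almost immediately. Raise it at container start
# ("16g" and "my-training-image" are illustrative values):
docker run --gpus all --shm-size=16g -it my-training-image bash

# Or share the host's IPC namespace, so the container sees the host's
# /dev/shm instead of a private, size-limited one:
docker run --gpus all --ipc=host -it my-training-image bash
```

Once inside the container, `df -h /dev/shm` shows whether the new size actually took effect.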
Strange. I am using the latest version, and I set the shared memory to 900 GB (the host has 1000 GB), but I still get the -7 error and no other output.
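If a larger shared-memory limit still ends in a bare -7, it may be worth confirming that the setting actually reached the container and coaxing more output out of NCCL before the workers die. A few generic checks (nothing below is specific to LLaMA-Efficient-Tuning):

```bash
# 1. Verify the shared-memory mount inside the container really is
#    as large as requested:
df -h /dev/shm

# 2. Re-run with verbose NCCL logging; shared-memory failures usually
#    surface here even when the DeepSpeed launcher prints nothing:
NCCL_DEBUG=INFO deepspeed -i localhost:1,2,3,4,5,6,7 src/train_bash.py ...

# 3. The kernel log often records the fault behind the SIGBUS that
#    produced the -7:
dmesg | tail -n 50
```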