hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Single-node multi-GPU DeepSpeed SFT training of Llama2-70b-chat-hf fails: exits with return code = -7 #285

Closed · qingqiuhe closed this issue 1 year ago

qingqiuhe commented 1 year ago

The launcher exits with return code -7, but no error details are printed.
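For reference, the DeepSpeed launcher reports a negative return code when a worker process is killed by a signal, so -7 means the worker died from signal 7, which is SIGBUS on Linux. The signal name can be checked from a shell:

# Negative launcher return codes are the number of the signal that killed the worker.
# Signal 7 on Linux is SIGBUS (commonly triggered by exhausted shared memory).
kill -l 7   # prints: BUS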

# pip list
Package                   Version
------------------------- -----------
absl-py                   1.4.0
accelerate                0.21.0
aiofiles                  23.1.0
aiohttp                   3.8.4
aiosignal                 1.3.1
altair                    5.0.1
anyio                     3.7.1
appdirs                   1.4.4
asttokens                 2.0.5
astunparse                1.6.3
async-timeout             4.0.2
attrs                     22.2.0
backcall                  0.2.0
beautifulsoup4            4.11.1
bitsandbytes              0.37.1
black                     23.3.0
brotlipy                  0.7.0
cachetools                5.3.1
certifi                   2022.12.7
cffi                      1.15.1
chardet                   4.0.0
charset-normalizer        2.0.4
click                     8.1.4
conda                     23.1.0
conda-build               3.23.3
conda-content-trust       0.1.3
conda-package-handling    2.0.2
conda_package_streaming   0.7.0
contourpy                 1.1.0
cryptography              39.0.1
cycler                    0.11.0
datasets                  2.14.1
decorator                 5.1.1
deepspeed                 0.9.5
dill                      0.3.6
dnspython                 2.3.0
exceptiongroup            1.1.1
executing                 0.8.3
expecttest                0.1.4
fastapi                   0.95.1
ffmpy                     0.3.0
filelock                  3.9.0
fire                      0.5.0
flit_core                 3.6.0
fonttools                 4.40.0
frozenlist                1.3.3
fsspec                    2023.6.0
glob2                     0.7
gmpy2                     2.1.2
google-auth               2.22.0
google-auth-oauthlib      1.0.0
gradio                    3.39.0
gradio_client             0.3.0
grpcio                    1.56.0
h11                       0.14.0
hjson                     3.1.0
httpcore                  0.17.3
httpx                     0.24.1
huggingface-hub           0.16.4
hypothesis                6.70.0
idna                      3.4
ipython                   8.10.0
jedi                      0.18.1
jieba                     0.42.1
Jinja2                    3.1.2
joblib                    1.3.1
jsonschema                4.18.0
jsonschema-specifications 2023.6.1
kiwisolver                1.4.4
libarchive-c              2.9
linkify-it-py             2.0.2
Markdown                  3.4.3
markdown-it-py            2.2.0
MarkupSafe                2.1.1
matplotlib                3.7.2
matplotlib-inline         0.1.6
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mkl-fft                   1.3.1
mkl-random                1.2.2
mkl-service               2.4.0
mpmath                    1.3.0
multidict                 6.0.4
multiprocess              0.70.14
mypy-extensions           1.0.0
networkx                  3.0
ninja                     1.11.1
nltk                      3.8.1
numpy                     1.23.5
oauthlib                  3.2.2
orjson                    3.9.2
packaging                 23.1
pandas                    1.5.3
parso                     0.8.3
pathspec                  0.11.1
peft                      0.4.0
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.4.0
pip                       22.3.1
pkginfo                   1.8.3
platformdirs              3.8.1
pluggy                    1.0.0
prompt-toolkit            3.0.36
protobuf                  4.23.4
psutil                    5.9.0
ptyprocess                0.7.0
pure-eval                 0.2.2
py-cpuinfo                9.0.0
pyarrow                   12.0.1
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pycosat                   0.6.4
pycparser                 2.21
pydantic                  1.10.11
pydub                     0.25.1
Pygments                  2.11.2
pyOpenSSL                 23.0.0
pyparsing                 3.0.9
PySocks                   1.7.1
python-dateutil           2.8.2
python-etcd               0.4.5
python-multipart          0.0.6
pytz                      2022.7
PyYAML                    6.0
referencing               0.29.1
regex                     2023.6.3
requests                  2.28.1
requests-oauthlib         1.3.1
responses                 0.18.0
rouge-chinese             1.0.3
rpds-py                   0.8.10
rsa                       4.9
ruamel.yaml               0.17.21
ruamel.yaml.clib          0.2.6
safetensors               0.3.1
semantic-version          2.10.0
sentencepiece             0.1.99
setuptools                65.6.3
six                       1.16.0
sniffio                   1.3.0
sortedcontainers          2.4.0
soupsieve                 2.3.2.post1
sse-starlette             1.6.1
stack-data                0.2.0
starlette                 0.26.1
sympy                     1.11.1
tensorboard               2.13.0
tensorboard-data-server   0.7.1
termcolor                 2.3.0
tokenize-rt               5.1.0
tokenizers                0.13.3
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.0
torch                     2.0.0
torchaudio                2.0.0
torchdata                 0.6.0
torchelastic              0.2.2
torchtext                 0.15.0
torchvision               0.15.0
tqdm                      4.64.1
traitlets                 5.7.1
transformers              4.30.1
triton                    2.0.0
trl                       0.4.7
types-dataclasses         0.6.6
typing_extensions         4.5.0
uc-micro-py               1.0.2
urllib3                   1.26.14
uvicorn                   0.22.0
wcwidth                   0.2.5
websockets                11.0.3
Werkzeug                  2.3.6
wheel                     0.37.1
xxhash                    3.2.0
yarl                      1.9.2
zstandard                 0.19.0

DeepSpeed Config:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "betas": "auto",
            "eps": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "sub_group_size": 1000000000,
        "stage3_max_live_parameters": 1000000000,
        "stage3_max_reuse_distance": 1000000000,
        "stage3_gather_16bit_weights_on_model_save": "auto",
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 20,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "dump_state": true
}
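Note that with both offload_optimizer and offload_param on "cpu" and pin_memory enabled, ZeRO-3 keeps the full fp16 parameter copy in page-locked host memory. A back-of-the-envelope estimate of the combined footprint across the node's ranks for a 70B model (parameters only, ignoring optimizer states and activations):

# ~70e9 parameters * 2 bytes per fp16 value, summed over all shards:
python3 -c 'print(f"{70e9 * 2 / 2**30:.0f} GiB")'   # ~130 GiB of pinned host RAM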

Run log:

root@22f6e415e3a4:/mnt/nfs207/mnt/disk2/LLaMA-Efficient-Tuning# deepspeed -i localhost:1,2,3,4,5,6,7  src/train_bash.py     --stage sft     --model_name_or_path  /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf     --do_train     --dataset alpaca_gpt4_en     --finetuning_type lora     --output_dir /tmp/output     --overwrite_cache     --per_device_train_batch_size 1     --gradient_accumulation_steps 1     --lr_scheduler_type cosine     --logging_steps 1     --save_steps 1000     --learning_rate 5e-5     --num_train_epochs 1.0     --fp16     --prompt_template llama2     --use_fast_tokenizer     --deepspeed ./zero_config.json
[2023-07-31 19:12:36,679] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[2023-07-31 19:12:40,172] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-31 19:12:40,172] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/train_bash.py --stage sft --model_name_or_path /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf --do_train --dataset alpaca_gpt4_en --finetuning_type lora --output_dir /tmp/output --overwrite_cache --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --logging_steps 1 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 1.0 --fp16 --prompt_template llama2 --use_fast_tokenizer --deepspeed ./zero_config.json
[2023-07-31 19:12:41,664] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[... identical bitsandbytes BUG REPORT / CUDA SETUP output repeated ...]
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-07-31 19:12:43,035] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-07-31 19:12:43,035] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [1, 2, 3, 4, 5, 6, 7]}
[2023-07-31 19:12:43,035] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=7, node_rank=0
[2023-07-31 19:12:43,036] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6]})
[2023-07-31 19:12:43,036] [INFO] [launch.py:163:main] dist_world_size=7
[2023-07-31 19:12:43,036] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7

[... the same bitsandbytes BUG REPORT / CUDA SETUP output is then repeated by each of the 7 worker processes ...]
[2023-07-31 19:12:46,898] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:46,910] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:46,920] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:46,956] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:46,959] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:46,973] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:47,004] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-31 19:12:49,145] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,145] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,145] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,145] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,165] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,166] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,194] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,194] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,229] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,229] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,229] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-31 19:12:49,252] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,252] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-31 19:12:49,515] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-31 19:12:49,515] [INFO] [comm.py:594:init_distributed] cdb=None
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 0, device: cuda:0, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=./zero_config.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/tmp/output/runs/Jul31_19-12-49_22f6e415e3a4,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
output_dir=/tmp/output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/tmp/output,
save_on_each_node=False,
save_safetensors=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
07/31/2023 19:12:50 - INFO - llmtuner.dsets.loader - Loading dataset alpaca_gpt4_data_en.json...
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 5, device: cuda:5, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 4, device: cuda:4, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 6, device: cuda:6, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 1, device: cuda:1, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - WARNING - llmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 3, device: cuda:3, n_gpu: 1
  distributed training: True, 16-bits training: True
07/31/2023 19:12:50 - INFO - llmtuner.tuner.core.parser - Process rank: 2, device: cuda:2, n_gpu: 1
  distributed training: True, 16-bits training: True
[... the same Seq2SeqTrainingArguments dump is repeated by ranks 1-6, differing only in local_rank, interleaved with per-rank "Loading dataset alpaca_gpt4_data_en.json..." messages ...]
/opt/conda/lib/python3.10/site-packages/datasets/load.py:2069: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
[... the same FutureWarning is repeated by each of the remaining ranks ...]
Using custom data configuration default-91f63e1f2acd5b18
Loading Dataset Infos from /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /root/.cache/huggingface/datasets/json/default-91f63e1f2acd5b18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,319 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1821] 2023-07-31 19:12:51,320 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:667] 2023-07-31 19:12:51,385 >> loading configuration file /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf/config.json
[INFO|configuration_utils.py:725] 2023-07-31 19:12:51,386 >> Model config LlamaConfig {
  "_name_or_path": "/mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.30.1",
  "use_cache": true,
  "vocab_size": 32000
}

[INFO|modeling_utils.py:2575] 2023-07-31 19:12:51,405 >> loading weights file /mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1173] 2023-07-31 19:12:51,406 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|modeling_utils.py:2669] 2023-07-31 19:12:51,406 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:577] 2023-07-31 19:12:51,410 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.30.1"
}

[2023-07-31 19:12:58,056] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1386
[2023-07-31 19:12:58,058] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1387
[2023-07-31 19:12:58,281] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1388
[2023-07-31 19:12:58,281] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1389
[2023-07-31 19:12:58,282] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1390
[2023-07-31 19:12:58,537] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1391
[2023-07-31 19:12:58,538] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1392
[2023-07-31 19:12:58,539] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'src/train_bash.py', '--local_rank=6', '--stage', 'sft', '--model_name_or_path', '/mnt/nfs207/mnt/disk2/Llama-2-70b-chat-hf', '--do_train', '--dataset', 'alpaca_gpt4_en', '--finetuning_type', 'lora', '--output_dir', '/tmp/output', '--overwrite_cache', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--save_steps', '1000', '--learning_rate', '5e-5', '--num_train_epochs', '1.0', '--fp16', '--prompt_template', 'llama2', '--use_fast_tokenizer', '--deepspeed', './zero_config.json'] exits with return code = -7
hiyouga commented 1 year ago

GPU memory is full.

qingqiuhe commented 1 year ago

Sorry, I forgot to mention that this runs inside a Docker container. The root cause turned out to be the container's shared memory, which is too small at Docker's default (64 MB); specifying a larger shared-memory size fixed it.
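For reference, Docker caps /dev/shm at 64 MB by default, while NCCL's shared-memory transport needs far more with several ranks per node. The limit can be raised when starting the container; the size and the <image> placeholder below are illustrative:

# Raise the container's /dev/shm above the 64 MB default (size is illustrative):
docker run --gpus all --shm-size=16g <image>
# Or share the host IPC namespace so /dev/shm is not capped at all:
docker run --gpus all --ipc=host <image>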

linchen111 commented 6 months ago

Sorry, I forgot to mention that this runs inside a Docker container. The root cause turned out to be the container's shared memory, which is too small at Docker's default (64 MB); specifying a larger shared-memory size fixed it.

Strange: I'm using the latest version and set shared memory to 900 GB (the host has 1000 GB), but I still get the -7 error, with no other output.
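A first sanity check in this situation is to confirm, from inside the container, that the larger setting actually took effect:

# Inside the container: verify the mounted size of /dev/shm.
df -h /dev/shm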