huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Suddenly the process is killed #29987

Closed. han508 closed this issue 4 months ago.

han508 commented 5 months ago

System Info

The process is suddenly killed and no error is reported when I use a model that is 7B or larger. I hit this with both internlm2-7b and qwen1.5-7b, but qwen1.5-4b and qwen1.5-1.8b work fine. My GPUs are Tesla T4s with CUDA 11.8.
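Since psutil 5.9.8 is already in the environment listed below, a minimal monitoring sketch like this (the interval and log format are my own choices, not from the original report) can be dropped near the top of train.py to watch host RAM and confirm whether system memory, rather than GPU memory, is exhausted right before the kill:

```python
import threading
import time

import psutil


def log_host_memory(interval_s: float = 10.0) -> None:
    """Periodically print host RAM usage; runs alongside training."""
    while True:
        mem = psutil.virtual_memory()
        print(
            f"[mem-monitor] used={mem.used / 1e9:.1f} GB "
            f"available={mem.available / 1e9:.1f} GB ({mem.percent}% used)",
            flush=True,
        )
        time.sleep(interval_s)


# Start as a daemon thread so it dies with the training process.
threading.Thread(target=log_host_memory, daemon=True).start()
```

If the last lines before the kill show available memory approaching zero, the problem is host RAM, not the GPUs.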

py env:

```
accelerate                     0.28.0
addict                         2.4.0
aiohttp                        3.9.3
aiosignal                      1.3.1
aliyun-python-sdk-core         2.15.0
aliyun-python-sdk-kms          2.16.2
annotated-types                0.6.0
appdirs                        1.4.4
asttokens                      2.4.1
async-timeout                  4.0.3
attrs                          23.2.0
Brotli                         1.0.9
certifi                        2022.12.7
cffi                           1.16.0
charset-normalizer             2.0.4
click                          8.1.7
cmake                          3.25.0
comm                           0.2.2
crcmod                         1.7
cryptography                   42.0.5
datasets                       2.18.0
debugpy                        1.6.7
decorator                      5.1.1
deepspeed                      0.14.0
dill                           0.3.8
docker-pycreds                 0.4.0
einops                         0.7.0
exceptiongroup                 1.2.0
executing                      2.0.1
filelock                       3.13.1
frozenlist                     1.4.1
fsspec                         2024.2.0
gast                           0.5.4
gitdb                          4.0.11
GitPython                      3.1.42
gmpy2                          2.1.2
hjson                          3.1.0
huggingface-hub                0.22.2
idna                           3.4
importlib_metadata             7.1.0
ipykernel                      6.29.3
ipython                        8.22.2
jedi                           0.19.1
Jinja2                         3.1.2
jmespath                       0.10.0
jupyter_client                 8.6.1
jupyter_core                   5.7.2
lit                            15.0.7
llvmlite                       0.42.0
MarkupSafe                     2.1.3
matplotlib-inline              0.1.6
mkl-fft                        1.3.8
mkl-random                     1.2.4
mkl-service                    2.4.0
modelscope                     1.13.3
mpmath                         1.3.0
multidict                      6.0.5
multiprocess                   0.70.16
nest_asyncio                   1.6.0
networkx                       3.1
ninja                          1.11.1.1
numba                          0.59.1
numpy                          1.26.3
nvidia-cublas-cu11             11.11.3.6
nvidia-cuda-cupti-cu11         11.8.87
nvidia-cuda-nvrtc-cu11         11.8.89
nvidia-cuda-runtime-cu11       11.8.89
nvidia-cudnn-cu11              8.7.0.84
nvidia-cufft-cu11              10.9.0.58
nvidia-curand-cu11             10.3.0.86
nvidia-cusolver-cu11           11.4.1.48
nvidia-cusparse-cu11           11.7.5.86
nvidia-nccl-cu11               2.19.3
nvidia-nvtx-cu11               11.8.86
oss2                           2.18.4
packaging                      24.0
pandas                         2.2.1
parso                          0.8.3
pexpect                        4.9.0
pickleshare                    0.7.5
pillow                         10.2.0
pip                            23.3.1
platformdirs                   4.2.0
prompt-toolkit                 3.0.42
protobuf                       4.25.3
psutil                         5.9.8
ptyprocess                     0.7.0
pure-eval                      0.2.2
py-cpuinfo                     9.0.0
pyarrow                        15.0.2
pyarrow-hotfix                 0.6
pycparser                      2.22
pycryptodome                   3.20.0
pydantic                       2.6.4
pydantic_core                  2.16.3
Pygments                       2.17.2
pynvml                         11.5.0
PySocks                        1.7.1
python-dateutil                2.9.0
pytz                           2024.1
PyYAML                         6.0.1
pyzmq                          25.1.2
regex                          2023.12.25
requests                       2.28.1
safetensors                    0.4.2
scipy                          1.12.0
sentencepiece                  0.2.0
sentry-sdk                     1.44.0
setproctitle                   1.3.3
setuptools                     68.2.2
simplejson                     3.19.2
six                            1.16.0
smmap                          5.0.1
sortedcontainers               2.4.0
stack-data                     0.6.2
sympy                          1.12
tiktoken                       0.6.0
tokenizers                     0.15.2
tomli                          2.0.1
torch                          1.13.1+cu117
torchaudio                     0.13.1+cu117
torchvision                    0.14.1+cu117
tornado                        6.4
tqdm                           4.66.2
traitlets                      5.14.2
transformers                   4.39.2
transformers-stream-generator  0.0.5
triton                         2.1.0
typing_extensions              4.8.0
tzdata                         2024.1
urllib3                        1.26.13
wandb                          0.16.5
wcwidth                        0.2.13
wheel                          0.41.2
xxhash                         3.4.1
yapf                           0.40.2
yarl                           1.9.4
zipp                           3.17.0
```

start:

```bash
deepspeed --include=localhost:0,1,2,3,4,5,6 --master_port=25640 /home/han2/emo/train.py \
    --model_path /home/han2/model/Shanghai_AI_Laboratory/internlm2-7b \
    --data_path /home/han2/data_sft/dataset \
    --output_dir /home/han2/0331_7b \
    --do_train True \
    --do_eval False \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type 'cosine' \
    --warmup_ratio 0.03 \
    --save_strategy "epoch" \
    --logging_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16 True \
    --save_safetensors \
    --seed 2025 \
    --deepspeed '/home/han2/emo/config/ds.json'
```

deepspeed config:

```json
{
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true,
        "memory_efficient_linear": false
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "prescale_gradients": false,
    "wall_clock_breakdown": false
}
```
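DeepSpeed ships a live memory estimator for ZeRO-3 that can sanity-check this config before launching. A minimal sketch, assuming the internlm2-7b path from the launch command above and that the checkpoint requires trust_remote_code (as InternLM2 checkpoints typically do):

```python
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_live,
)

# Load on CPU just to count parameters; no GPU is needed for the estimate.
model = AutoModelForCausalLM.from_pretrained(
    "/home/han2/model/Shanghai_AI_Laboratory/internlm2-7b",
    trust_remote_code=True,
)

# Prints per-GPU and per-node CPU memory requirements for ZeRO-3,
# with and without optimizer/parameter offload.
estimate_zero3_model_states_mem_needs_all_live(
    model, num_gpus_per_node=7, num_nodes=1
)
```

With offload_optimizer and offload_param both set to "cpu", the fp32 optimizer states and master weights of a 7B model live in host RAM rather than on the GPUs, so the "offload" row of this estimate is the number to compare against the machine's physical memory.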


Who can help?

@pacman100

Information

Tasks

Reproduction

The simple official Trainer example scripts.

Expected behavior

Training should proceed (or fail with an explicit error) rather than the process being killed silently.

aliencaocao commented 5 months ago

Not enough RAM. Your ZeRO-3 config offloads both optimizer states and parameters to CPU, so a 7B model needs far more host memory than the 4B/1.8B ones; when the host runs out, the Linux OOM killer terminates the process with no Python traceback, which matches the silent kill you are seeing.
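You can confirm the OOM kill from the kernel log. A minimal sketch (reading dmesg usually requires root, and the exact message wording varies by kernel version):

```python
import subprocess

# Read the kernel ring buffer with human-readable timestamps.
log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout

# The OOM killer logs lines like "Out of memory: Killed process <pid> (python)".
for line in log.splitlines():
    if "Out of memory" in line or "oom-kill" in line:
        print(line)
```

If a matching line names your training PID, the fix is to free host memory: fewer ranks per node, offload to NVMe instead of CPU, or a machine with more RAM.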

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.