Closed han508 closed 4 months ago

han508 commented 5 months ago

System Info

The process is suddenly killed, with no error reported, whenever I use a model with 7B parameters or more. I hit this problem with both internlm2 7b and qwen1.5 7b, but training works fine with qwen1.5 4b or 1.8b. My GPU is a Tesla T4 with CUDA 11.8.

py env: accelerate 0.28.0 addict 2.4.0 aiohttp 3.9.3 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.0 aliyun-python-sdk-kms 2.16.2 annotated-types 0.6.0 appdirs 1.4.4 asttokens 2.4.1 async-timeout 4.0.3 attrs 23.2.0 Brotli 1.0.9 certifi 2022.12.7 cffi 1.16.0 charset-normalizer 2.0.4 click 8.1.7 cmake 3.25.0 comm 0.2.2 crcmod 1.7 cryptography 42.0.5 datasets 2.18.0 debugpy 1.6.7 decorator 5.1.1 deepspeed 0.14.0 dill 0.3.8 docker-pycreds 0.4.0 einops 0.7.0 exceptiongroup 1.2.0 executing 2.0.1 filelock 3.13.1 frozenlist 1.4.1 fsspec 2024.2.0 gast 0.5.4 gitdb 4.0.11 GitPython 3.1.42 gmpy2 2.1.2 hjson 3.1.0 huggingface-hub 0.22.2 idna 3.4 importlib_metadata 7.1.0 ipykernel 6.29.3 ipython 8.22.2 jedi 0.19.1 Jinja2 3.1.2 jmespath 0.10.0 jupyter_client 8.6.1 jupyter_core 5.7.2 lit 15.0.7 llvmlite 0.42.0 MarkupSafe 2.1.3 matplotlib-inline 0.1.6 mkl-fft 1.3.8 mkl-random 1.2.4 mkl-service 2.4.0 modelscope 1.13.3 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 nest_asyncio 1.6.0 networkx 3.1 ninja numba 0.59.1 numpy 1.26.3 nvidia-cublas-cu11 nvidia-cuda-cupti-cu11 11.8.87 nvidia-cuda-nvrtc-cu11 11.8.89 nvidia-cuda-runtime-cu11 11.8.89 nvidia-cudnn-cu11 nvidia-cufft-cu11 nvidia-curand-cu11 nvidia-cusolver-cu11 nvidia-cusparse-cu11 nvidia-nccl-cu11 2.19.3 nvidia-nvtx-cu11 11.8.86 oss2 2.18.4 packaging 24.0 pandas 2.2.1 parso 0.8.3 pexpect 4.9.0 pickleshare 0.7.5 pillow 10.2.0 pip 23.3.1 platformdirs 4.2.0 prompt-toolkit 3.0.42 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 15.0.2 pyarrow-hotfix 0.6 pycparser 2.22 pycryptodome 3.20.0 pydantic 2.6.4 pydantic_core 2.16.3 Pygments 2.17.2 pynvml 11.5.0 PySocks 1.7.1 python-dateutil 2.9.0 pytz 2024.1 PyYAML 6.0.1 pyzmq 25.1.2 regex 2023.12.25 requests 2.28.1 safetensors 0.4.2 scipy 1.12.0 sentencepiece 0.2.0 sentry-sdk 1.44.0 setproctitle 1.3.3 setuptools 68.2.2 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sortedcontainers 2.4.0 stack-data 0.6.2 sympy 1.12 tiktoken 0.6.0 tokenizers 0.15.2 tomli 2.0.1 
torch 1.13.1+cu117 torchaudio 0.13.1+cu117 torchvision 0.14.1+cu117 tornado 6.4 tqdm 4.66.2 traitlets 5.14.2 transformers 4.39.2 transformers-stream-generator 0.0.5 triton 2.1.0 typing_extensions 4.8.0 tzdata 2024.1 urllib3 1.26.13 wandb 0.16.5 wcwidth 0.2.13 wheel 0.41.2 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.17.0

start:

deepspeed --include=localhost:0,1,2,3,4,5,6 --master_port=25640 /home/han2/emo/ \
    --model_path /home/han2/model/Shanghai_AI_Laboratory/internlm2-7b \
    --data_path /home/han2/data_sft/dataset \
    --output_dir /home/han2/0331_7b \
    --do_train True \
    --do_eval False \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type 'cosine' \
    --warmup_ratio 0.03 \
    --save_strategy "epoch" \
    --logging_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16 True \
    --save_safetensors \
    --seed 2025 \
    --deepspeed '/home/han2/emo/config/ds.json'

deepspeed config:

{
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true,
        "memory_efficient_linear": false
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "prescale_gradients": false,
    "wall_clock_breakdown": false
}



Expected behavior

Please help me solve this problem.

aliencaocao commented 5 months ago

Not enough RAM.
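
A process that dies with no traceback would be consistent with the Linux out-of-memory killer, which terminates the process with SIGKILL so Python never gets a chance to print an error. With ZeRO stage 3 and both `offload_optimizer` and `offload_param` set to `cpu`, the model states live in host memory rather than on the GPUs. The sketch below is a back-of-the-envelope estimate, not DeepSpeed's own accounting: the function names, the figure of 16 bytes per parameter (the usual mixed-precision Adam breakdown) and the 1.5x buffer factor are all assumptions chosen for illustration.

```python
import os

GIB = 1024 ** 3


def estimate_offload_cpu_ram_gib(num_params, bytes_per_param=16, buffer_factor=1.5):
    """Rough host-RAM estimate in GiB for ZeRO-3 with optimizer and parameter offload.

    bytes_per_param=16 is the common mixed-precision Adam accounting:
    2 (fp16 weights) + 2 (fp16 gradients) + 4 + 4 + 4 (fp32 weights, momentum, variance).
    buffer_factor is an assumed safety margin for pinned buffers and fragmentation.
    """
    return num_params * bytes_per_param * buffer_factor / GIB


def physical_ram_gib():
    """Total physical RAM in GiB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / GIB
    except (ValueError, OSError, AttributeError):
        return None


if __name__ == "__main__":
    installed = physical_ram_gib()
    for name, num_params in [("1.8B", 1.8e9), ("4B", 4e9), ("7B", 7e9)]:
        needed = estimate_offload_cpu_ram_gib(num_params)
        if installed is None:
            verdict = "installed RAM unknown"
        elif needed <= installed:
            verdict = "fits in installed RAM"
        else:
            verdict = "exceeds installed RAM"
        print(f"{name}: roughly {needed:.0f} GiB of host RAM ({verdict})")
```

Under these assumptions a 7B model needs on the order of 150 GiB of host memory, while the 4B and 1.8B models need far less, which would match the pattern reported above. Peak usage while loading can be higher still if each of the 7 ranks first materializes the full checkpoint in host memory. On Linux, the kernel log (for example via `dmesg`) normally records when the OOM killer has terminated a process.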

