Closed han508 closed 4 months ago
not enough ram
System Info
The process is suddenly killed, with no error reported, whenever I use a model of 7B parameters or larger. I hit this with both internlm2 7b and qwen1.5 7b, but training works fine with qwen1.5 4b or 1.8b. My GPU is a Tesla T4, CUDA 11.8.
py env: accelerate 0.28.0 addict 2.4.0 aiohttp 3.9.3 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.0 aliyun-python-sdk-kms 2.16.2 annotated-types 0.6.0 appdirs 1.4.4 asttokens 2.4.1 async-timeout 4.0.3 attrs 23.2.0 Brotli 1.0.9 certifi 2022.12.7 cffi 1.16.0 charset-normalizer 2.0.4 click 8.1.7 cmake 3.25.0 comm 0.2.2 crcmod 1.7 cryptography 42.0.5 datasets 2.18.0 debugpy 1.6.7 decorator 5.1.1 deepspeed 0.14.0 dill 0.3.8 docker-pycreds 0.4.0 einops 0.7.0 exceptiongroup 1.2.0 executing 2.0.1 filelock 3.13.1 frozenlist 1.4.1 fsspec 2024.2.0 gast 0.5.4 gitdb 4.0.11 GitPython 3.1.42 gmpy2 2.1.2 hjson 3.1.0 huggingface-hub 0.22.2 idna 3.4 importlib_metadata 7.1.0 ipykernel 6.29.3 ipython 8.22.2 jedi 0.19.1 Jinja2 3.1.2 jmespath 0.10.0 jupyter_client 8.6.1 jupyter_core 5.7.2 lit 15.0.7 llvmlite 0.42.0 MarkupSafe 2.1.3 matplotlib-inline 0.1.6 mkl-fft 1.3.8 mkl-random 1.2.4 mkl-service 2.4.0 modelscope 1.13.3 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 nest_asyncio 1.6.0 networkx 3.1 ninja 1.11.1.1 numba 0.59.1 numpy 1.26.3 nvidia-cublas-cu11 11.11.3.6 nvidia-cuda-cupti-cu11 11.8.87 nvidia-cuda-nvrtc-cu11 11.8.89 nvidia-cuda-runtime-cu11 11.8.89 nvidia-cudnn-cu11 8.7.0.84 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.3.0.86 nvidia-cusolver-cu11 11.4.1.48 nvidia-cusparse-cu11 11.7.5.86 nvidia-nccl-cu11 2.19.3 nvidia-nvtx-cu11 11.8.86 oss2 2.18.4 packaging 24.0 pandas 2.2.1 parso 0.8.3 pexpect 4.9.0 pickleshare 0.7.5 pillow 10.2.0 pip 23.3.1 platformdirs 4.2.0 prompt-toolkit 3.0.42 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 15.0.2 pyarrow-hotfix 0.6 pycparser 2.22 pycryptodome 3.20.0 pydantic 2.6.4 pydantic_core 2.16.3 Pygments 2.17.2 pynvml 11.5.0 PySocks 1.7.1 python-dateutil 2.9.0 pytz 2024.1 PyYAML 6.0.1 pyzmq 25.1.2 regex 2023.12.25 requests 2.28.1 safetensors 0.4.2 scipy 1.12.0 sentencepiece 0.2.0 sentry-sdk 1.44.0 setproctitle 1.3.3 setuptools 68.2.2 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sortedcontainers 2.4.0 
stack-data 0.6.2 sympy 1.12 tiktoken 0.6.0 tokenizers 0.15.2 tomli 2.0.1 torch 1.13.1+cu117 torchaudio 0.13.1+cu117 torchvision 0.14.1+cu117 tornado 6.4 tqdm 4.66.2 traitlets 5.14.2 transformers 4.39.2 transformers-stream-generator 0.0.5 triton 2.1.0 typing_extensions 4.8.0 tzdata 2024.1 urllib3 1.26.13 wandb 0.16.5 wcwidth 0.2.13 wheel 0.41.2 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.17.0
start:
deepspeed --include=localhost:0,1,2,3,4,5,6 --master_port=25640 /home/han2/emo/train.py \
    --model_path /home/han2/model/Shanghai_AI_Laboratory/internlm2-7b \
    --data_path /home/han2/data_sft/dataset \
    --output_dir /home/han2/0331_7b \
    --do_train True \
    --do_eval False \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type 'cosine' \
    --warmup_ratio 0.03 \
    --save_strategy "epoch" \
    --logging_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16 True \
    --save_safetensors \
    --seed 2025 \
    --deepspeed '/home/han2/emo/config/ds.json'
deepspeed config : { "fp16": { "enabled": true },
}
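A process that dies with no Python traceback once the model reaches 7B is consistent with host RAM running out (the Linux OOM killer ends a process with SIGKILL), which would also match the issue title. The following is only a back-of-the-envelope sketch, not a diagnosis. It assumes that each of the 7 launched ranks loads its own full copy of the weights into CPU memory before moving them to the GPU, and it uses the nominal parameter counts (1.8B, 4B, 7B) rather than exact ones. The helper name `host_ram_gib` is hypothetical.

```python
# Rough estimate of host RAM needed just to hold the model weights,
# ASSUMING every rank loads a full copy into CPU memory first.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def host_ram_gib(n_params: float, n_procs: int, dtype: str = "fp32") -> float:
    """GiB of host RAM if each of n_procs processes holds all n_params weights."""
    return n_params * BYTES_PER_PARAM[dtype] * n_procs / 2**30


if __name__ == "__main__":
    n_procs = 7  # GPUs 0-6 in --include=localhost:0,1,2,3,4,5,6
    # Nominal parameter counts; the real checkpoints differ slightly.
    for name, n_params in [("1.8B", 1.8e9), ("4B", 4e9), ("7B", 7e9)]:
        for dtype in ("fp32", "fp16"):
            gib = host_ram_gib(n_params, n_procs, dtype)
            print(f"{name} {dtype}: ~{gib:.0f} GiB across {n_procs} processes")
```

Comparing these figures with the machine's total RAM (for example from `free -g`) and checking `dmesg` for OOM-killer messages would show whether this explanation applies here.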
Who can help?
@pacman100
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
The simple Trainer example from the official example scripts.
Expected behavior
Training with a 7B model should run to completion instead of the process being killed.