lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0
5.14k stars, 411 forks

Error when reproducing training #7

Closed. acbogeh closed this issue 1 year ago

acbogeh commented 1 year ago

++ echo 'START TIME: Mon Jun 19 19:22:10 CST 2023'
START TIME: Mon Jun 19 19:22:10 CST 2023
++ ROOT_DIR_BASE=/Anima/saved_models/qlora_cn
++ OUTPUT_PATH=/Anima/saved_models/qlora_cn/output_1687173730
++ mkdir -p /Anima/saved_models/qlora_cn/output_1687173730
++ python qlora.py --dataset=chinese-vicuna --dataset_format=alpaca-clean --learning_rate 0.0001 --per_device_train_batch_size 1 --gradient_accumulation_steps 16 --max_steps 10000 --model_name_or_path timdettmers/guanaco-33b-merged --source_max_len 512 --target_max_len 512 --eval_dataset_size 1 --do_eval --evaluation_strategy steps --eval_steps 200 --output_dir /Anima/saved_models/qlora_cn/output_1687173730 --report_to wandb --sample_generate --save_steps 200
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/miniconda3/envs/anima/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/root/miniconda3/envs/anima/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:148: UserWarning: /root/miniconda3/envs/anima did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/root/miniconda3/envs/anima/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:148: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/lcoal/cuda/lib64')}
  warn(msg)
/root/miniconda3/envs/anima/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:148: UserWarning: /usr/lcoal/cuda/lib64: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /root/miniconda3/envs/anima/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
./run_Amina_training.sh: line 48: 36300 Segmentation fault (core dumped) python qlora.py
    --dataset="chinese-vicuna"
    --dataset_format="alpaca-clean" #alpaca-clean has similar format to chinese training dataset
    --learning_rate 0.0001 # QLoRA paper appendix B Table 9
    --per_device_train_batch_size 1 # fix for fitting mem
    --gradient_accumulation_steps 16 # QLoRA paper appendix B Table 9
    --max_steps 10000 # QLoRA paper appendix B Table 9, follow paper setting even though cn data is 690k much bigger than OASST1 9k, batch size considering accum
    --model_name_or_path "timdettmers/guanaco-33b-merged"
    --source_max_len 512 # default setting in code, cn model 2048 too long
    --target_max_len 512 # follow QLoRA paper appendix B Table 9
    --eval_dataset_size 1 # mainly for testing, no need to be big
    --do_eval
    --evaluation_strategy "steps"
    --eval_steps 200 # 10 for debug mode only, 200 for training
    --output_dir $OUTPUT_PATH
    --report_to 'wandb'
    --sample_generate # test sample generation every once a while
    --save_steps 200 # 20 for debug mode only, 200 for training

lyogavin commented 1 year ago

Do you have the list of installed pip packages? Could you run pip freeze? And which GPU model is it? I can try to reproduce this.

acbogeh commented 1 year ago

> Do you have the list of installed pip packages? Could you run pip freeze? And which GPU model is it? I can try to reproduce this.

(anima) [root@LLM01GPU Anima]# pip freeze
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
async-timeout==4.0.2
attrs==23.1.0
bitsandbytes==0.39.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
datasets==2.13.0
dill==0.3.6
docker-pycreds==0.4.0
einops==0.6.1
evaluate==0.4.0
filelock==3.12.2
frozenlist==1.3.3
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.31
huggingface-hub==0.15.1
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
lit==16.0.6
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.1
pandas==2.0.2
pathtools==0.1.2
peft==0.3.0
protobuf==4.23.3
psutil==5.9.5
pyarrow==12.0.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
responses==0.18.0
scikit-learn==1.2.2
scipy==1.10.1
sentencepiece==0.1.99
sentry-sdk==1.25.1
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sympy==1.12
threadpoolctl==3.1.0
tokenizers==0.13.3
torch==2.0.1
tqdm==4.65.0
transformers==4.29.1
triton==2.0.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
wandb==0.15.3
xxhash==3.2.0
yarl==1.9.2

GPU: A100 80G

lyogavin commented 1 year ago

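Could there be a CUDA toolkit version mismatch? Your log shows that 12.1 is being used: packages/bitsandbytes/libbitsandbytes_cuda121.so, but the pip packages are built for CUDA 11: nvidia-cublas-cu11. Could you try reinstalling the CUDA toolkit?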
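The comparison suggested above can be sketched as a small script. This is a minimal illustration rather than part of the Anima repo or bitsandbytes: the helper names are hypothetical, and it assumes the binary name encodes the CUDA version with a single-digit minor (so `cuda121` means 12.1), as the "Detected CUDA version 121" log line suggests. It only flags the mismatch; the thread does not confirm that this is what causes the segmentation fault.

```python
import re


def cuda_major_from_bnb_binary(path):
    """Return the CUDA major version encoded in a bitsandbytes binary name.

    Assumption: names look like libbitsandbytes_cuda121.so, where the last
    digit is the minor version (121 -> 12.1, 117 -> 11.7).
    """
    match = re.search(r"libbitsandbytes_cuda(\d+)", path)
    if match is None:
        return None
    return int(match.group(1)[:-1])  # drop the trailing minor digit


def cuda_majors_from_pip_freeze(freeze_text):
    """Collect CUDA major versions from nvidia-*-cuNN package names."""
    return {int(v) for v in re.findall(r"nvidia-[a-z-]+-cu(\d+)==", freeze_text)}


def has_cuda_mismatch(bnb_binary_path, freeze_text):
    """True if the pip CUDA packages target a different major version."""
    bnb_major = cuda_major_from_bnb_binary(bnb_binary_path)
    pip_majors = cuda_majors_from_pip_freeze(freeze_text)
    if bnb_major is None or not pip_majors:
        return False  # not enough information to decide
    return pip_majors != {bnb_major}


# Values copied from the log and pip freeze output in this thread.
bnb_binary = "site-packages/bitsandbytes/libbitsandbytes_cuda121.so"
freeze = "nvidia-cublas-cu11==11.10.3.66\nnvidia-cuda-runtime-cu11==11.7.99\ntorch==2.0.1\n"

if has_cuda_mismatch(bnb_binary, freeze):
    print("bitsandbytes loaded a CUDA", cuda_major_from_bnb_binary(bnb_binary),
          "binary, but pip packages target CUDA", sorted(cuda_majors_from_pip_freeze(freeze)))
```

On a live system, the `freeze` string could come from the output of `pip freeze`, and the binary path from the "Loading binary" line that bitsandbytes prints at import time.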