DPO training error - Githubissues

lylcst commented 10 months ago

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --model_name_or_path model_path/qwen \
    --do_train \
    --dataset comparison_gpt4_en \
    --template chatml \
    --finetuning_type lora \
    --lora_target c_attn \
    --resume_lora_training False \
    --output_dir output/test \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16 \
    --overwrite_output_dir \
    --cutoff_len 1024

[INFO|trainer.py:1686] 2023-11-05 12:24:31,327 >> Running training
[INFO|trainer.py:1687] 2023-11-05 12:24:31,327 >> Num examples = 36,441
[INFO|trainer.py:1688] 2023-11-05 12:24:31,327 >> Num Epochs = 1
[INFO|trainer.py:1689] 2023-11-05 12:24:31,327 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1692] 2023-11-05 12:24:31,327 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1693] 2023-11-05 12:24:31,327 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1694] 2023-11-05 12:24:31,327 >> Total optimization steps = 2,277
[INFO|trainer.py:1695] 2023-11-05 12:24:31,328 >> Number of trainable parameters = 4,194,304
Traceback (most recent call last):
  File "src/train_bash.py", line 14, in <module>
    main()
  File "src/train_bash.py", line 5, in main
    run_exp()
  File "/scratch/AzureBlobStorage_INPUT2/data/users/v-yulinli/workspace/LLaMA-Factory/src/llmtuner/tuner/tune.py", line 32, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/scratch/AzureBlobStorage_INPUT2/data/users/v-yulinli/workspace/LLaMA-Factory/src/llmtuner/tuner/dpo/workflow.py", line 54, in run_dpo
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1752, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/integrations.py", line 1017, in on_train_begin
    self.setup(args, state, model)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/integrations.py", line 1008, in setup
    self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 755, in log_params
    MlflowClient().log_batch(run_id=run_id, metrics=[], params=params_arr, tags=[])
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/client.py", line 1038, in log_batch
    self._tracking_client.log_batch(run_id, metrics, params, tags)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 389, in log_batch
    self.store.log_batch(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 323, in log_batch
    self._call_endpoint(LogBatch, req_body)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 202, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 134, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'A given key of Parameters can not modify its value after it is set', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': 'a2aff95b090ece7a1949d522ac9b2325', 'request': '4878ba5982aa4ea5'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2023-11-05T12:24:32.3101524+00:00', 'ComponentName': 'mlflow', 'statusCode': 400, 'error_code': 'INVALID_PARAMETER_VALUE'}
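
The server message above ('A given key of Parameters can not modify its value after it is set') suggests that the tracking backend treats run parameters as write-once. Below is a minimal, self-contained sketch of that rule. `WriteOnceParams` and `ParamConflictError` are hypothetical names made up for illustration, not MLflow or Azure ML code, and comparing values as strings is an assumption.

```python
class ParamConflictError(Exception):
    """Raised when a key is logged again with a different value."""


class WriteOnceParams:
    """Hypothetical stand-in for a tracking store with write-once parameters."""

    def __init__(self):
        self._params = {}

    def log_params(self, params):
        # Check the whole batch first so that a rejected batch changes nothing.
        for key, value in params.items():
            value = str(value)  # assumption: values are stored and compared as strings
            if key in self._params and self._params[key] != value:
                raise ParamConflictError(
                    f"param {key!r} is already set to {self._params[key]!r}, "
                    f"refusing to change it to {value!r}"
                )
        for key, value in params.items():
            self._params[key] = str(value)


store = WriteOnceParams()
store.log_params({"learning_rate": 1e-5, "fp16": True})  # first write succeeds
store.log_params({"learning_rate": 1e-5})  # same value again: accepted in this sketch

try:
    store.log_params({"learning_rate": 5e-5})  # different value for an existing key
    conflict = None
except ParamConflictError as exc:
    conflict = str(exc)

print(conflict)
```

If the real backend follows a similar rule, the failure would be expected whenever the callback logs a key that the run already holds with a different value, for example when an existing run is reused.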

hiyouga commented 10 months ago

Please provide your dependency versions.

lylcst commented 10 months ago

Please provide your dependency versions.

torch cuda version ............... 11.7
nvcc version ..................... 11.7
torch version .................... 2.0.1

Package Version
absl-py 2.0.0
accelerate 0.23.0
adal 1.2.7
aiofiles 23.2.1
aiohttp 3.8.5
aiosignal 1.3.1
altair 5.1.2
antlr4-python3-runtime 4.9.3
anyio 4.0.0
apex 0.1
appdirs 1.4.4
applicationinsights 0.11.10
argcomplete 2.1.2
asttokens 2.4.0
async-timeout 4.0.3
attrs 23.1.0
azure-common 1.1.28
azure-core 1.29.4
azure-graphrbac 0.61.1
azure-identity 1.14.0
azure-mgmt-authorization 3.0.0
azure-mgmt-containerregistry 10.2.0
azure-mgmt-core 1.4.0
azure-mgmt-keyvault 10.2.3
azure-mgmt-network 21.0.1
azure-mgmt-resource 22.0.0
azure-mgmt-storage 21.0.0
azure-ml 0.0.1
azure-ml-component 0.9.18.post2
azure-storage-blob 12.13.0
azureml-automl-common-tools 1.53.0
azureml-contrib-services 1.53.0
azureml-core 1.53.0
azureml-dataprep 4.12.4
azureml-dataprep-native 38.0.0
azureml-dataprep-rslex 2.19.5
azureml-dataset-runtime 1.53.0
azureml-defaults 1.53.0
azureml-inference-server-http 0.8.4.1
azureml-mlflow 1.53.0
azureml-telemetry 1.53.0
backcall 0.2.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 4.0.1
bytecode 0.15.0
cachetools 5.3.1
Cerberus 1.3.5
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cloudpickle 2.2.1
cmake 3.27.5
comm 0.1.4
contextlib2 21.6.0
coverage 6.3.1
cryptography 41.0.3
cycler 0.11.0
databricks-cli 0.17.8
datasets 2.14.5
debugpy 1.6.7.post1
decorator 5.1.1
deepspeed 0.9.5
dill 0.3.7
distro 1.8.0
docker 6.1.3
docker-pycreds 0.4.0
docstring-parser 0.15
dotnetcore2 3.1.23
einops 0.7.0
entrypoints 0.4
exceptiongroup 1.1.3
executing 1.2.0
fairscale 0.4.13
fastapi 0.95.1
ffmpy 0.3.1
filelock 3.12.4
fire 0.5.0
Flask 2.2.5
Flask-Cors 3.0.10
flatbuffers 23.5.26
fonttools 4.42.1
frozenlist 1.4.0
fsspec 2023.6.0
fusepy 3.0.1
gitdb 4.0.10
GitPython 3.1.37
google-api-core 2.11.1
google-auth 2.23.0
google-auth-oauthlib 1.0.0
googleapis-common-protos 1.60.0
gradio 3.50.2
gradio_client 0.6.1
grpcio 1.58.0
gunicorn 20.1.0
h11 0.14.0
h5py 3.9.0
hjson 3.1.0
horovod 0.24.2
httpcore 1.0.1
httpx 0.25.1
huggingface-hub 0.17.3
humanfriendly 10.0
idna 3.4
igraph 0.10.8
importlib-metadata 6.8.0
importlib-resources 6.1.0
inference-schema 1.5.1
iniconfig 2.0.0
intel-openmp 2021.4.0
ipykernel 6.25.2
ipython 8.12.2
isodate 0.6.1
itsdangerous 2.1.2
jedi 0.19.0
jeepney 0.8.0
jieba 0.42.1
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonpickle 3.0.2
jsonschema 4.19.1
jsonschema-specifications 2023.7.1
jupyter_client 8.3.1
jupyter_core 5.3.2
kiwisolver 1.4.5
knack 0.10.1
lightning-utilities 0.9.0
lit 16.0.6
lxml 4.9.3
Markdown 3.4.4
markdown-it-py 3.0.0
MarkupSafe 2.1.2
matplotlib 3.5.3
matplotlib-inline 0.1.6
mdurl 0.1.2
mkl 2021.4.0
mkl-include 2021.4.0
mlflow-skinny 2.7.1
mpi4py 3.1.1
mpmath 1.3.0
msal 1.24.0
msal-extensions 1.0.0
msccl 2.3.0
msrest 0.7.1
msrestazure 0.6.4
multidict 6.0.4
multiprocess 0.70.15
ndg-httpsclient 0.5.1
nebulaml 0.16.5
nest-asyncio 1.5.8
networkx 3.1
ninja 1.10.2
nltk 3.8.1
numpy 1.22.2
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.14.1
onnxruntime-training 1.15.1+cu118
opencensus 0.11.2
opencensus-context 0.1.3
opencensus-ext-azure 1.1.9
opencensus-ext-logging 0.1.1
optree 0.9.2
orjson 3.9.10
packaging 23.0
pandas 2.0.3
paramiko 3.3.1
parso 0.8.3
pathspec 0.11.2
pathtools 0.1.2
peft 0.6.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.1
pip 23.2.1
pkginfo 1.9.6
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
pluggy 1.3.0
portalocker 2.8.2
prompt-toolkit 3.0.39
protobuf 3.20.3
psutil 5.8.0
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 5.0.0
py-spy 0.3.12
pyarrow 11.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycparser 2.21
pydantic 1.10.11
pydash 7.0.6
pydub 0.25.1
Pygments 2.16.1
PyJWT 2.8.0
PyNaCl 1.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
pytest 7.1.0
pytest-mpi 0.6
python-dateutil 2.8.2
python-multipart 0.0.6
pytorch-lightning 1.9.3
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
referencing 0.30.2
regex 2023.8.8
requests 2.31.0
requests-oauthlib 1.3.1
rich 13.5.3
rouge-chinese 1.0.3
rpds-py 0.10.3
rsa 4.9
ruamel.yaml 0.17.16
ruamel.yaml.clib 0.2.7
safetensors 0.3.3
scipy 1.10.1
SecretStorage 3.3.3
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.31.0
setproctitle 1.3.2
setuptools 67.6.0
shtab 1.6.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sqlparse 0.4.4
sse-starlette 1.6.5
stack-data 0.6.2
starlette 0.26.1
supervisor 4.2.5
sympy 1.12
tabulate 0.9.0
tbb 2021.10.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
texttable 1.6.7
tiktoken 0.5.1
tokenizers 0.13.3
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.0.1
torch-nebula 0.16.5
torch-ort 1.15.0
torch-tb-profiler 0.4.1
torchaudio 2.0.2+cu117
torchmetrics 0.11.3
torchsnapshot 0.1.0
torchvision 0.15.2+cu117
tornado 6.3.3
tqdm 4.62.3
traitlets 5.10.1
transformers 4.33.3
transformers-stream-generator 0.0.4
triton 2.0.0
trl 0.7.2
tutel 0.1
typing_extensions 4.8.0
tyro 0.5.12
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.24.0
wandb 0.15.11
wcwidth 0.2.6
websocket-client 1.6.3
websockets 11.0.3
Werkzeug 2.3.7
wheel 0.40.0
wrapt 1.12.1
xxhash 3.3.0
yarl 1.9.2
z3-solver 4.12.2.0
zipp 3.16.2

hiyouga commented 10 months ago

This problem is caused by mlflow: https://github.com/mlflow/mlflow/issues/3536
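
One possible workaround, which is not stated in this thread, is to keep the Trainer from attaching its MLflow callback at all. The sketch below assumes that the installed transformers release honors the `DISABLE_MLFLOW_INTEGRATION` environment variable and reads it when it decides which reporting integrations are available, so the variable has to be set before the training arguments are built. `disable_mlflow_integration` is a hypothetical helper name.

```python
import os


def disable_mlflow_integration(env=None):
    """Set the flag that transformers checks before enabling its MLflow callback.

    Assumption: the installed transformers release reads DISABLE_MLFLOW_INTEGRATION.
    Returns the previous value of the variable, or None if it was unset.
    """
    env = os.environ if env is None else env
    previous = env.get("DISABLE_MLFLOW_INTEGRATION")
    env["DISABLE_MLFLOW_INTEGRATION"] = "TRUE"
    return previous


# Demonstrated on a plain dict so the example does not touch the real environment.
demo_env = {}
previous = disable_mlflow_integration(demo_env)
print(previous, demo_env)
```

In a real run the helper would be called with no argument at the top of the launch script, before transformers is imported. Another option may be to append `--report_to none` to the training command, assuming the script forwards standard `TrainingArguments` options.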

hiyouga / LLaMA-Factory

DPO training error #1390