Closed lylcst closed 10 months ago
Please provide your dependency versions.
torch cuda version ............... 11.7
nvcc version ..................... 11.7
torch version .................... 2.0.1
Package Version
absl-py 2.0.0
accelerate 0.23.0
adal 1.2.7
aiofiles 23.2.1
aiohttp 3.8.5
aiosignal 1.3.1
altair 5.1.2
antlr4-python3-runtime 4.9.3
anyio 4.0.0
apex 0.1
appdirs 1.4.4
applicationinsights 0.11.10
argcomplete 2.1.2
asttokens 2.4.0
async-timeout 4.0.3
attrs 23.1.0
azure-common 1.1.28
azure-core 1.29.4
azure-graphrbac 0.61.1
azure-identity 1.14.0
azure-mgmt-authorization 3.0.0
azure-mgmt-containerregistry 10.2.0
azure-mgmt-core 1.4.0
azure-mgmt-keyvault 10.2.3
azure-mgmt-network 21.0.1
azure-mgmt-resource 22.0.0
azure-mgmt-storage 21.0.0
azure-ml 0.0.1
azure-ml-component 0.9.18.post2
azure-storage-blob 12.13.0
azureml-automl-common-tools 1.53.0
azureml-contrib-services 1.53.0
azureml-core 1.53.0
azureml-dataprep 4.12.4
azureml-dataprep-native 38.0.0
azureml-dataprep-rslex 2.19.5
azureml-dataset-runtime 1.53.0
azureml-defaults 1.53.0
azureml-inference-server-http 0.8.4.1
azureml-mlflow 1.53.0
azureml-telemetry 1.53.0
backcall 0.2.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 4.0.1
bytecode 0.15.0
cachetools 5.3.1
Cerberus 1.3.5
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cloudpickle 2.2.1
cmake 3.27.5
comm 0.1.4
contextlib2 21.6.0
coverage 6.3.1
cryptography 41.0.3
cycler 0.11.0
databricks-cli 0.17.8
datasets 2.14.5
debugpy 1.6.7.post1
decorator 5.1.1
deepspeed 0.9.5
dill 0.3.7
distro 1.8.0
docker 6.1.3
docker-pycreds 0.4.0
docstring-parser 0.15
dotnetcore2 3.1.23
einops 0.7.0
entrypoints 0.4
exceptiongroup 1.1.3
executing 1.2.0
fairscale 0.4.13
fastapi 0.95.1
ffmpy 0.3.1
filelock 3.12.4
fire 0.5.0
Flask 2.2.5
Flask-Cors 3.0.10
flatbuffers 23.5.26
fonttools 4.42.1
frozenlist 1.4.0
fsspec 2023.6.0
fusepy 3.0.1
gitdb 4.0.10
GitPython 3.1.37
google-api-core 2.11.1
google-auth 2.23.0
google-auth-oauthlib 1.0.0
googleapis-common-protos 1.60.0
gradio 3.50.2
gradio_client 0.6.1
grpcio 1.58.0
gunicorn 20.1.0
h11 0.14.0
h5py 3.9.0
hjson 3.1.0
horovod 0.24.2
httpcore 1.0.1
httpx 0.25.1
huggingface-hub 0.17.3
humanfriendly 10.0
idna 3.4
igraph 0.10.8
importlib-metadata 6.8.0
importlib-resources 6.1.0
inference-schema 1.5.1
iniconfig 2.0.0
intel-openmp 2021.4.0
ipykernel 6.25.2
ipython 8.12.2
isodate 0.6.1
itsdangerous 2.1.2
jedi 0.19.0
jeepney 0.8.0
jieba 0.42.1
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonpickle 3.0.2
jsonschema 4.19.1
jsonschema-specifications 2023.7.1
jupyter_client 8.3.1
jupyter_core 5.3.2
kiwisolver 1.4.5
knack 0.10.1
lightning-utilities 0.9.0
lit 16.0.6
lxml 4.9.3
Markdown 3.4.4
markdown-it-py 3.0.0
MarkupSafe 2.1.2
matplotlib 3.5.3
matplotlib-inline 0.1.6
mdurl 0.1.2
mkl 2021.4.0
mkl-include 2021.4.0
mlflow-skinny 2.7.1
mpi4py 3.1.1
mpmath 1.3.0
msal 1.24.0
msal-extensions 1.0.0
msccl 2.3.0
msrest 0.7.1
msrestazure 0.6.4
multidict 6.0.4
multiprocess 0.70.15
ndg-httpsclient 0.5.1
nebulaml 0.16.5
nest-asyncio 1.5.8
networkx 3.1
ninja 1.10.2
nltk 3.8.1
numpy 1.22.2
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.14.1
onnxruntime-training 1.15.1+cu118
opencensus 0.11.2
opencensus-context 0.1.3
opencensus-ext-azure 1.1.9
opencensus-ext-logging 0.1.1
optree 0.9.2
orjson 3.9.10
packaging 23.0
pandas 2.0.3
paramiko 3.3.1
parso 0.8.3
pathspec 0.11.2
pathtools 0.1.2
peft 0.6.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.1
pip 23.2.1
pkginfo 1.9.6
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
pluggy 1.3.0
portalocker 2.8.2
prompt-toolkit 3.0.39
protobuf 3.20.3
psutil 5.8.0
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 5.0.0
py-spy 0.3.12
pyarrow 11.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycparser 2.21
pydantic 1.10.11
pydash 7.0.6
pydub 0.25.1
Pygments 2.16.1
PyJWT 2.8.0
PyNaCl 1.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
pytest 7.1.0
pytest-mpi 0.6
python-dateutil 2.8.2
python-multipart 0.0.6
pytorch-lightning 1.9.3
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
referencing 0.30.2
regex 2023.8.8
requests 2.31.0
requests-oauthlib 1.3.1
rich 13.5.3
rouge-chinese 1.0.3
rpds-py 0.10.3
rsa 4.9
ruamel.yaml 0.17.16
ruamel.yaml.clib 0.2.7
safetensors 0.3.3
scipy 1.10.1
SecretStorage 3.3.3
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.31.0
setproctitle 1.3.2
setuptools 67.6.0
shtab 1.6.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sqlparse 0.4.4
sse-starlette 1.6.5
stack-data 0.6.2
starlette 0.26.1
supervisor 4.2.5
sympy 1.12
tabulate 0.9.0
tbb 2021.10.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
texttable 1.6.7
tiktoken 0.5.1
tokenizers 0.13.3
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.0.1
torch-nebula 0.16.5
torch-ort 1.15.0
torch-tb-profiler 0.4.1
torchaudio 2.0.2+cu117
torchmetrics 0.11.3
torchsnapshot 0.1.0
torchvision 0.15.2+cu117
tornado 6.3.3
tqdm 4.62.3
traitlets 5.10.1
transformers 4.33.3
transformers-stream-generator 0.0.4
triton 2.0.0
trl 0.7.2
tutel 0.1
typing_extensions 4.8.0
tyro 0.5.12
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.24.0
wandb 0.15.11
wcwidth 0.2.6
websocket-client 1.6.3
websockets 11.0.3
Werkzeug 2.3.7
wheel 0.40.0
wrapt 1.12.1
xxhash 3.3.0
yarl 1.9.2
z3-solver 4.12.2.0
zipp 3.16.2
This problem is caused by mlflow: https://github.com/mlflow/mlflow/issues/3536
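One possible workaround, sketched below, is to stop the Hugging Face Trainer from attaching its MLflow callback at all. This is a sketch under stated assumptions, not a confirmed fix for this issue: `DISABLE_MLFLOW_INTEGRATION` is an environment variable read by the transformers MLflow integration, and `report_to` is a standard `TrainingArguments` field, but whether `src/train_bash.py` forwards `--report_to` unchanged is assumed here. The helper name `disable_mlflow_reporting` is hypothetical.

```python
import os


def disable_mlflow_reporting(argv):
    """Return a copy of argv that asks the Trainer not to report to any tracker.

    Hypothetical helper: it only prepares the environment and the argument list,
    it does not launch training.
    """
    # transformers treats mlflow as unavailable when this variable is "TRUE";
    # it must be set before the Trainer is constructed.
    os.environ["DISABLE_MLFLOW_INTEGRATION"] = "TRUE"

    args = list(argv)
    # Assumption: the training script parses standard TrainingArguments flags.
    if "--report_to" not in args:
        args += ["--report_to", "none"]
    return args


if __name__ == "__main__":
    cli = ["--stage", "dpo", "--do_train", "--output_dir", "output/test"]
    print(disable_mlflow_reporting(cli))
```

On the command line this corresponds to prefixing the launch command with `DISABLE_MLFLOW_INTEGRATION=TRUE` or appending `--report_to none`. Either one on its own should be enough to keep the MLflow callback out of the run.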
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --model_name_or_path model_path/qwen \
    --do_train \
    --dataset comparison_gpt4_en \
    --template chatml \
    --finetuning_type lora \
    --lora_target c_attn \
    --resume_lora_training False \
    --output_dir output/test \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16 \
    --overwrite_output_dir \
    --cutoff_len 1024
[INFO|trainer.py:1686] 2023-11-05 12:24:31,327 >> Running training
[INFO|trainer.py:1687] 2023-11-05 12:24:31,327 >> Num examples = 36,441
[INFO|trainer.py:1688] 2023-11-05 12:24:31,327 >> Num Epochs = 1
[INFO|trainer.py:1689] 2023-11-05 12:24:31,327 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1692] 2023-11-05 12:24:31,327 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1693] 2023-11-05 12:24:31,327 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1694] 2023-11-05 12:24:31,327 >> Total optimization steps = 2,277
[INFO|trainer.py:1695] 2023-11-05 12:24:31,328 >> Number of trainable parameters = 4,194,304
Traceback (most recent call last):
File "src/train_bash.py", line 14, in <module>
main()
File "src/train_bash.py", line 5, in main
run_exp()
File "/scratch/AzureBlobStorage_INPUT2/data/users/v-yulinli/workspace/LLaMA-Factory/src/llmtuner/tuner/tune.py", line 32, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/scratch/AzureBlobStorage_INPUT2/data/users/v-yulinli/workspace/LLaMA-Factory/src/llmtuner/tuner/dpo/workflow.py", line 54, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1752, in _inner_training_loop
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/integrations.py", line 1017, in on_train_begin
self.setup(args, state, model)
File "/home/aiscuser/.local/lib/python3.8/site-packages/transformers/integrations.py", line 1008, in setup
self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 755, in log_params
MlflowClient().log_batch(run_id=run_id, metrics=[], params=params_arr, tags=[])
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/client.py", line 1038, in log_batch
self._tracking_client.log_batch(run_id, metrics, params, tags)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 389, in log_batch
self.store.log_batch(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 323, in log_batch
self._call_endpoint(LogBatch, req_body)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 202, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 134, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'A given key of Parameters can not modify its value after it is set', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': 'a2aff95b090ece7a1949d522ac9b2325', 'request': '4878ba5982aa4ea5'}, 'Environment': 'southcentralus', 'Location': 'southcentralus', 'Time': '2023-11-05T12:24:32.3101524+00:00', 'ComponentName': 'mlflow', 'statusCode': 400, 'error_code': 'INVALID_PARAMETER_VALUE'}
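The rule stated in the error message ("A given key of Parameters can not modify its value after it is set") can be illustrated with a small stand-alone model. This is only a toy sketch of write-once parameter semantics, not MLflow's or Azure ML's actual implementation; the class `WriteOnceParams` and its method are hypothetical, and the assumption that re-logging an identical value is accepted is mine.

```python
class WriteOnceParams:
    """Toy model of a tracking store whose parameters cannot change once set."""

    def __init__(self):
        self._params = {}

    def log_param(self, key, value):
        # Parameters are compared as strings in this toy model.
        value = str(value)
        existing = self._params.get(key)
        if existing is not None and existing != value:
            raise ValueError(
                f"param {key!r} is already set to {existing!r}; "
                f"cannot change it to {value!r}"
            )
        self._params[key] = value


store = WriteOnceParams()
store.log_param("learning_rate", 1e-5)   # first write is accepted
store.log_param("learning_rate", 1e-5)   # same value again is accepted here

try:
    store.log_param("learning_rate", 5e-5)  # a different value is rejected
except ValueError as exc:
    print("rejected:", exc)
```

In the traceback above, the rejected write comes from the `log_params` call in the transformers MLflow callback's `setup`, which suggests that at least one of the keys it sent already existed in the run with a different value.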