hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

On Ascend 910B cards, only single-GPU LoRA fine-tuning for (continued) pre-training runs successfully; every other task fails with the identical error below. Please help look into the cause, thanks. #3670

Closed guoyjalihy closed 3 months ago

guoyjalihy commented 4 months ago

Reminder

Reproduction

Generating train split: 0 examples [00:00, ? examples/s] Generating train split: 12859 examples [00:00, 16740.09 examples/s] Generating train split: 12859 examples [00:00, 16567.62 examples/s]

Converting format of dataset (num_proc=16): 0%| | 0/1000 [00:00<?, ? examples/s] Converting format of dataset (num_proc=16): 100%|██████████| 1000/1000 [00:00<00:00, 5854.70 examples/s]

Running tokenizer on dataset (num_proc=16): 0%| | 0/1000 [00:00<?, ? examples/s] Running tokenizer on dataset (num_proc=16): 6%|▋ | 63/1000 [00:00<00:10, 93.61 examples/s] Running tokenizer on dataset (num_proc=16): 13%|█▎ | 126/1000 [00:00<00:04, 176.72 examples/s] Running tokenizer on dataset (num_proc=16): 19%|█▉ | 189/1000 [00:00<00:03, 245.79 examples/s] Running tokenizer on dataset (num_proc=16): 25%|██▌ | 252/1000 [00:01<00:02, 299.30 examples/s] Running tokenizer on dataset (num_proc=16): 32%|███▏ | 315/1000 [00:01<00:01, 344.65 examples/s] Running tokenizer on dataset (num_proc=16): 38%|███▊ | 378/1000 [00:01<00:01, 381.27 examples/s] Running tokenizer on dataset (num_proc=16): 44%|████▍ | 441/1000 [00:01<00:01, 407.50 examples/s] Running tokenizer on dataset (num_proc=16): 50%|█████ | 504/1000 [00:01<00:01, 422.60 examples/s] Running tokenizer on dataset (num_proc=16): 57%|█████▋ | 566/1000 [00:01<00:00, 436.37 examples/s] Running tokenizer on dataset (num_proc=16): 63%|██████▎ | 628/1000 [00:01<00:00, 449.07 examples/s] Running tokenizer on dataset (num_proc=16): 69%|██████▉ | 690/1000 [00:02<00:00, 454.77 examples/s] Running tokenizer on dataset (num_proc=16): 75%|███████▌ | 752/1000 [00:02<00:00, 457.53 examples/s] Running tokenizer on dataset (num_proc=16): 88%|████████▊ | 876/1000 [00:02<00:00, 514.74 examples/s] Running tokenizer on dataset (num_proc=16): 94%|█████████▍| 938/1000 [00:02<00:00, 506.34 examples/s] Running tokenizer on dataset (num_proc=16): 100%|██████████| 1000/1000 [00:02<00:00, 500.64 examples/s] Running tokenizer on dataset (num_proc=16): 100%|██████████| 1000/1000 [00:02<00:00, 371.30 examples/s] [INFO|configuration_utils.py:726] 2024-05-10 02:40:36,773 >> loading configuration file /data/mlops/models/Meta-Llama-3-8B-Instruct/config.json [INFO|configuration_utils.py:791] 2024-05-10 02:40:36,774 >> Model config LlamaConfig { "_name_or_path": "/data/mlops/models/Meta-Llama-3-8B-Instruct", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.38.2", "use_cache": true, "vocab_size": 128256 }

[INFO|modeling_utils.py:3254] 2024-05-10 02:40:36,801 >> loading weights file /data/mlops/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json [INFO|modeling_utils.py:1400] 2024-05-10 02:40:36,803 >> Instantiating LlamaForCausalLM model under default dtype torch.float16. [INFO|configuration_utils.py:845] 2024-05-10 02:40:36,804 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001 }

prompt_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 2675, 690, 387, 2728, 264, 7419, 315, 264, 3465, 1176, 11, 1243, 1063, 1988, 315, 279, 3465, 627, 2028, 3465, 374, 922, 1701, 279, 5300, 11914, 323, 34537, 279, 11914, 311, 12027, 7817, 24686, 320, 49, 5375, 8, 24657, 2641, 315, 279, 1376, 320, 11760, 11, 25269, 1665, 570, 578, 69499, 24657, 2641, 8066, 2011, 387, 1778, 430, 279, 24657, 2641, 30357, 12602, 279, 6070, 323, 53794, 315, 279, 1988, 11914, 13, 578, 1988, 374, 264, 11914, 323, 279, 2612, 374, 264, 1160, 315, 24657, 2641, 315, 279, 1376, 510, 11760, 11, 25269, 11, 1665, 60, 430, 12602, 279, 12135, 3118, 304, 279, 11914, 13, 3277, 264, 11914, 706, 810, 1109, 220, 16, 69499, 99809, 3284, 11, 279, 2612, 2011, 6782, 682, 315, 1124, 382, 32, 6897, 42262, 320, 309, 36306, 22367, 82, 5015, 374, 18707, 29836, 1611, 2057, 1247, 316, 267, 1405, 42262, 30160, 16192, 1101, 1514, 627, 5207, 25, 128009, 128006, 78191, 128007, 271] prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

You will be given a definition of a task first, then some input of the task. This task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.

AFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play. Output:<|eot_id|><|start_header_id|>assistant<|end_header_id|>

chosen_ids: [9837, 220, 4482, 32, 6897, 42262, 320, 309, 36306, 11844, 330, 4752, 5015, 498, 330, 59837, 29836, 1611, 2057, 1247, 316, 267, 8257, 220, 4482, 41177, 30160, 16192, 498, 330, 28897, 520, 498, 330, 59837, 29836, 1611, 2057, 1247, 316, 267, 7171, 60, 128009] chosen: [ ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"], ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"] ]<|eot_id|> rejected_ids: [40914, 11, 358, 4265, 387, 6380, 311, 1520, 0, 5810, 527, 279, 69499, 24657, 2641, 369, 279, 1988, 11914, 1473, 23335, 6897, 42262, 320, 309, 36306, 705, 706, 31814, 11, 18707, 29836, 1611, 2057, 1247, 316, 267, 933, 23335, 14858, 30160, 16192, 11, 11335, 1688, 11, 18707, 29836, 1611, 2057, 1247, 316, 267, 2595, 70869, 1473, 9, 64636, 42262, 320, 309, 36306, 8, 374, 279, 3917, 315, 279, 1176, 99809, 11, 323, 706, 31814, 374, 279, 25269, 430, 16964, 279, 5133, 1990, 64636, 42262, 320, 309, 36306, 8, 323, 18707, 29836, 1611, 2057, 1247, 316, 267, 627, 9, 42262, 30160, 16192, 374, 279, 3917, 315, 279, 2132, 99809, 11, 323, 11335, 1688, 374, 279, 25269, 430, 16964, 279, 5133, 1990, 42262, 30160, 16192, 323, 18707, 29836, 1611, 2057, 1247, 316, 267, 382, 9290, 430, 1070, 1253, 387, 1023, 3284, 69499, 24657, 2641, 430, 1436, 387, 14592, 505, 279, 1988, 11914, 11, 719, 279, 3485, 24657, 2641, 12602, 279, 1925, 12135, 3118, 304, 279, 11914, 13, 128009] rejected: Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:

[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst] [Ajax Youth Academy, playsAt, Sportpark De Toekomst]

Explanation:

Note that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.<|eot_id|>

Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 25%|██▌ | 1/4 [00:12<00:36, 12.20s/it] Loading checkpoint shards: 50%|█████ | 2/4 [00:20<00:19, 9.61s/it] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:28<00:09, 9.05s/it] Loading checkpoint shards: 100%|██████████| 4/4 [00:29<00:00, 6.08s/it] Loading checkpoint shards: 100%|██████████| 4/4 [00:29<00:00, 7.47s/it] [INFO|modeling_utils.py:3992] 2024-05-10 02:41:06,863 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4000] 2024-05-10 02:41:06,863 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/mlops/models/Meta-Llama-3-8B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:798] 2024-05-10 02:41:06,868 >> loading configuration file /data/mlops/models/Meta-Llama-3-8B-Instruct/generation_config.json [INFO|configuration_utils.py:845] 2024-05-10 02:41:06,868 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": [ 128001, 128009 ], "max_length": 4096, "temperature": 0.6, "top_p": 0.9 }

05/10/2024 02:41:06 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled. 05/10/2024 02:41:06 - INFO - llmtuner.model.utils.attention - Using torch SDPA for faster training and inference. 05/10/2024 02:41:06 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA 05/10/2024 02:41:08 - INFO - llmtuner.model.loader - trainable params: 3407872 || all params: 8033669120 || trainable%: 0.0424 Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [INFO|trainer.py:601] 2024-05-10 02:41:08,095 >> Using auto half precision backend [INFO|trainer.py:1812] 2024-05-10 02:41:09,500 >> Running training [INFO|trainer.py:1813] 2024-05-10 02:41:09,500 >> Num examples = 900 [INFO|trainer.py:1814] 2024-05-10 02:41:09,501 >> Num Epochs = 3 [INFO|trainer.py:1815] 2024-05-10 02:41:09,501 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1818] 2024-05-10 02:41:09,501 >> Total train batch size (w. parallel, distributed & accumulation) = 8 [INFO|trainer.py:1819] 2024-05-10 02:41:09,501 >> Gradient Accumulation steps = 8 [INFO|trainer.py:1820] 2024-05-10 02:41:09,501 >> Total optimization steps = 336 [INFO|trainer.py:1821] 2024-05-10 02:41:09,502 >> Number of trainable parameters = 3,407,872

0%| | 0/336 [00:00<?, ?it/s][W NeKernelNpu.cpp:45] Warning: The oprator of ne is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())

[W AclInterface.cpp:181] Warning: 0Failed to find function aclrtCreateEventExWithFlag (function operator()) [W OpCommand.cpp:88] Warning: [Check][offset] Check input storage_offset[%ld] = 0 failed, result is untrustworthy64 (function operator())

\ EZ9999: Inner Error! EZ9999 Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677] TraceBack (most recent call last): The error from device(0), serial number is 1, there is an aicore error, core id is 0, error code = 0x800000, dump info: pc start: 0x10001240801d3000, current: 0x1240801d3268, vec error info: 0x13cf485c, mte error info: 0x60c41a2, ifu error info: 0x3ef5712eeaa00, ccu error info: 0, cube error info: 0x1a, biu error info: 0, aic error mask: 0x65000200d000288, para base: 0x1241000ba400, errorStr: The DDR address of the MTE instruction is out of range.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:535] The extend info from device(0), serial number is 1, there is aicore error, core id is 0, aicore int: 0x81, aicore error2: 0, axi clamp ctrl: 0, axi clamp state: 0x1717, biu status0: 0x101e44800000000, biu status1: 0x940002092a0000, clk gate mask: 0x1000, dbg addr: 0, ecc en: 0, mte ccu ecc 1bit error: 0, vector cube ecc 1bit error: 0, run stall: 0x1, dbg data0: 0, dbg data1: 0, dbg data2: 0, dbg data3: 0, dfx data: 0[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:566] The dha(mata) info from device(0), dha id is 0, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 1, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 2, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 3, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 4, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 5, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 6, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The dha(mata) info from device(0), dha id is 7, dha status 1 info:0xf[FUNC:ProcessCoreErrorInfo][FILE:device_error_proc.cc][LINE:656] The device(0), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:589] coreId( 0): 0x800000 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:603] Aicore kernel execute failed, device_id=0, stream_id=3, report_stream_id=3, task_id=10985, flip_num=0, fault kernelname=72/82-1_72_Index73_tvmbin, program id=81, hash=11041528171287740400.[FUNC:GetError][FILE:stream.cc][LINE:1454] [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454] Failed to submit kernel task, retCode=0x7100002.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:632] kernel launch submit failed.[FUNC:LaunchKernel][FILE:context.cc][LINE:732] rtKernelLaunchWithFlagV2 execute failed, reason=[task new memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] Call rtKernelLaunchWithFlagV2(stubfunc, blockdim, &argsex, smdesc, stream, 0U, &cfg) fail, ret: 0x32899[FUNC:DoLaunchKernelWithArgsEx][FILE:op_task.cc][LINE:791] Call static_cast(DoLaunchKernelWithArgsEx(stream)) fail, ret: 0x32899[FUNC:DoLaunchKernel][FILE:optask.cc][LINE:781] invoke rtKernelLaunch failed, ret = 207001, task = 189/215-1_189_MatMul190_tvmbin[FUNC:LaunchKernel][FILE:op_task.cc][LINE:394] [Exec][Op]Execute op failed. 
op type = MatMul, ge result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

DEVICE[0] PID[358]: EXCEPTION STREAM: Exception info:TGID=196191, model id=65535, stream id=3, stream phase=3 Message info[0]:RTS_HWTS: aicore exception, slot_id=31, stream_id=3 Other info[0]:time=2024-05-10-10:44:52.267.860, function=int_process_hwts_task_exception, line=296, error code=0x26 Traceback (most recent call last): File "/usr/local/bin/llamafactory-cli", line 8, in sys.exit(main()) File "/data/mlops/training/LLaMA-Factory-0.7.0/src/llmtuner/cli.py", line 49, in main run_exp() File "/data/mlops/training/LLaMA-Factory-0.7.0/src/llmtuner/train/tuner.py", line 39, in run_exp run_dpo(model_args, data_args, training_args, finetuning_args, callbacks) File "/data/mlops/training/LLaMA-Factory-0.7.0/src/llmtuner/train/dpo/workflow.py", line 61, in run_dpo train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 1624, in train return inner_training_loop( File "/usr/local/lib/python3.9/dist-packages/transformers/trainer.py", line 1963, in _inner_training_loop if ( RuntimeError: The Inner error is reported as above. Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1. [ERROR] 2024-05-10-02:44:52 (PID:358, Device:0, RankID:-1) ERR00100 PTA call acl api failed [W NPUStream.cpp:382] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeUsedDevices) [W NPUStream.cpp:365] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeDevice) [W NPUStream.cpp:365] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeDevice) [W NPUStream.cpp:365] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! 
rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeDevice) [W NPUStream.cpp:365] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeDevice)

0%| | 0/336 [03:47<?, ?it/s]

Expected behavior

Command executed: CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml, expecting it to run to completion without errors. Currently, only CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml runs successfully; every other task fails with the same error.
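For reference, the RuntimeError in the log above notes that the stack trace may be inaccurate because NPU kernels launch asynchronously; a minimal sketch of a diagnostic re-run with synchronous launches (same example config, single visible device assumed):

```bash
# Make NPU kernel launches synchronous so the Python stack trace points at the
# operator that actually fails (as suggested by the error message itself).
export ASCEND_LAUNCH_BLOCKING=1
llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
```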

System Info

Others

After running the script multiple times, I found the following pattern: with each run, the number after "The extend info from device(0), serial number is" in the error log increments by one, and it does not reset to 0 even after the pod is deleted and restarted.

guoyjalihy commented 4 months ago

Additional info on the torch-npu versions: torch 2.2.0+cpu, torch-npu 2.2.0
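A quick sanity check that this torch / torch-npu pairing can actually see the device might look like the following (a sketch; `torch.npu` is the namespace torch_npu registers when imported):

```bash
# Print torch and torch_npu versions and whether an NPU device is visible
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__, torch.npu.is_available())"
```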

codemayq commented 4 months ago

Ascend users can join this group for further discussion: (QR code image: img_v3_02ap_12409f4b-3caf-41a0-9d86-d823a9b9cfag)

guoyjalihy commented 4 months ago

New finding: training fails when the job is launched in a pod started via k8s, but running directly on the host machine lets every single-card training task complete. In the k8s yaml I configured `env:

hiyouga commented 4 months ago

ASCEND_RT_VISIBLE_DEVICES
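That is, on Ascend NPUs device visibility is controlled by ASCEND_RT_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES; a sketch of the adjusted launch, reusing the command from the report:

```bash
ASCEND_RT_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
```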

learnArmy commented 4 months ago

Has this been resolved? I've run into a similar problem: EZ9999: Inner Error! In my case, the fault kernel_name=00__11_EPNET/GatherV2

guoyjalihy commented 4 months ago

> Has this been resolved? I've run into a similar problem: EZ9999: Inner Error! In my case, the fault kernel_name=00__11_EPNET/GatherV2

Not resolved. Adding ASCEND_RT_VISIBLE_DEVICES didn't help either; on k8s it fails no matter how I run it, but it works on the host machine.

976311200 commented 4 months ago

I've discussed this with Huawei: the yaml needs the Ascend container runtime configured, and the yaml Huawei gave us additionally specifies the volcano scheduler and a nodeSelector.
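For others hitting this on k8s, a hypothetical fragment of what such a pod spec could look like; the scheduler, node label, image and resource names below are illustrative placeholders, not the yaml Huawei provided:

```yaml
# Hypothetical sketch of an Ascend training pod; adapt names and values to your cluster.
# Nodes are assumed to have the Ascend container runtime and device plugin installed.
apiVersion: v1
kind: Pod
metadata:
  name: llamafactory-npu-train
spec:
  schedulerName: volcano                 # volcano scheduler, as mentioned above
  nodeSelector:
    accelerator: huawei-ascend-910b      # illustrative label; use your cluster's node label
  containers:
    - name: trainer
      image: your-llamafactory-npu-image # placeholder image with CANN + torch-npu installed
      command: ["llamafactory-cli", "train", "examples/lora_single_gpu/llama3_lora_dpo.yaml"]
      env:
        - name: ASCEND_RT_VISIBLE_DEVICES
          value: "0"
      resources:
        limits:
          huawei.com/Ascend910: 1        # NPU resource exposed by the Ascend device plugin
```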

renllll commented 3 months ago

May I ask, have you managed to get multi-NPU (continued) pre-training running successfully?

Xavier-123 commented 3 months ago

> Ascend users can join this group for further discussion: (QR code image: img_v3_02ap_12409f4b-3caf-41a0-9d86-d823a9b9cfag)

Could you update the QR code?

paul-yangmy commented 3 months ago

> Ascend users can join this group for further discussion: (QR code image: img_v3_02ap_12409f4b-3caf-41a0-9d86-d823a9b9cfag)
>
> Could you update the QR code?

+1, please update the QR code.

T-freedom commented 2 months ago

How did you end up solving this problem?