hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

LoRA fine-tuning works on a single Ascend NPU, but fails with multiple NPUs #5638

Open leoneyar opened 2 days ago

leoneyar commented 2 days ago

Reminder

System Info

Reproduction

RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:64
[ERROR] 2024-10-09-10:06:30 (PID:1506967, Device:1, RankID:1) ERR02200 DIST call hccl api failed.
EC0010: Failed to import Python module [ModuleNotFoundError: No module named 'tbe.common.repository_manager.utils.repository_manager_log'.].
Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
TraceBack (most recent call last):
[GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1623]
[SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:85]
[SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:126]
[FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:124]
PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:96]
OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:237]
GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:165]
[Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
Getting socket times out. Reason:
1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal.
2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster.
3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
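
The EC0010 line in the log reports that a `tbe` Python module could not be imported, and the log's own suggested fix is to check the Python path and run set_env.sh. A minimal sketch for checking this from inside each worker process is shown here. It only uses the Python standard library; the module name is copied from the log, `LOCAL_RANK` is the variable that `torchrun` sets for each worker, and the helper names `module_available` and `environment_report` are hypothetical. It is a diagnostic aid under those assumptions, not a confirmed fix for this issue.

```python
import importlib.util
import os
import sys

# Module name copied verbatim from the EC0010 line of the log.
TBE_MODULE = "tbe.common.repository_manager.utils.repository_manager_log"


def module_available(name: str) -> bool:
    """Return True if this interpreter can locate the module `name`."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (for example `tbe` itself) raises
        # ModuleNotFoundError instead of returning None.
        return False


def environment_report() -> dict:
    """Collect the values the log's 'Solution' text asks to verify."""
    return {
        "local_rank": os.environ.get("LOCAL_RANK", "unset"),
        "python": sys.executable,
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "tbe_module_found": module_available(TBE_MODULE),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running the script once directly and once through the same launcher used for multi-NPU training (for example `torchrun --nproc_per_node=2` on a hypothetical `check_tbe.py`) would show whether the worker processes see a different interpreter or `PYTHONPATH` than the single-NPU run does.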

Expected behavior

No response

Others

No response

codemayq commented 1 day ago

Is this a multi-node setup? It looks like the device state itself is abnormal and the connection was never established.

leoneyar commented 1 day ago

Is this a multi-node setup? It looks like the device state itself is abnormal and the connection was never established.

Single node, multiple NPUs.