Closed Candicepan closed 1 year ago
hengzi Give it to me
Submitting the job by configuring a KusciaJob failed. The task details are as follows:
kubectl get kt job-ss-lr-bc98909a3be3 -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2023-08-04T07:51:08Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: kuscia
    kuscia.secretflow/job-id: job-ss-lr
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: ss_lr_1
  name: job-ss-lr-bc98909a3be3
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-ss-lr
    uid: 71e82577-c8ec-44e2-aaae-25750c0d319c
  resourceVersion: "6642"
  uid: 4b65bdc4-ce9a-46ba-ab41-6dfb33c5e001
spec:
  initiator: alice
  parties:
  - appImageRef: ss-lr
    domainID: alice
    role: host
    template:
      spec: {}
  - appImageRef: ss-lr
    domainID: bob
    role: guest
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"name":"ss_lr_1","module_name":"ss_lr","output":[{"type":"dataset","key":"result"}],"role":{"host":["alice"],"guest":["bob"]},"initiator":{"role":"host","node_id":"alice"},"task_params":{"host":{"0":{"has_label":true,"name":"perfect_logit_a.csv","namespace":"data"}},"guest":{"0":{"has_label":false,"name":"perfect_logit_b.csv","namespace":"data"}},"common":{"skip_rows":1,"algo":"ss_lr","protocol_families":"ss","batch_size":21,"last_batch_policy":"discard","num_epoch":1,"l0_norm":0,"l1_norm":0,"l2_norm":0.5,"optimizer":"sgd","learning_rate":0.0001,"sigmoid_mode":"minimax_1","protocol":"semi2k","field":64,"fxp_bits":18,"trunc_mode":"probabilistic","shard_serialize_format":"raw","use_ttp":true,"ttp_server_host":"ttp-server:9449","ttp_session_id":"interconnection-root","ttp_adjust_rank":0}}}'
status:
  completionTime: "2023-08-04T07:51:20Z"
  conditions:
  - lastTransitionTime: "2023-08-04T07:51:08Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2023-08-04T07:51:08Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2023-08-04T07:51:20Z"
    status: "False"
    type: Success
  lastReconcileTime: "2023-08-04T07:51:20Z"
  message: The remaining non-failed parties counts 1 is less than the task success
    threshold 2
  partyTaskStatus:
  - domainID: alice
    message: Kuscia task failed
    phase: Failed
    role: host
  - domainID: bob
    phase: Failed
    role: guest
  phase: Failed
  podStatuses:
    alice/job-ss-lr-bc98909a3be3-host-0:
      namespace: alice
      nodeName: hzh-kuscia-autonomy-alice
      podName: job-ss-lr-bc98909a3be3-host-0
      podPhase: Failed
    bob/job-ss-lr-bc98909a3be3-guest-0:
      namespace: bob
      podName: job-ss-lr-bc98909a3be3-guest-0
      podPhase: Failed
      reason: Error
      terminationLog: 'container[ss-lr] terminated state reason "Error", message:
        "[2023-08-04 07:51:11.359] [info] [util.cc:52] create link context for blackbox
        failed: [Enforce fail at external/yacl/yacl/link/factory_brpc_blackbox.cc:64]
        iter != party_info.end(). cannot find config.self_role: in ENV(config.node_id.*)\nStacktrace:\n#0
        yacl::link::FactoryBrpcBlackBox::GetPartyNodeInfoFromEnv()+0x55c0cf29392f\n#1
        ic_impl::util::CreateLinkContextForBlackBox()+0x55c0cece3066\n#2 ic_impl::util::MakeLink()+0x55c0cece34c5\n#3
        ic_impl::CreateIcContext()+0x55c0cece1989\n#4 main+0x55c0ce9f9194\n#5 (unknown)+0x7fb00e0ebd90\n\nI0804
        07:51:11.376988 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1113]
        Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=9530.\nI0804
        07:51:11.379607 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1116]
        Check out http://job-ss-lr-bc98909a3be3-guest-0:9530 in web browser.\nI0804
        07:51:11.482255 71 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345]
        Checking Socket{id=0 addr=127.0.0.1:9531} (0x55c0d2405e80)\nI0804 07:51:20.441656 7
        external/com_github_brpc_brpc/src/brpc/server.cpp:1173] Server[yacl::link::internal::ReceiverServiceImpl]
        is going to quit\n[2023-08-04 07:51:20.443] [error] [channel.h:112] ChannelBase
        destructor is called before WaitLinkTaskFinish, try stop send thread\n[2023-08-04
        07:51:20.443] [error] [ic_main.cc:32] run failed: [external/yacl/yacl/link/context.cc:167]
        connect to mesh failed, failed to setup connection to rank=1\n"'
  startTime: "2023-08-04T07:51:08Z"
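The termination log reports an empty `config.self_role` when looking through `ENV(config.node_id.*)`, and both pod templates in this task are `spec: {}`, so no such variables appear to have been injected. A minimal sketch of that kind of lookup, assuming the environment is given as a plain dict, is shown here. The function name `check_link_env` and the sample values are hypothetical, and this is not the actual yacl implementation, only an illustration of what the error message suggests.

```python
def check_link_env(env):
    """Return a list of problems with the role-related variables in `env`.

    Hypothetical helper: it mimics the lookup hinted at by the error
    message, not the real yacl code.
    """
    problems = []
    self_role = env.get("config.self_role", "")
    if not self_role:
        problems.append("config.self_role is missing or empty")
        return problems
    node_key = f"config.node_id.{self_role}"
    if node_key not in env:
        problems.append(f"{node_key} is not set for self_role {self_role!r}")
    return problems


# An empty pod template (template.spec is {}) injects no variables at all.
print(check_link_env({}))

# Illustrative values following the config.node_id.* naming pattern.
sample_env = {
    "config.self_role": "host.0",
    "config.node_id.host.0": "alice",
    "config.node_id.guest.0": "bob",
}
print(check_link_env(sample_env))
```

Running a check like this against the environment of a failing container could confirm whether the role variables were injected before looking further into the link setup.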
Submitting the job through the create-job API of the UnionPay BFIA protocol also failed:
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2023-08-04T07:57:35Z"
  generation: 2
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: bfia
    kuscia.secretflow/job-id: job-ss-lr
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: ss_lr_1
    kuscia.secretflow/task-unschedulable: "false"
  name: job-ss-lr-418fd95370ea
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-ss-lr
    uid: acedbd94-b554-4300-9af2-d96e35e4cc08
  resourceVersion: "7466"
  uid: 91bb4a62-be69-4751-b3aa-218debaa8f6f
spec:
  initiator: alice
  parties:
  - appImageRef: ss-lr
    domainID: alice
    role: host
    template:
      spec:
        containers:
        - env:
          - name: config.task_id
            value: job-ss-lr-418fd95370ea
          - name: config.session_id
            value: session_job-ss-lr-418fd95370ea
          - name: config.trace_id
            value: trace_job-ss-lr-418fd95370ea
          - name: config.token
            value: token_job-ss-lr-418fd95370ea
          - name: runtime.component.name
            value: ss_lr
          - name: config.self_role
            value: host.0
          - name: config.inst_id.host.0
            value: alice
          - name: config.node_id.host.0
            value: alice
          - name: config.inst_id.guest.0
            value: bob
          - name: config.node_id.guest.0
            value: bob
          - name: runtime.component.parameter.protocol
            value: semi2k
          - name: runtime.component.parameter.protocol_families
            value: ss
          - name: runtime.component.parameter.shard_serialize_format
            value: raw
          - name: runtime.component.parameter.use_ttp
            value: "true"
          - name: runtime.component.parameter.ttp_server_host
            value: ttp-server:9449
          - name: runtime.component.parameter.algo
            value: ss_lr
          - name: runtime.component.parameter.field
            value: "64"
          - name: runtime.component.parameter.namespace
            value: data
          - name: runtime.component.parameter.l2_norm
            value: "0.5"
          - name: runtime.component.parameter.trunc_mode
            value: probabilistic
          - name: runtime.component.parameter.num_epoch
            value: "1"
          - name: runtime.component.parameter.sigmoid_mode
            value: minimax_1
          - name: runtime.component.parameter.ttp_session_id
            value: interconnection-root
          - name: runtime.component.parameter.name
            value: perfect_logit_a.csv
          - name: runtime.component.parameter.has_label
            value: "true"
          - name: runtime.component.parameter.skip_rows
            value: "1"
          - name: runtime.component.parameter.fxp_bits
            value: "18"
          - name: runtime.component.parameter.ttp_adjust_rank
            value: "0"
          - name: runtime.component.parameter.batch_size
            value: "21"
          - name: runtime.component.parameter.learning_rate
            value: "0.0001"
          - name: runtime.component.parameter.l0_norm
            value: "0"
          - name: runtime.component.parameter.last_batch_policy
            value: discard
          - name: runtime.component.parameter.l1_norm
            value: "0"
          - name: runtime.component.parameter.optimizer
            value: sgd
          - name: runtime.component.input.train_data
            value: '{"namespace":"data","name":"perfect_logit_a.csv"}'
          - name: runtime.component.output.train_data
            value: '{"namespace":"job-ss-lr-host-0","name":"job-ss-lr-418fd95370ea-result"}'
          name: ""
          resources: {}
  - appImageRef: ss-lr
    domainID: bob
    role: guest
    template:
      spec:
        containers:
        - env:
          - name: config.task_id
            value: job-ss-lr-418fd95370ea
          - name: config.session_id
            value: session_job-ss-lr-418fd95370ea
          - name: config.trace_id
            value: trace_job-ss-lr-418fd95370ea
          - name: config.token
            value: token_job-ss-lr-418fd95370ea
          - name: runtime.component.name
            value: ss_lr
          - name: config.inst_id.host.0
            value: alice
          - name: config.node_id.host.0
            value: alice
          - name: config.self_role
            value: guest.0
          - name: config.inst_id.guest.0
            value: bob
          - name: config.node_id.guest.0
            value: bob
          - name: runtime.component.parameter.name
            value: perfect_logit_b.csv
          - name: runtime.component.parameter.algo
            value: ss_lr
          - name: runtime.component.parameter.last_batch_policy
            value: discard
          - name: runtime.component.parameter.num_epoch
            value: "1"
          - name: runtime.component.parameter.ttp_session_id
            value: interconnection-root
          - name: runtime.component.parameter.field
            value: "64"
          - name: runtime.component.parameter.l2_norm
            value: "0.5"
          - name: runtime.component.parameter.ttp_server_host
            value: ttp-server:9449
          - name: runtime.component.parameter.skip_rows
            value: "1"
          - name: runtime.component.parameter.l0_norm
            value: "0"
          - name: runtime.component.parameter.learning_rate
            value: "0.0001"
          - name: runtime.component.parameter.shard_serialize_format
            value: raw
          - name: runtime.component.parameter.namespace
            value: data
          - name: runtime.component.parameter.fxp_bits
            value: "18"
          - name: runtime.component.parameter.trunc_mode
            value: probabilistic
          - name: runtime.component.parameter.ttp_adjust_rank
            value: "0"
          - name: runtime.component.parameter.protocol_families
            value: ss
          - name: runtime.component.parameter.use_ttp
            value: "true"
          - name: runtime.component.parameter.has_label
            value: "false"
          - name: runtime.component.parameter.protocol
            value: semi2k
          - name: runtime.component.parameter.sigmoid_mode
            value: minimax_1
          - name: runtime.component.parameter.l1_norm
            value: "0"
          - name: runtime.component.parameter.optimizer
            value: sgd
          - name: runtime.component.parameter.batch_size
            value: "21"
          - name: runtime.component.input.train_data
            value: '{"namespace":"data","name":"perfect_logit_b.csv"}'
          - name: runtime.component.output.train_data
            value: '{"namespace":"job-ss-lr-guest-0","name":"job-ss-lr-418fd95370ea-result"}'
          name: ""
          resources: {}
  scheduleConfig: {}
  taskInputConfig: '{"name":"ss_lr_1","module_name":"ss_lr","output":[{"type":"dataset","key":"result"}],"role":{"host":["alice"],"guest":["bob"]},"initiator":{"role":"host","node_id":"alice"},"task_params":{"host":{"0":{"has_label":true,"name":"perfect_logit_a.csv","namespace":"data"}},"guest":{"0":{"has_label":false,"name":"perfect_logit_b.csv","namespace":"data"}},"common":{"algo":"ss_lr","batch_size":21,"field":64,"fxp_bits":18,"l0_norm":0,"l1_norm":0,"l2_norm":0.5,"last_batch_policy":"discard","learning_rate":0.0001,"num_epoch":1,"optimizer":"sgd","protocol":"semi2k","protocol_families":"ss","shard_serialize_format":"raw","sigmoid_mode":"minimax_1","skip_rows":1,"trunc_mode":"probabilistic","ttp_adjust_rank":0,"ttp_server_host":"ttp-server:9449","ttp_session_id":"interconnection-root","use_ttp":true}}}'
status:
  completionTime: "2023-08-04T07:58:06Z"
  conditions:
  - lastTransitionTime: "2023-08-04T07:57:35Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2023-08-04T07:57:35Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2023-08-04T07:58:06Z"
    status: "False"
    type: Success
  lastReconcileTime: "2023-08-04T07:58:06Z"
  message: The remaining number of parties 1 is less than the schedulable threshold
    2
  phase: Failed
  podStatuses:
    alice/job-ss-lr-418fd95370ea-host-0:
      message: reserved task resources belonging to the task resource group "job-ss-lr-418fd95370ea"
        doesn't meet the minReservedMembers, task resource "bob/job-ss-lr-418fd95370ea-f8113c8ec66f"
        phase is "Reserving"
      namespace: alice
      podName: job-ss-lr-418fd95370ea-host-0
      podPhase: Failed
      reason: Unschedulable
    bob/job-ss-lr-418fd95370ea-guest-0:
      namespace: bob
      podName: job-ss-lr-418fd95370ea-guest-0
      podPhase: Failed
      reason: TaskResourceGroupPhaseFailed
  startTime: "2023-08-04T07:57:35Z"
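When comparing failures like these, it can help to reduce the `status` block to one line per failed pod. The following is a minimal sketch, assuming the status has already been loaded into a Python dict (for example from `kubectl get kt ... -o json`). The helper name `failed_pods` is hypothetical, and the sample dict only copies the `podPhase` and `reason` fields from the output above.

```python
def failed_pods(status):
    """Return (pod, reason) for every pod whose podPhase is "Failed"."""
    rows = []
    for pod, info in sorted(status.get("podStatuses", {}).items()):
        if info.get("podPhase") == "Failed":
            rows.append((pod, info.get("reason", "")))
    return rows


# Subset of the KusciaTask status shown above, written as a dict.
status = {
    "phase": "Failed",
    "podStatuses": {
        "alice/job-ss-lr-418fd95370ea-host-0": {
            "podPhase": "Failed",
            "reason": "Unschedulable",
        },
        "bob/job-ss-lr-418fd95370ea-guest-0": {
            "podPhase": "Failed",
            "reason": "TaskResourceGroupPhaseFailed",
        },
    },
}

for pod, reason in failed_pods(status):
    print(f"{pod}: {reason or 'no reason reported'}")
```

Here the summary makes it visible that neither pod actually ran: one is reported as unschedulable and the other failed with its task resource group.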
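Hi @hengzi, thanks for the feedback. This job does currently have a problem. We will update the Kuscia image, and once it is ready I will @ you here so that you can retest against the latest Kuscia image.
Hi @hengzi, the official Kuscia image has been updated. Please update your local image with the docker pull command.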
I reinstalled the nodes following the documentation: https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans/tutorial/run_bfia_job_cn
Verification succeeded.
alice node container
bob node container
Hi @hengzi, did you verify both of the following methods?
- Submitting the job by configuring a KusciaJob
- Submitting the job through the UnionPay BFIA protocol API
Both methods succeeded.
Hi @hengzi, thank you for verifying.
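Task description
Detailed requirements
Based on Kuscia master, please verify whether the procedures and scripts in the document above run successfully. While deploying the scripts, you are also welcome to correct the document's wording (sentence fluency and typos).
Skill requirements
Operation guide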