secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Verify the Kuscia tutorial "How to Run an Interconnection UnionPay BFIA Protocol Job", including the documented procedure and the bfia scripts #25

Closed Candicepan closed 1 year ago

Candicepan commented 1 year ago

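This issue is a task in phase 2 of the SecretFlow Open Source Contribution Plan (SF OSCP). Community developers are welcome to take part. If you would like to claim a task but have not signed up yet, please complete the sign-up first.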

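Task description

Detailed requirements

Based on Kuscia master, verify that the procedure and scripts described in the document above run successfully. While deploying the scripts, you are also welcome to correct the document's wording (sentence fluency and typos).

Required skills

Operation guide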

hengzi commented 1 year ago

hengzi Give it to me

hengzi commented 1 year ago

Submitting the job by configuring a KusciaJob failed. The task details are as follows:

kubectl get kt job-ss-lr-bc98909a3be3 -o yaml

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2023-08-04T07:51:08Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: kuscia
    kuscia.secretflow/job-id: job-ss-lr
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: ss_lr_1
  name: job-ss-lr-bc98909a3be3
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-ss-lr
    uid: 71e82577-c8ec-44e2-aaae-25750c0d319c
  resourceVersion: "6642"
  uid: 4b65bdc4-ce9a-46ba-ab41-6dfb33c5e001
spec:
  initiator: alice
  parties:
  - appImageRef: ss-lr
    domainID: alice
    role: host
    template:
      spec: {}
  - appImageRef: ss-lr
    domainID: bob
    role: guest
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"name":"ss_lr_1","module_name":"ss_lr","output":[{"type":"dataset","key":"result"}],"role":{"host":["alice"],"guest":["bob"]},"initiator":{"role":"host","node_id":"alice"},"task_params":{"host":{"0":{"has_label":true,"name":"perfect_logit_a.csv","namespace":"data"}},"guest":{"0":{"has_label":false,"name":"perfect_logit_b.csv","namespace":"data"}},"common":{"skip_rows":1,"algo":"ss_lr","protocol_families":"ss","batch_size":21,"last_batch_policy":"discard","num_epoch":1,"l0_norm":0,"l1_norm":0,"l2_norm":0.5,"optimizer":"sgd","learning_rate":0.0001,"sigmoid_mode":"minimax_1","protocol":"semi2k","field":64,"fxp_bits":18,"trunc_mode":"probabilistic","shard_serialize_format":"raw","use_ttp":true,"ttp_server_host":"ttp-server:9449","ttp_session_id":"interconnection-root","ttp_adjust_rank":0}}}'
status:
  completionTime: "2023-08-04T07:51:20Z"
  conditions:
  - lastTransitionTime: "2023-08-04T07:51:08Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2023-08-04T07:51:08Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2023-08-04T07:51:20Z"
    status: "False"
    type: Success
  lastReconcileTime: "2023-08-04T07:51:20Z"
  message: The remaining non-failed parties counts 1 is less than the task success
    threshold 2
  partyTaskStatus:
  - domainID: alice
    message: Kuscia task failed
    phase: Failed
    role: host
  - domainID: bob
    phase: Failed
    role: guest
  phase: Failed
  podStatuses:
    alice/job-ss-lr-bc98909a3be3-host-0:
      namespace: alice
      nodeName: hzh-kuscia-autonomy-alice
      podName: job-ss-lr-bc98909a3be3-host-0
      podPhase: Failed
    bob/job-ss-lr-bc98909a3be3-guest-0:
      namespace: bob
      podName: job-ss-lr-bc98909a3be3-guest-0
      podPhase: Failed
      reason: Error
      terminationLog: 'container[ss-lr] terminated state reason "Error", message:
        "[2023-08-04 07:51:11.359] [info] [util.cc:52] create link context for blackbox
        failed: [Enforce fail at external/yacl/yacl/link/factory_brpc_blackbox.cc:64]
        iter != party_info.end(). cannot find config.self_role: in ENV(config.node_id.*)\nStacktrace:\n#0
        yacl::link::FactoryBrpcBlackBox::GetPartyNodeInfoFromEnv()+0x55c0cf29392f\n#1
        ic_impl::util::CreateLinkContextForBlackBox()+0x55c0cece3066\n#2 ic_impl::util::MakeLink()+0x55c0cece34c5\n#3
        ic_impl::CreateIcContext()+0x55c0cece1989\n#4 main+0x55c0ce9f9194\n#5 (unknown)+0x7fb00e0ebd90\n\nI0804
        07:51:11.376988     7 external/com_github_brpc_brpc/src/brpc/server.cpp:1113]
        Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=9530.\nI0804
        07:51:11.379607     7 external/com_github_brpc_brpc/src/brpc/server.cpp:1116]
        Check out http://job-ss-lr-bc98909a3be3-guest-0:9530 in web browser.\nI0804
        07:51:11.482255    71 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345]
        Checking Socket{id=0 addr=127.0.0.1:9531} (0x55c0d2405e80)\nI0804 07:51:20.441656     7
        external/com_github_brpc_brpc/src/brpc/server.cpp:1173] Server[yacl::link::internal::ReceiverServiceImpl]
        is going to quit\n[2023-08-04 07:51:20.443] [error] [channel.h:112] ChannelBase
        destructor is called before WaitLinkTaskFinish, try stop send thread\n[2023-08-04
        07:51:20.443] [error] [ic_main.cc:32] run failed: [external/yacl/yacl/link/context.cc:167]
        connect to mesh failed, failed to setup connection to rank=1\n"'
  startTime: "2023-08-04T07:51:08Z"
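
The terminationLog for bob above reports that `config.self_role` could not be found among the `config.node_id.*` environment variables, and both parties in this task have an empty template spec (`spec: {}`), so it seems likely that no such variables were injected. The following is a minimal Python sketch of that kind of lookup, assuming the role is resolved from the environment in roughly this way. It is not the actual yacl code; the helper name `find_self_node` is hypothetical, and the populated environment is copied from bob's entries in the BFIA-type task spec posted later in this thread.

```python
def find_self_node(env):
    """Hypothetical helper: resolve this party's node ID from its role.

    Mirrors the check described in the error message, not the real
    yacl implementation.
    """
    self_role = env.get("config.self_role", "")
    prefix = "config.node_id."
    # Map each role suffix (for example "host.0") to its node ID.
    party_info = {
        key[len(prefix):]: value
        for key, value in env.items()
        if key.startswith(prefix)
    }
    if self_role not in party_info:
        raise LookupError(
            f"cannot find config.self_role: {self_role} in ENV(config.node_id.*)"
        )
    return party_info[self_role]


# Variables as they appear for bob in the BFIA-type task spec.
bfia_env = {
    "config.self_role": "guest.0",
    "config.node_id.host.0": "alice",
    "config.node_id.guest.0": "bob",
}

print(find_self_node(bfia_env))

# With an empty environment, as an empty template spec would suggest,
# the lookup fails with a message like the one in the log.
try:
    find_self_node({})
except LookupError as err:
    print(err)
```

With an empty environment the role string is empty, which would be consistent with the blank value after `config.self_role:` in the logged message.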

hengzi commented 1 year ago

Submitting the job through the UnionPay BFIA protocol job-creation API also failed:

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2023-08-04T07:57:35Z"
  generation: 2
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: bfia
    kuscia.secretflow/job-id: job-ss-lr
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: ss_lr_1
    kuscia.secretflow/task-unschedulable: "false"
  name: job-ss-lr-418fd95370ea
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-ss-lr
    uid: acedbd94-b554-4300-9af2-d96e35e4cc08
  resourceVersion: "7466"
  uid: 91bb4a62-be69-4751-b3aa-218debaa8f6f
spec:
  initiator: alice
  parties:
  - appImageRef: ss-lr
    domainID: alice
    role: host
    template:
      spec:
        containers:
        - env:
          - name: config.task_id
            value: job-ss-lr-418fd95370ea
          - name: config.session_id
            value: session_job-ss-lr-418fd95370ea
          - name: config.trace_id
            value: trace_job-ss-lr-418fd95370ea
          - name: config.token
            value: token_job-ss-lr-418fd95370ea
          - name: runtime.component.name
            value: ss_lr
          - name: config.self_role
            value: host.0
          - name: config.inst_id.host.0
            value: alice
          - name: config.node_id.host.0
            value: alice
          - name: config.inst_id.guest.0
            value: bob
          - name: config.node_id.guest.0
            value: bob
          - name: runtime.component.parameter.protocol
            value: semi2k
          - name: runtime.component.parameter.protocol_families
            value: ss
          - name: runtime.component.parameter.shard_serialize_format
            value: raw
          - name: runtime.component.parameter.use_ttp
            value: "true"
          - name: runtime.component.parameter.ttp_server_host
            value: ttp-server:9449
          - name: runtime.component.parameter.algo
            value: ss_lr
          - name: runtime.component.parameter.field
            value: "64"
          - name: runtime.component.parameter.namespace
            value: data
          - name: runtime.component.parameter.l2_norm
            value: "0.5"
          - name: runtime.component.parameter.trunc_mode
            value: probabilistic
          - name: runtime.component.parameter.num_epoch
            value: "1"
          - name: runtime.component.parameter.sigmoid_mode
            value: minimax_1
          - name: runtime.component.parameter.ttp_session_id
            value: interconnection-root
          - name: runtime.component.parameter.name
            value: perfect_logit_a.csv
          - name: runtime.component.parameter.has_label
            value: "true"
          - name: runtime.component.parameter.skip_rows
            value: "1"
          - name: runtime.component.parameter.fxp_bits
            value: "18"
          - name: runtime.component.parameter.ttp_adjust_rank
            value: "0"
          - name: runtime.component.parameter.batch_size
            value: "21"
          - name: runtime.component.parameter.learning_rate
            value: "0.0001"
          - name: runtime.component.parameter.l0_norm
            value: "0"
          - name: runtime.component.parameter.last_batch_policy
            value: discard
          - name: runtime.component.parameter.l1_norm
            value: "0"
          - name: runtime.component.parameter.optimizer
            value: sgd
          - name: runtime.component.input.train_data
            value: '{"namespace":"data","name":"perfect_logit_a.csv"}'
          - name: runtime.component.output.train_data
            value: '{"namespace":"job-ss-lr-host-0","name":"job-ss-lr-418fd95370ea-result"}'
          name: ""
          resources: {}
  - appImageRef: ss-lr
    domainID: bob
    role: guest
    template:
      spec:
        containers:
        - env:
          - name: config.task_id
            value: job-ss-lr-418fd95370ea
          - name: config.session_id
            value: session_job-ss-lr-418fd95370ea
          - name: config.trace_id
            value: trace_job-ss-lr-418fd95370ea
          - name: config.token
            value: token_job-ss-lr-418fd95370ea
          - name: runtime.component.name
            value: ss_lr
          - name: config.inst_id.host.0
            value: alice
          - name: config.node_id.host.0
            value: alice
          - name: config.self_role
            value: guest.0
          - name: config.inst_id.guest.0
            value: bob
          - name: config.node_id.guest.0
            value: bob
          - name: runtime.component.parameter.name
            value: perfect_logit_b.csv
          - name: runtime.component.parameter.algo
            value: ss_lr
          - name: runtime.component.parameter.last_batch_policy
            value: discard
          - name: runtime.component.parameter.num_epoch
            value: "1"
          - name: runtime.component.parameter.ttp_session_id
            value: interconnection-root
          - name: runtime.component.parameter.field
            value: "64"
          - name: runtime.component.parameter.l2_norm
            value: "0.5"
          - name: runtime.component.parameter.ttp_server_host
            value: ttp-server:9449
          - name: runtime.component.parameter.skip_rows
            value: "1"
          - name: runtime.component.parameter.l0_norm
            value: "0"
          - name: runtime.component.parameter.learning_rate
            value: "0.0001"
          - name: runtime.component.parameter.shard_serialize_format
            value: raw
          - name: runtime.component.parameter.namespace
            value: data
          - name: runtime.component.parameter.fxp_bits
            value: "18"
          - name: runtime.component.parameter.trunc_mode
            value: probabilistic
          - name: runtime.component.parameter.ttp_adjust_rank
            value: "0"
          - name: runtime.component.parameter.protocol_families
            value: ss
          - name: runtime.component.parameter.use_ttp
            value: "true"
          - name: runtime.component.parameter.has_label
            value: "false"
          - name: runtime.component.parameter.protocol
            value: semi2k
          - name: runtime.component.parameter.sigmoid_mode
            value: minimax_1
          - name: runtime.component.parameter.l1_norm
            value: "0"
          - name: runtime.component.parameter.optimizer
            value: sgd
          - name: runtime.component.parameter.batch_size
            value: "21"
          - name: runtime.component.input.train_data
            value: '{"namespace":"data","name":"perfect_logit_b.csv"}'
          - name: runtime.component.output.train_data
            value: '{"namespace":"job-ss-lr-guest-0","name":"job-ss-lr-418fd95370ea-result"}'
          name: ""
          resources: {}
  scheduleConfig: {}
  taskInputConfig: '{"name":"ss_lr_1","module_name":"ss_lr","output":[{"type":"dataset","key":"result"}],"role":{"host":["alice"],"guest":["bob"]},"initiator":{"role":"host","node_id":"alice"},"task_params":{"host":{"0":{"has_label":true,"name":"perfect_logit_a.csv","namespace":"data"}},"guest":{"0":{"has_label":false,"name":"perfect_logit_b.csv","namespace":"data"}},"common":{"algo":"ss_lr","batch_size":21,"field":64,"fxp_bits":18,"l0_norm":0,"l1_norm":0,"l2_norm":0.5,"last_batch_policy":"discard","learning_rate":0.0001,"num_epoch":1,"optimizer":"sgd","protocol":"semi2k","protocol_families":"ss","shard_serialize_format":"raw","sigmoid_mode":"minimax_1","skip_rows":1,"trunc_mode":"probabilistic","ttp_adjust_rank":0,"ttp_server_host":"ttp-server:9449","ttp_session_id":"interconnection-root","use_ttp":true}}}'
status:
  completionTime: "2023-08-04T07:58:06Z"
  conditions:
  - lastTransitionTime: "2023-08-04T07:57:35Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2023-08-04T07:57:35Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2023-08-04T07:58:06Z"
    status: "False"
    type: Success
  lastReconcileTime: "2023-08-04T07:58:06Z"
  message: The remaining number of parties 1 is less than the schedulable threshold
    2
  phase: Failed
  podStatuses:
    alice/job-ss-lr-418fd95370ea-host-0:
      message: reserved task resources belonging to the task resource group "job-ss-lr-418fd95370ea"
        doesn't meet the minReservedMembers, task resource "bob/job-ss-lr-418fd95370ea-f8113c8ec66f"
        phase is "Reserving"
      namespace: alice
      podName: job-ss-lr-418fd95370ea-host-0
      podPhase: Failed
      reason: Unschedulable
    bob/job-ss-lr-418fd95370ea-guest-0:
      namespace: bob
      podName: job-ss-lr-418fd95370ea-guest-0
      podPhase: Failed
  reason: TaskResourceGroupPhaseFailed
  startTime: "2023-08-04T07:57:35Z"
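
When reading a dump like the one above, the per-pod entries under `status.podStatuses` are usually the quickest way to see which party failed and why. Below is a small Python sketch that pulls out those entries. It assumes the KusciaTask has already been loaded into a dictionary (for example with a YAML parser); the dictionary is written inline here and is trimmed to the fields used, with values copied from the output above. The function name `summarize_failures` is hypothetical.

```python
def summarize_failures(task):
    """Hypothetical helper: list (pod key, reason, message) for failed pods."""
    failures = []
    pod_statuses = task.get("status", {}).get("podStatuses", {})
    for pod_key, pod in pod_statuses.items():
        if pod.get("podPhase") != "Failed":
            continue
        # Some entries carry no reason or message, as with bob's pod above.
        reason = pod.get("reason", "unknown")
        message = pod.get("message") or pod.get("terminationLog") or ""
        failures.append((pod_key, reason, message))
    return failures


# Trimmed copy of the status section from the KusciaTask above.
task = {
    "status": {
        "phase": "Failed",
        "reason": "TaskResourceGroupPhaseFailed",
        "podStatuses": {
            "alice/job-ss-lr-418fd95370ea-host-0": {
                "podPhase": "Failed",
                "reason": "Unschedulable",
                "message": (
                    "reserved task resources belonging to the task resource "
                    'group "job-ss-lr-418fd95370ea" doesn\'t meet the '
                    "minReservedMembers, task resource "
                    '"bob/job-ss-lr-418fd95370ea-f8113c8ec66f" phase is '
                    '"Reserving"'
                ),
            },
            "bob/job-ss-lr-418fd95370ea-guest-0": {"podPhase": "Failed"},
        },
    }
}

for pod_key, reason, message in summarize_failures(task):
    print(f"{pod_key}: {reason}")
    if message:
        print(f"  {message}")
```

For this task the summary shows that alice's pod was marked Unschedulable because bob's task resource was still in the "Reserving" phase, while bob's pod carries no reason of its own.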

gshilei commented 1 year ago

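Hi @hengzi, thanks for the feedback. This job does indeed have a problem at the moment. We will update the Kuscia image, and once that is done I will @ you here so that you can test again with the latest Kuscia image.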

gshilei commented 1 year ago

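Hi @hengzi, the official Kuscia image has been updated. Please update your local image with the docker pull command.

Then reinstall the nodes following the document: https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans/tutorial/run_bfia_job_cn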

hengzi commented 1 year ago

Verification succeeded.

alice node container (image)

bob node container (image)

gshilei commented 1 year ago

Hi @hengzi, have you verified both of the following two methods?

hengzi commented 1 year ago

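Hi @hengzi, have you verified both of the following two methods?

  • Submitting the job by configuring a KusciaJob
  • Submitting the job through the UnionPay BFIA protocol API

Both methods succeeded.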

image

gshilei commented 1 year ago

Hi @hengzi, thank you for the verification.