secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Kuscia multi-machine deployment: job validation fails when using node forwarding #423

Closed shnnosuke34725 closed 4 days ago

shnnosuke34725 commented 1 month ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-150-generic x86_64)

Kuscia Version

kuscia v0.9.0b0

Deployment

docker

deployment Version

docker 24.0.5

App Running type

secretflow

App Running version

secretflow-lite-anolis8:1.7.0b0

Configuration file used to run kuscia.

# alice_autonomy.yaml
mode: autonomy
domainID: alice
domainKeyData: LS0tLS1CRU...
logLevel: INFO
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""

# bob_autonomy.yaml
mode: autonomy
domainID: bob
domainKeyData: LS0tLS1CR...
logLevel: INFO
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""

# carol_autonomy.yaml
mode: autonomy
domainID: carol
domainKeyData: LS0tLS1CR...
logLevel: INFO
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""

What happened and what you expected to happen.

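I created three autonomy nodes (alice, bob, and carol) on three machines. Routing rules are configured between alice and bob and between bob and carol; both pairs connect directly and run jobs successfully. I also configured alice-bob-carol node forwarding by following the node forwarding documentation. However, running a job reports "Validate job failed, can't find party namespace carol under cluster". Where might the problem be, or at which step?
Reference documentation: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/reference/concepts/domainroute_cn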

Kuscia log output.

# KusciaJob details
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"KusciaJob","metadata":{"annotations":{},"name":"job-best-effort-linear","namespace":"cross-domain"},"spec":{"initiator":"alice","maxParallelism":2,"scheduleMode":"BestEffort","tasks":[{"alias":"job-psi","appImage":"secretflow-image","parties":[{"domainID":"alice"},{"domainID":"carol"}],"priority":100,"taskID":"job-psi","taskInputConfig":"{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"carol\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"carol\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 
2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"psi\",\"version\":\"0.0.5\",\"attr_paths\":[\"protocol\",\"sort_result\",\"allow_duplicate_keys\",\"allow_duplicate_keys/yes/join_type\",\"allow_duplicate_keys/yes/join_type/left_join/left_side\",\"input/receiver_input/key\",\"input/sender_input/key\"],\"attrs\":[{\"s\":\"PROTOCOL_ECDH\"},{\"b\":true},{\"s\":\"yes\"},{\"s\":\"left_join\"},{\"ss\":[\"alice\"]},{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]}]},\"sf_input_ids\":[\"alice-table\",\"carol-table\"],\"sf_output_ids\":[\"psi-output\"],\"sf_output_uris\":[\"psi-output.csv\"]}"},{"alias":"job-split","appImage":"secretflow-image","dependencies":["job-psi"],"parties":[{"domainID":"alice"},{"domainID":"carol"}],"priority":100,"taskID":"job-split","taskInputConfig":"{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"carol\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"carol\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 
2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"train_test_split\",\"version\":\"0.0.1\",\"attr_paths\":[\"train_size\",\"test_size\",\"random_state\",\"shuffle\"],\"attrs\":[{\"f\":0.75},{\"f\":0.25},{\"i64\":1234},{\"b\":true}]},\"sf_output_uris\":[\"train-dataset.csv\",\"test-dataset.csv\"],\"sf_output_ids\":[\"train-dataset\",\"test-dataset\"],\"sf_input_ids\":[\"psi-output\"]}"}]}}
  creationTimestamp: "2024-09-10T02:06:32Z"
  generation: 1
  name: job-best-effort-linear
  namespace: cross-domain
  resourceVersion: "1043379"
  uid: 66bf4300-275b-46a8-94da-ae3df44609e0
spec:
  initiator: alice
  maxParallelism: 2
  scheduleMode: BestEffort
  tasks:
  - alias: job-psi
    appImage: secretflow-image
    parties:
    - domainID: alice
    - domainID: carol
    priority: 100
    taskID: job-psi
    taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"carol":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","carol"],"devices":[{"name":"spu","type":"spu","parties":["alice","carol"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","carol"],"config":"{\"mode\":
      \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"psi","version":"0.0.5","attr_paths":["protocol","sort_result","allow_duplicate_keys","allow_duplicate_keys/yes/join_type","allow_duplicate_keys/yes/join_type/left_join/left_side","input/receiver_input/key","input/sender_input/key"],"attrs":[{"s":"PROTOCOL_ECDH"},{"b":true},{"s":"yes"},{"s":"left_join"},{"ss":["alice"]},{"ss":["id1"]},{"ss":["id2"]}]},"sf_input_ids":["alice-table","carol-table"],"sf_output_ids":["psi-output"],"sf_output_uris":["psi-output.csv"]}'
    tolerable: false
  - alias: job-split
    appImage: secretflow-image
    dependencies:
    - job-psi
    parties:
    - domainID: alice
    - domainID: carol
    priority: 100
    taskID: job-split
    taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"carol":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","carol"],"devices":[{"name":"spu","type":"spu","parties":["alice","carol"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","carol"],"config":"{\"mode\":
      \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"train_test_split","version":"0.0.1","attr_paths":["train_size","test_size","random_state","shuffle"],"attrs":[{"f":0.75},{"f":0.25},{"i64":1234},{"b":true}]},"sf_output_uris":["train-dataset.csv","test-dataset.csv"],"sf_output_ids":["train-dataset","test-dataset"],"sf_input_ids":["psi-output"]}'
    tolerable: false
status:
  completionTime: "2024-09-10T02:06:32Z"
  conditions:
  - lastTransitionTime: "2024-09-10T02:06:32Z"
    message: Validate job failed, can't find party namespace carol under cluster,
      namespace "carol" not found
    reason: ValidateFailed
    status: "False"
    type: JobValidated
  lastReconcileTime: "2024-09-10T02:06:32Z"
  phase: Failed
  reason: KusciaJobValidateFailed
  startTime: "2024-09-10T02:06:32Z"
wangzul commented 1 month ago
  1. Run kubectl get cdr to check the route configuration.
gshilei commented 1 month ago

  • The alice cluster must contain the bob and carol domains.
  • The bob cluster must contain the alice and carol domains.
  • The carol cluster must contain the alice and bob domains.
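The rule above can be sketched as a small check. This is only an illustration: `missing_domains` is a hypothetical helper, not part of Kuscia, and the inventory below is made up. In practice the list of domains present in each cluster would come from querying that cluster.

```python
def missing_domains(parties, domains_by_cluster):
    """For each cluster, list the peer domains that are not registered in it."""
    missing = {}
    for cluster in parties:
        expected = set(parties) - {cluster}          # every other party
        present = set(domains_by_cluster.get(cluster, ()))
        gap = sorted(expected - present)
        if gap:
            missing[cluster] = gap
    return missing


parties = ["alice", "bob", "carol"]

# Made-up inventory for illustration: alice has not registered carol yet.
domains_by_cluster = {
    "alice": ["alice", "bob"],
    "bob": ["alice", "bob", "carol"],
    "carol": ["alice", "bob", "carol"],
}

print(missing_domains(parties, domains_by_cluster))
```

With this inventory the check reports that the alice cluster lacks the carol domain, which matches the "namespace carol not found" validation error in the original report.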

shnnosuke34725 commented 1 month ago

I checked the route configuration with kubectl get cdr. The READY column is empty for alice-carol and carol-alice. Does it need to be True there?
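A minimal sketch of how routes with an empty READY column could be picked out of the table output. It assumes the output is column-aligned under its header, so cells are sliced by header offsets rather than split on whitespace (a blank HOST or READY cell would otherwise shift the columns). `parse_fixed_width` is a hypothetical helper, and the sample rows are illustrative values built to mimic the situation described here, not captured output.

```python
import re


def parse_fixed_width(text):
    """Parse column-aligned table text into a list of dicts keyed by header name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Column names and the offsets at which they start in the header line.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else len(ln)
            row[name] = ln[start:end].strip()
        rows.append(row)
    return rows


# Build an aligned sample so the column offsets are guaranteed to line up.
ROW = "{:<14}{:<9}{:<14}{:<18}{:<17}{}"
sample = "\n".join(
    ROW.format(*cells)
    for cells in [
        ("NAME", "SOURCE", "DESTINATION", "HOST", "AUTHENTICATION", "READY"),
        ("alice-bob", "alice", "bob", "192.168.123.93", "Token", "True"),
        ("bob-alice", "bob", "alice", "", "Token", "True"),
        ("alice-carol", "alice", "carol", "192.168.123.198", "Token", ""),
        ("carol-alice", "carol", "alice", "192.168.123.89", "Token", ""),
    ]
)

rows = parse_fixed_width(sample)
not_ready = [r["NAME"] for r in rows if r["READY"] != "True"]
print(not_ready)
```

Any route listed in `not_ready` has not completed its handshake, so jobs that depend on it are unlikely to be scheduled.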

shnnosuke34725 commented 1 month ago

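The deployment tutorial says that establishing authorization requires the two nodes to be directly connected. With alice-bob-carol node forwarding, however, alice and carol do not need a direct connection. How should authorization be established in that case?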

wangzul commented 1 month ago

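> The deployment tutorial says that establishing authorization requires the two nodes to be directly connected. With alice-bob-carol node forwarding, however, alice and carol do not need a direct connection. How should authorization be established in that case?

You can refer to https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/Docker_deployment_kuscia/deploy_p2p_cn#id4 to add the domains, following what @gshilei described above.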

shnnosuke34725 commented 1 month ago

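Hello, sorry to bother you again. After configuring the cdr, running the job shows the following status:

status:
  approveStatus:
    alice: JobAccepted
  conditions:
  - lastTransitionTime: "2024-09-10T08:38:47Z"
    status: "True"
    type: JobValidated
  lastReconcileTime: "2024-09-10T08:38:47Z"
  phase: AwaitingApproval
  stageStatus:
    alice: JobCreateStageSucceeded
  startTime: "2024-09-10T08:38:47Z"

What could the problem be? I have checked, and the domaindatagrant resources are all configured.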

gshilei commented 1 month ago
  1. Please paste the annotations from the job metadata and the contents of status.
  2. Please paste the output of kubectl get cdr.
shnnosuke34725 commented 1 month ago

annotations in the job metadata

annotations:
  kubectl.kubernetes.io/last-applied-configuration: |
    {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"KusciaJob","metadata":{"annotations":{},"name":"job-best-effort-linear","namespace":"cross-domain"},"spec":{"initiator":"alice","maxParallelism":2,"scheduleMode":"BestEffort","tasks":[{"alias":"job-psi","appImage":"secretflow-image","parties":[{"domainID":"alice"},{"domainID":"carol"}],"priority":100,"taskID":"job-psi","taskInputConfig":"{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"carol\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"carol\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"psi\",\"version\":\"0.0.5\",\"attr_paths\":[\"protocol\",\"sort_result\",\"allow_duplicate_keys\",\"allow_duplicate_keys/yes/join_type\",\"allow_duplicate_keys/yes/join_type/left_join/left_side\",\"input/receiver_input/key\",\"input/sender_input/key\"],\"attrs\":[{\"s\":\"PROTOCOL_ECDH\"},{\"b\":true},{\"s\":\"yes\"},{\"s\":\"left_join\"},{\"ss\":[\"alice\"]},{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]}]},\"sf_input_ids\":[\"alice-table\",\"carol-table\"],\"sf_output_ids\":[\"psi-output\"],\"sf_output_uris\":[\"psi-output.csv\"]}"},{"alias":"job-split","appImage":"secretflow-image","dependencies":["job-psi"],"parties":[{"domainID":"alice"},{"domainID":"carol"}],"priority":100,"taskID":"job-split","taskInputConfig":"{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"carol\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"carol\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"carol\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"train_test_split\",\"version\":\"0.0.1\",\"attr_paths\":[\"train_size\",\"test_size\",\"random_state\",\"shuffle\"],\"attrs\":[{\"f\":0.75},{\"f\":0.25},{\"i64\":1234},{\"b\":true}]},\"sf_output_uris\":[\"train-dataset.csv\",\"test-dataset.csv\"],\"sf_output_ids\":[\"train-dataset\",\"test-dataset\"],\"sf_input_ids\":[\"psi-output\"]}"}]}}
  kuscia.secretflow/initiator: alice
  kuscia.secretflow/interconn-kuscia-parties: carol
  kuscia.secretflow/interconn-self-parties: alice
  kuscia.secretflow/self-cluster-as-initiator: "true"
creationTimestamp: "2024-09-10T09:15:01Z"
generation: 1
name: job-best-effort-linear
namespace: cross-domain
resourceVersion: "1083332"
uid: 5fd9fdbe-2e3f-4e67-81ec-eac49c8e58ab

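status in the job metadata

status:
  approveStatus:
    alice: JobAccepted
  conditions:
  - lastTransitionTime: "2024-09-10T09:15:01Z"
    status: "True"
    type: JobValidated
  lastReconcileTime: "2024-09-10T09:15:01Z"
  phase: AwaitingApproval
  stageStatus:
    alice: JobCreateStageSucceeded
  startTime: "2024-09-10T09:15:01Z"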

kubectl get cdr in the alice container

NAME          SOURCE   DESTINATION   HOST              AUTHENTICATION   READY
carol-alice   carol    alice         192.168.123.89    Token            True
alice-carol   alice    carol         192.168.123.198   Token            True
alice-bob     alice    bob           192.168.123.93    Token            True
bob-alice     bob      alice                           Token            True

kubectl get cdr in the bob container

NAME          SOURCE   DESTINATION   HOST              AUTHENTICATION   READY
bob-carol     bob      carol                           Token            True
carol-bob     carol    bob           192.168.123.93    Token            True
alice-carol   alice    carol         192.168.123.198   Token            True
carol-alice   carol    alice         192.168.123.89    Token            True

kubectl get cdr in the carol container

NAME          SOURCE   DESTINATION   HOST              AUTHENTICATION   READY
bob-carol     bob      carol                           Token            True
carol-bob     carol    bob           192.168.123.93    Token            True
alice-carol   alice    carol         192.168.123.198   Token            True
carol-alice   carol    alice         192.168.123.89    Token            True

wangzul commented 1 month ago

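> Hello, sorry to bother you again. After configuring the cdr, running the job shows the following status:
>
> status:
>   approveStatus:
>     alice: JobAccepted
>   conditions:
>   - lastTransitionTime: "2024-09-10T08:38:47Z"
>     status: "True"
>     type: JobValidated
>   lastReconcileTime: "2024-09-10T08:38:47Z"
>   phase: AwaitingApproval
>   stageStatus:
>     alice: JobCreateStageSucceeded
>   startTime: "2024-09-10T08:38:47Z"
>
> What could the problem be? I have checked, and the domaindatagrant resources are all configured.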

> status in the job metadata
>
> status:
>   approveStatus:
>     alice: JobAccepted
>   conditions:
>   - lastTransitionTime: "2024-09-10T09:15:01Z"
>     status: "True"
>     type: JobValidated
>   lastReconcileTime: "2024-09-10T09:15:01Z"
>   phase: AwaitingApproval
>   stageStatus:
>     alice: JobCreateStageSucceeded
>   startTime: "2024-09-10T09:15:01Z"

  1. Run kubectl get pod -A to take a look. For details on a specific pod, use kubectl get pod -n {namespace, if any} {name} -oyaml
  2. Troubleshooting guide: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed
  3. Following that guide, please provide the output of cat /home/kuscia/var/stdout/pods/podName_xxxx/xxxx/x.log
shnnosuke34725 commented 1 month ago

kubectl get pod -A shows "No resources found".

gshilei commented 1 month ago

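From the output below, this job has only two parties, alice and carol. alice has already approved it, but no approval status has been received from carol. Check whether the job exists on the carol node. If it does, search kuscia.log for the jobID and see whether any errors are reported.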
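The reasoning above amounts to comparing the job's parties with the entries in approveStatus. A minimal sketch, assuming the KusciaJob has already been loaded into a Python dict (for example from its YAML); `pending_parties` is a hypothetical helper, and the dict below is trimmed to the fields shown in this thread:

```python
def pending_parties(job):
    """Return the parties that have not reported JobAccepted yet."""
    parties = {
        party["domainID"]
        for task in job["spec"]["tasks"]
        for party in task["parties"]
    }
    approve_status = job.get("status", {}).get("approveStatus", {})
    approved = {d for d, state in approve_status.items() if state == "JobAccepted"}
    return sorted(parties - approved)


# Trimmed copy of the job from this thread: only alice has approved.
job = {
    "spec": {
        "initiator": "alice",
        "tasks": [
            {"taskID": "job-psi",
             "parties": [{"domainID": "alice"}, {"domainID": "carol"}]},
            {"taskID": "job-split",
             "parties": [{"domainID": "alice"}, {"domainID": "carol"}]},
        ],
    },
    "status": {
        "phase": "AwaitingApproval",
        "approveStatus": {"alice": "JobAccepted"},
    },
}

print(pending_parties(job))  # ['carol']
```

A party that stays in this list is the one whose node should be inspected next, here carol.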

shnnosuke34725 commented 1 month ago

I checked, and the job does not exist on the carol node.

gshilei commented 1 month ago
  1. On carol, run kubectl get interop and check whether there is a carol - alice interop.
  2. Check kuscia.log for any abnormal errors.
shnnosuke34725 commented 1 month ago

1. Result of kubectl get interop:

NAME            AGE
carol-2-bob     5d3h
carol-2-alice   3h13m

2. These are the last few lines of errors; I am not sure whether they are the relevant ones:

.26.11/tools/cache/reflector.go:169: failed to list v1alpha1.KusciaDeployment: the server has asked for the client to provide credentials (get kusciadeployments.kuscia.secretflow)
2024-09-10 19:40:02.070 INFO nlog/nlog.go:77 W0910 19:40:02.070399 827 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.11/tools/cache/reflector.go:169: failed to list v1alpha1.KusciaDeployment: the server has asked for the client to provide credentials (get kusciadeployments.kuscia.secretflow)
2024-09-10 19:40:02.070 INFO nlog/nlog.go:77 E0910 19:40:02.070648 827 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.11/tools/cache/reflector.go:169: Failed to watch v1alpha1.KusciaDeployment: failed to list v1alpha1.KusciaDeployment: the server has asked for the client to provide credentials (get kusciadeployments.kuscia.secretflow)
2024-09-10 19:40:02.070 INFO nlog/nlog.go:77 E0910 19:40:02.070648 827 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.11/tools/cache/reflector.go:169: Failed to watch v1alpha1.KusciaDeployment: failed to list v1alpha1.KusciaDeployment: the server has asked for the client to provide credentials (get kusciadeployments.kuscia.secretflow)
2024-09-10 19:40:02.070 INFO nlog/nlog.go:77 E0910 19:40:02.070648 827 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.11/tools/cache/reflector.go:169: Failed to watch v1alpha1.KusciaDeployment: failed to list v1alpha1.KusciaDeployment: the server has asked for the client to provide credentials (get kusciadeployments.kuscia.secretflow)

gshilei commented 1 month ago
  1. The carol-alice cdr created on the alice node may be the problem; please paste its contents.
  2. On the alice node, fetch the carol domain: kubectl get domain carol -o yaml
shnnosuke34725 commented 1 month ago
  1. The carol-alice cdr (not sure whether this is what you meant):

apiVersion: kuscia.secretflow/v1alpha1
kind: ClusterDomainRoute
metadata:
  name: carol-alice
spec:
  authenticationType: Token
  source: carol
  destination: alice
  endpoint:
    host: 192.168.123.89
    ports:
    - name: http
      port: 11080
      protocol: HTTP
      isTLS: true
  transit:
    transitMethod: THIRD-DOMAIN
    domain:
      domainID: bob
  tokenConfig:
    tokenGenMethod: RSA-GEN
    rollingUpdatePeriod: 86400

  2. Output of kubectl get domain carol -o yaml:

apiVersion: kuscia.secretflow/v1alpha1
kind: Domain
metadata:
  annotations:
    domain/carol: kuscia.secretflow/domain-type=embedded
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"Domain","metadata":{"annotations":{"domain/carol":"kuscia.secretflow/domain-type=embedded"},"name":"carol"},"spec":{"authCenter":{"authenticationType":"Token","tokenGenMethod":"RSA-GEN"},"cert":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQVRBTkJna3Foa2lHOXcwQkFRc0ZBREFRTVE0d0RBWURWUVFERXdWallYSnYKYkRBZ0Z3MHlOREE1TURNd056VTNNRGxhR0E4eU1EYzBNRGt3TXpBM05UY3dPVm93RURFT01Bd0dBMVVFQXhNRgpZMkZ5YjJ3d2dnRWlNQTBHQ1NxR1NJYjNEUUVCQVFVQUE0SUJEd0F3Z2dFS0FvSUJBUURpV2dXckgyM0hQaEVWCllRWVNaSnFXYWM0aXBOSW4xbkoxdk9VWnk5NEhreWlFRElUdXVaZG8rQlpsS1Q5bmVQM3c4QTVPQ2lmQWMzN3oKeDVJZkhGQVhvUWk4U2orR0hiSTVFbE1RNFhvMjF3Ulh1TDRFc0hHaG8yckNFQ2d6OGhVTzB0YUlhWm1SNVBOMQpHNXRPeExKY1FNeVptTnEzNXVqRXpGNkRhU2tmdU9JdjUvTTgyT3M2WlFyR1pZT3ZrUDF5aWExVHZVQ0pLTFJiClc3a1N6TXVtSStwMnRLUzc3eUJENVR4cjdCdWEwR0VzZWlCdDBpYVZwU1pOT3VFMTI0UzAzWW5RbUl6Sm5CMWgKbDVBWnNpdkFRTWNRaXluT3hpdnBhL3FuOHFRQnB6ajZIdkQ1RUxCSkdya3NVVmszMVJ2RDlpay9tMTZVVUtObAphL2FjVE1jRkFnTUJBQUdqWVRCZk1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGCkJRY0RBZ1lJS3dZQkJRVUhBd0V3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFkQmdOVkhRNEVGZ1FVZTlWMWR3aEcKcEtsbW5vbUtLSTZvVTZMMVlmQXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBRE93TW0zcHFDbGFDZ0hiMWxUKwp1WFRsMlNPSUFLQUZVWit4YmpHQ1NDKzR0K1pkNm8vRUJ3SVhTaWJ0ZENXVDRTNnJ6QndFNStNL0MwQXZaSlY4CkRKMyswR1o4Q3pyVTY2NnFQaG5GRUZZZythNWsrc2FCTER3WC80TkE0ZElucDFWUDRzYmRqYnlaemxIbEZ3K1UKaUlSQWxZNWs5ZGNyNUxiM082U1ZlR015Qm4vV2YxRE5BZXBDVG5RQWxoTjFpVjJkRFZnWGVuTHAvdWJPS1ErQwp2K0VwZFEwd2RCZEk5NS9JZHd2c29xTWI4T2kvaXgrZjhzTlRTUzBSZGtQcTFDV1Fiei9SekNzWDNJcXFMNnU2CjVjeFBDWEdsSm9VZUNjSUtUTzdQWE8xazVBY3NOTDlQTEFmWUJuanlVVDJUbG9ydlNrcnZmenZwblloN25sVkoKRk9BPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==","interConnProtocols":["kuscia"],"master":"carol","role":"partner"}}
  creationTimestamp: "2024-09-10T07:48:00Z"
  generation: 1
  labels:
    kuscia.secretflow/domain-auth: completed
  name: carol
  resourceVersion: "1075160"
  uid: 96725c48-1a2a-46bd-92e9-2659913f7660
spec:
  authCenter:
    authenticationType: Token
    tokenGenMethod: RSA-GEN
  cert: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQVRBTkJna3Foa2lHOXcwQkFRc0ZBREFRTVE0d0RBWURWUVFERXdWallYSnYKYkRBZ0Z3MHlOREE1TURNd056VTNNRGxhR0E4eU1EYzBNRGt3TXpBM05UY3dPVm93RURFT01Bd0dBMVVFQXhNRgpZMkZ5YjJ3d2dnRWlNQTBHQ1NxR1NJYjNEUUVCQVFVQUE0SUJEd0F3Z2dFS0FvSUJBUURpV2dXckgyM0hQaEVWCllRWVNaSnFXYWM0aXBOSW4xbkoxdk9VWnk5NEhreWlFRElUdXVaZG8rQlpsS1Q5bmVQM3c4QTVPQ2lmQWMzN3oKeDVJZkhGQVhvUWk4U2orR0hiSTVFbE1RNFhvMjF3Ulh1TDRFc0hHaG8yckNFQ2d6OGhVTzB0YUlhWm1SNVBOMQpHNXRPeExKY1FNeVptTnEzNXVqRXpGNkRhU2tmdU9JdjUvTTgyT3M2WlFyR1pZT3ZrUDF5aWExVHZVQ0pLTFJiClc3a1N6TXVtSStwMnRLUzc3eUJENVR4cjdCdWEwR0VzZWlCdDBpYVZwU1pOT3VFMTI0UzAzWW5RbUl6Sm5CMWgKbDVBWnNpdkFRTWNRaXluT3hpdnBhL3FuOHFRQnB6ajZIdkQ1RUxCSkdya3NVVmszMVJ2RDlpay9tMTZVVUtObAphL2FjVE1jRkFnTUJBQUdqWVRCZk1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGCkJRY0RBZ1lJS3dZQkJRVUhBd0V3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFkQmdOVkhRNEVGZ1FVZTlWMWR3aEcKcEtsbW5vbUtLSTZvVTZMMVlmQXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBRE93TW0zcHFDbGFDZ0hiMWxUKwp1WFRsMlNPSUFLQUZVWit4YmpHQ1NDKzR0K1pkNm8vRUJ3SVhTaWJ0ZENXVDRTNnJ6QndFNStNL0MwQXZaSlY4CkRKMyswR1o4Q3pyVTY2NnFQaG5GRUZZZythNWsrc2FCTER3WC80TkE0ZElucDFWUDRzYmRqYnlaemxIbEZ3K1UKaUlSQWxZNWs5ZGNyNUxiM082U1ZlR015Qm4vV2YxRE5BZXBDVG5RQWxoTjFpVjJkRFZnWGVuTHAvdWJPS1ErQwp2K0VwZFEwd2RCZEk5NS9JZHd2c29xTWI4T2kvaXgrZjhzTlRTUzBSZGtQcTFDV1Fiei9SekNzWDNJcXFMNnU2CjVjeFBDWEdsSm9VZUNjSUtUTzdQWE8xazVBY3NOTDlQTEFmWUJuanlVVDJUbG9ydlNrcnZmenZwblloN25sVkoKRk9BPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  interConnProtocols:
  - kuscia
  master: carol
  role: partner
status:
  deployTokenStatuses:
  - lastTransitionTime: "2024-09-10T07:48:00Z"
    state: unused
    token: 6jQH9Y5BFiXfW8L2pDLl2yu2If7abEdI
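A rough sanity check for forwarded routes like the carol-alice cdr above. It rests on an assumption drawn from the node forwarding setup described in this thread: a route that transits through a third domain relies on direct routes for each hop (source to transit domain, transit domain to destination). `missing_hops` is a hypothetical helper and the route list is illustrative. It only checks that the hop routes exist; it cannot detect a token verification problem inside the forwarding flow itself.

```python
def missing_hops(routes):
    """Map each forwarded route to the direct hop routes it lacks.

    Each route is a dict with "source", "destination" and an optional
    "transit" key holding the domainID of the forwarding domain.
    """
    direct = {
        (r["source"], r["destination"]) for r in routes if not r.get("transit")
    }
    problems = {}
    for r in routes:
        via = r.get("transit")
        if not via:
            continue
        needed = [(r["source"], via), (via, r["destination"])]
        gaps = [f"{s}-{d}" for s, d in needed if (s, d) not in direct]
        if gaps:
            problems[f'{r["source"]}-{r["destination"]}'] = gaps
    return problems


# Illustrative input: carol-alice is forwarded via bob, but only the
# carol-bob hop is present as a direct route.
routes = [
    {"source": "carol", "destination": "alice", "transit": "bob"},
    {"source": "carol", "destination": "bob"},
]

print(missing_hops(routes))  # {'carol-alice': ['bob-alice']}
```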
gshilei commented 1 month ago

The cdr created above is most likely the problem. A colleague will follow up on this.

shnnosuke34725 commented 1 month ago

OK, sorry for the trouble, and thank you very much!

LeeTheByRiver commented 1 month ago

This may be caused by a bug in the forwarding flow that makes token verification fail. I will share the details once it is confirmed.

LeeTheByRiver commented 1 month ago

This issue has been reproduced and confirmed. The fix will ship in a later release.

github-actions[bot] commented 1 week ago

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.