secretflow / kuscia

Kuscia(Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Deploying Kuscia in centralized mode on K8s: the test job stays in Running #344

Open magic-hya opened 2 months ago

magic-hya commented 2 months ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

CentOS Linux 7

Kuscia Version

latest

Deployment

k8s

deployment Version

k8s 1.22.2

App Running type

secretflow

App Running version

latest

Configuration file used to run kuscia.

Deployed by following the official K8s deployment guide for centralized Kuscia.

What happened and what you expected to happen.

## The job was submitted successfully
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# scripts/user/create_example_job.sh
/home/kuscia /home/kuscia
With JOB_EXAMPLE=PSI, job via APP_IMAGE=secretflow-image creating ...
kusciajob.kuscia.secretflow/secretflow-task-20240620111611 created
Job 'secretflow-task-20240620111611' created successfully. You can use the following command to display job status:
  kubectl get kj -n cross-domain

## The job stays in Running
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get kj -n cross-domain
NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240620111611   12s                          11s                 Running

Kuscia log output.

## The alice and bob pods show errors
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl logs secretflow-task-20240620111611-single-psi-0 -n alice
Error from server: Get "https://192.168.62.11:10250/containerLogs/alice/secretflow-task-20240620111611-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.62.11:10250, code 502: 502 Bad Gateway

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl logs secretflow-task-20240620111611-single-psi-0 -n bob
Error from server: Get "https://192.168.62.12:10250/containerLogs/bob/secretflow-task-20240620111611-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.62.12:10250, code 502: 502 Bad Gateway
zimu-yuxi commented 2 months ago

Can you check the routes for alice and bob? kubectl get cdr -A

magic-hya commented 2 months ago

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get cdr -A
NAME                  SOURCE   DESTINATION     HOST                                             AUTHENTICATION   READY
alice-bob             alice    bob             kuscia-lite-bob.lite-bob.svc.cluster.local       Token            True
bob-alice             bob      alice           kuscia-lite-alice.lite-alice.svc.cluster.local   Token            True
alice-kuscia-system   alice    kuscia-system                                                    Token            True
bob-kuscia-system     bob      kuscia-system                                                    Token            True

zimu-yuxi commented 2 months ago

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get cdr -A
NAME                  SOURCE   DESTINATION     HOST                                             AUTHENTICATION   READY
alice-bob             alice    bob             kuscia-lite-bob.lite-bob.svc.cluster.local       Token            True
bob-alice             bob      alice           kuscia-lite-alice.lite-alice.svc.cluster.local   Token            True
alice-kuscia-system   alice    kuscia-system                                                    Token            True
bob-kuscia-system     bob      kuscia-system                                                    Token            True

Check whether there is any relevant log output under /home/kuscia/var/stdout.

magic-hya commented 2 months ago

There is nothing in the directory.
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# pwd
/home/kuscia/var/stdout
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# ls -l
total 0

magic-hya commented 2 months ago

I wanted to get into the bob container to take a look, and that reports the same error. Is this error related?
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# kubectl exec -it secretflow-task-20240620111611-single-psi-0 -n bob -- bash
Error from server: error dialing backend: proxy error from 0.0.0.0:6443 while dialing 192.168.62.12:10250, code 502: 502 Bad Gateway

aokaokd commented 2 months ago

Please check whether a security policy is configured on your gateway. You can also check the status of the bob container:

kubectl describe pod   {bob pod} -n  xxx
magic-hya commented 2 months ago
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# kubectl describe pod secretflow-task-20240620111611-single-psi-0 -n bob
Name:             secretflow-task-20240620111611-single-psi-0
Namespace:        bob
Priority:         0
Service Account:  default
Node:             kuscia-lite-bob-69bd6df646-k8krs/192.168.62.12
Start Time:       Thu, 20 Jun 2024 11:17:57 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=0cddc892-e259-4af0-b393-fc7b1d66bf40-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-uid=56940863-4cf4-4978-81b0-2483eb75f62f
                  kuscia.secretflow/task-uid=0cddc892-e259-4af0-b393-fc7b1d66bf40
Annotations:      kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/image-id: abc
                  kuscia.secretflow/initiator: alice
                  kuscia.secretflow/task-id: secretflow-task-20240620111611-single-psi
                  kuscia.secretflow/task-resource: secretflow-task-20240620111611-single-psi-f6623141c609
                  kuscia.secretflow/task-resource-group: secretflow-task-20240620111611-single-psi
Status:           Pending
IP:
IPs:              <none>
Containers:
  secretflow:
    Container ID:
    Image:         secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.6.0b0
    Image ID:
    Ports:         21320/TCP, 21321/TCP, 21316/TCP, 21317/TCP, 21318/TCP, 21319/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      sh
    Args:
      -c
      python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      KUSCIA_DOMAIN_ID:                   bob
      TASK_ID:                            secretflow-task-20240620111611-single-psi
      TASK_CLUSTER_DEFINE:                {"parties":[{"name":"bob","role":"","services":[{"portName":"spu","endpoints":["secretflow-task-20240620111611-single-psi-0-spu.bob.svc"]},{"portName":"fed","endpoints":["secretflow-task-20240620111611-single-psi-0-fed.bob.svc"]},{"portName":"global","endpoints":["secretflow-task-20240620111611-single-psi-0-global.bob.svc:21316"]}]},{"name":"alice","role":"","services":[{"portName":"spu","endpoints":["secretflow-task-20240620111611-single-psi-0-spu.alice.svc"]},{"portName":"fed","endpoints":["secretflow-task-20240620111611-single-psi-0-fed.alice.svc"]},{"portName":"global","endpoints":["secretflow-task-20240620111611-single-psi-0-global.alice.svc:21565"]}]}],"selfPartyIdx":0,"selfEndpointIdx":0}
      ALLOCATED_PORTS:                    {"ports":[{"name":"spu","port":21320,"scope":"Cluster","protocol":"GRPC"},{"name":"fed","port":21321,"scope":"Cluster","protocol":"GRPC"},{"name":"global","port":21316,"scope":"Domain","protocol":"GRPC"},{"name":"node-manager","port":21317,"scope":"Local","protocol":"GRPC"},{"name":"object-manager","port":21318,"scope":"Local","protocol":"GRPC"},{"name":"client-server","port":21319,"scope":"Local","protocol":"GRPC"}]}
      TASK_INPUT_CONFIG:                  {"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"psi","version":"0.0.4","attr_paths":["input/receiver_input/key","input/sender_input/key","protocol","precheck_input","bucket_size","curve_type","left_side"],"attrs":[{"ss":["id1"]},{"ss":["id2"]},{"s":"PROTOCOL_ECDH"},{"b":true},{"i64":"1048576"},{"s":"CURVE_FOURQ"},{"is_na":false,"ss":["alice"]}]},"sf_input_ids":["alice-table","bob-table"],"sf_output_ids":["psi-output"],"sf_output_uris":["psi-output.csv"]}
      KUSCIA_PORT_SPU_NUMBER:             21320
      KUSCIA_PORT_FED_NUMBER:             21321
      KUSCIA_PORT_GLOBAL_NUMBER:          21316
      KUSCIA_PORT_NODE_MANAGER_NUMBER:    21317
      KUSCIA_PORT_OBJECT_MANAGER_NUMBER:  21318
      KUSCIA_PORT_CLIENT_SERVER_NUMBER:   21319
    Mounts:
      /root/kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        secretflow-task-20240620111611-single-psi-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=bob
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
magic-hya commented 2 months ago

Actually there is one more thing I'm unsure about: when configuring runk as the lite runtime I used the default kubeconfigFile and did not configure RBAC. I don't know whether that matters.

    # configured when runtime is runk
    runk:
      # schedule tasks into the specified institution K8s namespace
      namespace: lite-bob
      # pod DNS config for the institution's K8s cluster, used to resolve the node's application domain names; the DNS address used by pods started via runk should be the clusterIP of the kuscia-lite service, "1.1.1.1" is only an example
      dnsServers:
      # - kuscia-dns-lb-server
        - 10.105.200.142
      # kubeconfig of the K8s cluster; if empty, the default serviceaccount is used. For now leave it empty and use the serviceaccount
      kubeconfigFile:
aokaokd commented 2 months ago

Your kubeconfigFile is fine, but you need to configure RBAC, see: RBAC. Also, the bob pod has no events, so take a look at bob's logs as well:

kubectl logs secretflow-task-20240620111611-single-psi-0 -n bob

If there is no log output, check the status of the K8s nodes:

kubectl get nodes
magic-hya commented 2 months ago
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# kubectl logs secretflow-task-20240620111611-single-psi-0 -n bob
Error from server: Get "https://192.168.62.12:10250/containerLogs/bob/secretflow-task-20240620111611-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.62.12:10250, code 502: 502 Bad Gateway
[root@kuscia-master-76c5b5bc7b-s84k8 stdout]# kubectl get nodes
NAME                                STATUS     ROLES   AGE   VERSION
kuscia-lite-bob-69bd6df646-k8krs    NotReady   agent   29h   578bc84
kuscia-lite-alice-b69cbcc76-f2lg6   NotReady   agent   46h   578bc84

Let me configure RBAC first and check again.

yushiqie commented 2 months ago

I wanted to get into the bob container to take a look, and that reports the same error. Is this error related? [root@kuscia-master-76c5b5bc7b-s84k8 stdout]# kubectl exec -it secretflow-task-20240620111611-single-psi-0 -n bob -- bash Error from server: error dialing backend: proxy error from 0.0.0.0:6443 while dialing 192.168.62.12:10250, code 502: 502 Bad Gateway

For security reasons, pods in Kuscia do not support kubectl logs or kubectl exec. If you want to view task pod logs, check them on the lite node side: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/deployment/logdescription
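
For reference, a minimal sketch of how the task logs can be read on the lite side instead (the pod name and the log directory layout below come from this thread and my understanding of the layout, so treat them as assumptions and adjust to your environment):

# exec into the lite pod in the institution's K8s cluster (outside Kuscia)
kubectl exec -it kuscia-lite-bob-69bd6df646-k8krs -n lite-bob -- bash
# inside the lite container, task stdout is collected under var/stdout/pods
ls /home/kuscia/var/stdout/pods
# tail the secretflow container log of the task pod (exact sub-directory naming may differ)
tail -f /home/kuscia/var/stdout/pods/bob_secretflow-task-*/secretflow/0.log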

magic-hya commented 2 months ago

After enabling RBAC and running the task, it now fails with errors:

kubectl describe pod secretflow-task-20240620172156-single-psi-0 -n alice

...
Events:
  Type     Reason            Age    From              Message
  ----     ------            ----   ----              -------
  Warning  FailedScheduling  4m29s  kuscia-scheduler  0/2 nodes are available: waiting for task resource. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., can not find related task resource.
  Warning  FailedScheduling  4m28s  kuscia-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. reject the pod secretflow-task-20240620172156-single-psi-0 even after PostFilter, preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m56s  kuscia-scheduler
  Warning  FailedScheduling  2m52s  kuscia-scheduler  0/2 nodes are available: task resource alice/secretflow-task-20240620172156-single-psi-64164e39cac5 status phase is Failed, skip scheduling pod. last failed scheduling result: . preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., reject the pod secretflow-task-20240620172156-single-psi-0 even after PostFilter.
aokaokd commented 2 months ago

After enabling RBAC and running the task, it now fails with errors:

kubectl describe pod secretflow-task-20240620172156-single-psi-0 -n alice

...
Events:
  Type     Reason            Age    From              Message
  ----     ------            ----   ----              -------
  Warning  FailedScheduling  4m29s  kuscia-scheduler  0/2 nodes are available: waiting for task resource. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., can not find related task resource.
  Warning  FailedScheduling  4m28s  kuscia-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. reject the pod secretflow-task-20240620172156-single-psi-0 even after PostFilter, preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m56s  kuscia-scheduler
  Warning  FailedScheduling  2m52s  kuscia-scheduler  0/2 nodes are available: task resource alice/secretflow-task-20240620172156-single-psi-64164e39cac5 status phase is Failed, skip scheduling pod. last failed scheduling result: . preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., reject the pod secretflow-task-20240620172156-single-psi-0 even after PostFilter.

Both of your nodes carry taints. Run the following command to see what they are:

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
or
kubectl describe node kuscia-lite-bob-69bd6df646-k8krs
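
If the node.kubernetes.io/unreachable taints turn out to be stale leftovers (for example from a force-deleted pod), a sketch of removing them manually, assuming the alice node name shown earlier in this thread:

# remove the stale "unreachable" taints from the alice agent node
kubectl taint nodes kuscia-lite-alice-b69cbcc76-f2lg6 node.kubernetes.io/unreachable:NoSchedule-
kubectl taint nodes kuscia-lite-alice-b69cbcc76-f2lg6 node.kubernetes.io/unreachable:NoExecute-
# do NOT remove the kuscia.secretflow/agent taint; task pods tolerate it by design
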
magic-hya commented 2 months ago

Earlier the alice/bob task pods inside the container got stuck, so I deleted them with --force; that may be the cause.

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
{
  "name": "kuscia-lite-bob-69bd6df646-k8krs",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "kuscia.secretflow/agent",
      "value": "v1"
    }
  ]
}
{
  "name": "kuscia-lite-alice-b69cbcc76-f2lg6",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "kuscia.secretflow/agent",
      "value": "v1"
    },
    {
      "effect": "NoSchedule",
      "key": "node.kubernetes.io/unreachable",
      "timeAdded": "2024-06-20T10:03:45Z"
    },
    {
      "effect": "NoExecute",
      "key": "node.kubernetes.io/unreachable",
      "timeAdded": "2024-06-20T10:03:50Z"
    }
  ]
}
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl describe node kuscia-lite-bob-69bd6df646-k8krs
Name:               kuscia-lite-bob-69bd6df646-k8krs
Roles:              agent
Labels:             beta.kubernetes.io/arch=x86_64
                    beta.kubernetes.io/os=linux
                    domain=bob
                    kubernetes.io/apiVersion=0.26.6
                    kubernetes.io/arch=x86_64
                    kubernetes.io/hostname=kuscia-lite-bob-69bd6df646-k8krs
                    kubernetes.io/os=linux
                    kubernetes.io/role=agent
                    kuscia.secretflow/namespace=bob
                    kuscia.secretflow/runtime=runk
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Wed, 19 Jun 2024 11:22:44 +0800
Taints:             kuscia.secretflow/agent=v1:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  kuscia-lite-bob-69bd6df646-k8krs
  AcquireTime:     <unset>
  RenewTime:       Thu, 20 Jun 2024 18:09:32 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                  Message
  ----                 ------  -----------------                 ------------------                ------                  -------
  NetworkUnavailable   False   Wed, 19 Jun 2024 11:22:44 +0800   Wed, 19 Jun 2024 11:22:44 +0800   RouteCreated            RouteController created a route
  PIDPressure          False   Wed, 19 Jun 2024 11:22:44 +0800   Wed, 19 Jun 2024 11:22:44 +0800   AgentHasSufficientPID   Agent has sufficient PID available
  Ready                True    Thu, 20 Jun 2024 18:07:49 +0800   Thu, 20 Jun 2024 17:17:54 +0800   AgentReady              Agent is ready
Addresses:
  InternalIP:  192.168.62.12
Capacity:
  cpu:      4
  memory:   4Gi
  pods:     500
  storage:  100Gi
Allocatable:
  cpu:      4
  memory:   4Gi
  pods:     500
  storage:  100Gi
System Info:
  Machine ID:                 c0e61932-d279-5b11-bf44-42bf1a8a4f09
  System UUID:
  Boot ID:                    1709087859-1718767364341055751
  Kernel Version:             5.4.181-1.el7.elrepo.x86_64
  OS Image:                   docker://linux/anolis:23 (guest)
  Operating System:           linux
  Architecture:               x86_64
  Container Runtime Version:
  Kubelet Version:            578bc84
  Kube-Proxy Version:
PodCIDR:                      10.42.1.0/24
PodCIDRs:                     10.42.1.0/24
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  storage            0         0
Events:              <none>
magic-hya commented 2 months ago

After removing the taints and restarting the task, I found two problems. On the bob node:

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get pods -n bob
NAME                                          READY   STATUS             RESTARTS   AGE
secretflow-task-20240621113741-single-psi-0   0/1     ImagePullBackOff   0          4m39s

Events:
  Type     Reason            Age    From              Message
  ----     ------            ----   ----              -------
  Warning  FailedScheduling  4m35s  kuscia-scheduler  0/2 nodes are available: waiting for task resource. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., can not find related task resource.
  Normal   Scheduled         4m12s  kuscia-scheduler  Successfully assigned bob/secretflow-task-20240621113741-single-psi-0 to kuscia-lite-bob-69bd6df646-k8krs

On the alice node:

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get pods -n alice
NAME                                          READY   STATUS              RESTARTS   AGE
secretflow-task-20240621113741-single-psi-0   0/1     ContainerCreating   0          5m53s

Events:
  Type     Reason                  Age    From              Message
  ----     ------                  ----   ----              -------
  Warning  FailedScheduling        2m52s  kuscia-scheduler  0/2 nodes are available: waiting for task resource. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., can not find related task resource.
  Warning  FailedScheduling        2m51s  kuscia-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., reject the pod secretflow-task-20240621113741-single-psi-0 even after PostFilter.
  Normal   Scheduled               2m29s  kuscia-scheduler  Successfully assigned alice/secretflow-task-20240621113741-single-psi-0 to kuscia-lite-alice-b69cbcc76-f2lg6
  Warning  FailedCreatePodSandBox  2m28s  Agent             Failed to sync pod to k8s: failed to create configmap secretflow-task-20240621113741-single-psi-0-secretflow-task-20240621113741-single-psi-configtemplate, detail-> failed to create resource(*v1.ConfigMap) secretflow-task-20240621113741-single-psi-0-secretflow-task-20240621113741-single-psi-configtemplate, detail-> configmaps is forbidden: User "system:serviceaccount:lite-alice:default" cannot create resource "configmaps" in API group "" in the namespace "lite-alice"
aokaokd commented 2 months ago

alice has a permission problem: the FailedCreatePodSandBox event points to a permission issue when creating the ConfigMap. The service account lite-alice:default lacks the permission to create ConfigMaps in the lite-alice namespace.
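
As an illustration of the kind of RBAC that is missing, here is a minimal sketch that grants the service account from the error message the verbs runk needs on common resources. The Role name is made up, and the exact resource/verb list required by your Kuscia version may be broader, so check it against the official RBAC reference before applying:

kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kuscia-runk          # illustrative name
  namespace: lite-alice
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps", "services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kuscia-runk          # illustrative name
  namespace: lite-alice
subjects:
- kind: ServiceAccount
  name: default
  namespace: lite-alice
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kuscia-runk
EOF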

magic-hya commented 2 months ago

alice has a permission problem: the FailedCreatePodSandBox event points to a permission issue when creating the ConfigMap. The service account lite-alice:default lacks the permission to create ConfigMaps in the lite-alice namespace.

curl -X POST 'http://127.0.0.1:8082/api/v1/domaindatagrant/create' \
     --cert /home/kuscia/var/certs/kusciaapi-server.crt \
     --key /home/kuscia/var/certs/kusciaapi-server.key \
     --cacert /home/kuscia/var/certs/ca.crt \
     --header "Token: $(cat /home/kuscia/var/certs/token)" \
     --header 'Content-Type: application/json' \
     -d '{ "grant_domain": "bob",
           "description": {"domaindatagrant":"alice-bob"},
           "domain_id": "alice",
           "domaindata_id": "alice-table"
     }'

I don't quite understand this part: kusciaapi-server.crt and kusciaapi-server.key are not in that directory, and /home/kuscia/var/certs/token does not exist either. How are these files generated? Could that be why the permissions are insufficient?

aokaokd commented 2 months ago

Regarding this:

alice has a permission problem: the FailedCreatePodSandBox event points to a permission issue when creating the ConfigMap. The service account lite-alice:default lacks the permission to create ConfigMaps in the lite-alice namespace.

curl -X POST 'http://127.0.0.1:8082/api/v1/domaindatagrant/create' \
     --cert /home/kuscia/var/certs/kusciaapi-server.crt \
     --key /home/kuscia/var/certs/kusciaapi-server.key \
     --cacert /home/kuscia/var/certs/ca.crt \
     --header "Token: $(cat /home/kuscia/var/certs/token)" \
     --header 'Content-Type: application/json' \
     -d '{ "grant_domain": "bob",
           "description": {"domaindatagrant":"alice-bob"},
           "domain_id": "alice",
           "domaindata_id": "alice-table"
     }'

I don't quite understand this part: kusciaapi-server.crt and kusciaapi-server.key are not in that directory, and /home/kuscia/var/certs/token does not exist either. How are these files generated? Could that be why the permissions are insufficient?

This should be a permission in the external K8s cluster, not inside the container. Your ConfigMap may not have taken effect; delete the pod and let a new one be pulled up to check.

magic-hya commented 2 months ago

I rebuilt the environment from scratch; starting a job still hits the same problem:

Events:
  Type     Reason            Age   From              Message
  ----     ------            ----  ----              -------
  Warning  FailedScheduling  88s   kuscia-scheduler  0/2 nodes are available: waiting for task resource. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling., can not find related task resource.
  Normal   Scheduled         87s   kuscia-scheduler  Successfully assigned bob/secretflow-task-20240624162727-single-psi-0 to kuscia-lite-bob-7f976b945b-m4j5d

It is stuck and I don't know which step is failing.

magic-hya commented 2 months ago
[root@kuscia-master-546445d874-rbztj kuscia]# kubectl get kj -n cross-domain
NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240624162727   33m         5m31s            5m31s               Failed
secretflow-task-20240624165905   84s         1s               1s                  Failed

Now it has failed, and I don't know how to look up the logs.

aokaokd commented 2 months ago

If the job has already been dispatched, you can check the logs under /home/kuscia/var/stdout/pods. For details see: Job failed to run

magic-hya commented 2 months ago

If the job has already been dispatched, you can check the logs under /home/kuscia/var/stdout/pods. For details see: Job failed to run

kubectl get pod secretflow-task-20240624165905-single-psi-0 -o yaml -n alice

"/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 502, in main
    datasource = get_domain_data_source(datasource_stub, datasource_id)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/datamesh.py", line 112, in get_domain_data_source
    ret = stub.QueryDomainDataSource(QueryDomainDataSourceRequest(datasource_id=id))
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "DNS resolution failed for datamesh:8071: C-ares status is not ARES_SUCCESS qtype=A name=datamesh is_balancer=0: Timeout while contacting DNS servers"
    debug_error_string = "UNKNOWN:DNS resolution failed for datamesh:8071: C-ares status is not ARES_SUCCESS qtype=A name=datamesh is_balancer=0: Timeout while contacting DNS servers {grpc_status:14, created_time:"2024-06-24T09:00:30.484496563+00:00"}"
>

This is the error I'm currently seeing.
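
A small sketch for narrowing down the DNS failure, based on the earlier runk config where dnsServers should point at the clusterIP of the kuscia-lite service (the IP below is the value from that config and will have changed after the environment was rebuilt):

# confirm the clusterIP that runk.dnsServers should point at
kubectl get svc kuscia-lite-bob -n lite-bob -o jsonpath='{.spec.clusterIP}'
# check that "datamesh" resolves through that DNS server (53/UDP is exposed by the lite service)
nslookup datamesh 10.105.200.142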

aokaokd commented 2 months ago

Actually there is one more thing I'm unsure about: when configuring runk as the lite runtime I used the default kubeconfigFile and did not configure RBAC. I don't know whether that matters.

    # configured when runtime is runk
    runk:
      # schedule tasks into the specified institution K8s namespace
      namespace: lite-bob
      # pod DNS config for the institution's K8s cluster, used to resolve the node's application domain names; the DNS address used by pods started via runk should be the clusterIP of the kuscia-lite service, "1.1.1.1" is only an example
      dnsServers:
      # - kuscia-dns-lb-server
        - 10.105.200.142
      # kubeconfig of the K8s cluster; if empty, the default serviceaccount is used. For now leave it empty and use the serviceaccount
      kubeconfigFile:

I need to look at your K8s services. Go into your K8s cluster and run:

kubectl  get svc -n alice 
magic-hya commented 2 months ago
[root@kuscia-master-546445d874-rbztj kuscia]# kubectl  get svc -n alice
NAME                                                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
secretflow-task-20240624165905-single-psi-0-global   ClusterIP   None         <none>        22151/TCP   90m
secretflow-task-20240624165905-single-psi-0-fed      ClusterIP   None         <none>        22150/TCP   90m
secretflow-task-20240624165905-single-psi-0-spu      ClusterIP   None         <none>        22149/TCP   90m

This looks normal.

aokaokd commented 2 months ago

[root@kuscia-master-546445d874-rbztj kuscia]# — it looks like you ran this inside the container. You need to go outside the container and run the command below. Also, please tell me your sf (SecretFlow) version.

kubectl  get svc -n alice
magic-hya commented 2 months ago

Latest debugging error; the image used is secretflow-lite-anolis8:1.6.0b0

container[secretflow] terminated state reason "Error", message: "rgs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 514, in main
    sf_node_eval_param = preprocess_sf_node_eval_param(
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 271, in preprocess_sf_node_eval_param
    comp_def = get_comp_def(param.domain, param.name, param.version)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 146, in get_comp_def
    key in COMP_MAP
AssertionError: key data_prep/psi:0.0.4 is not in component list [data_prep/union:0.0.1, data_prep/train_test_split:0.0.1, data_prep/psi:0.0.5, ml.train/ss_sgd_train:0.0.1, ml.predict/ss_sgd_predict:0.0.2, data_filter/feature_filter:0.0.1, preprocessing/binary_op:0.0.2, feature/vert_binning:0.0.2, feature/vert_woe_binning:0.0.2, preprocessing/vert_bin_substitution:0.0.1, data_filter/condition_filter:0.0.1, stats/ss_vif:0.0.1, stats/ss_pearsonr:0.0.1, ml.eval/ss_pvalue:0.0.1, stats/table_statistics:0.0.2, stats/groupby_statistics:0.0.3, ml.eval/biclassification_eval:0.0.1, ml.eval/regression_eval:0.0.1, ml.eval/prediction_bias_eval:0.0.1, ml.predict/sgb_predict:0.0.3, ml.train/sgb_train:0.0.3, ml.predict/ss_xgb_predict:0.0.2, ml.train/ss_xgb_train:0.0.1, ml.predict/ss_glm_predict:0.0.2, ml.train/ss_glm_train:0.0.2, ml.train/slnn_train:0.0.1, ml.predict/slnn_predict:0.0.2, preprocessing/onehot_encode:0.0.3, preprocessing/substitution:0.0.2, preprocessing/case_when:0.0.1, preprocessing/fillna:0.0.1, io/read_data:0.0.1, io/write_data:0.0.1, preprocessing/feature_calculate:0.0.1, io/identity:0.0.1, model/model_export:0.0.1]
"
magic-hya commented 2 months ago

[root@kuscia-master-546445d874-rbztj kuscia]# — it looks like you ran this inside the container. You need to go outside the container and run the command below. Also, please tell me your sf (SecretFlow) version.

 kubectl  get svc -n alice
[root@k8s-master73 kuscia]#  kubectl  get svc -n lite-alice
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                           AGE
kuscia-lite-alice   ClusterIP   10.98.177.137   <none>        1080/TCP,80/TCP,53/UDP,8082/TCP   54m
[root@k8s-master73 kuscia]#  kubectl  get svc -n lite-bob
NAME              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                           AGE
kuscia-lite-bob   ClusterIP   10.101.181.171   <none>        1080/TCP,80/TCP,53/UDP,8082/TCP   43m
aokaokd commented 2 months ago

10.98.177.137

You need to, in your ...

最新调试错误,镜像用的secretflow-lite-anolis8:1.6.0b0

container[secretflow] terminated state reason "Error", message: "rgs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 514, in main
    sf_node_eval_param = preprocess_sf_node_eval_param(
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 271, in preprocess_sf_node_eval_param
    comp_def = get_comp_def(param.domain, param.name, param.version)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 146, in get_comp_def
    key in COMP_MAP
AssertionError: key data_prep/psi:0.0.4 is not in component list [data_prep/union:0.0.1, data_prep/train_test_split:0.0.1, data_prep/psi:0.0.5, ml.train/ss_sgd_train:0.0.1, ml.predict/ss_sgd_predict:0.0.2, data_filter/feature_filter:0.0.1, preprocessing/binary_op:0.0.2, feature/vert_binning:0.0.2, feature/vert_woe_binning:0.0.2, preprocessing/vert_bin_substitution:0.0.1, data_filter/condition_filter:0.0.1, stats/ss_vif:0.0.1, stats/ss_pearsonr:0.0.1, ml.eval/ss_pvalue:0.0.1, stats/table_statistics:0.0.2, stats/groupby_statistics:0.0.3, ml.eval/biclassification_eval:0.0.1, ml.eval/regression_eval:0.0.1, ml.eval/prediction_bias_eval:0.0.1, ml.predict/sgb_predict:0.0.3, ml.train/sgb_train:0.0.3, ml.predict/ss_xgb_predict:0.0.2, ml.train/ss_xgb_train:0.0.1, ml.predict/ss_glm_predict:0.0.2, ml.train/ss_glm_train:0.0.2, ml.train/slnn_train:0.0.1, ml.predict/slnn_predict:0.0.2, preprocessing/onehot_encode:0.0.3, preprocessing/substitution:0.0.2, preprocessing/case_when:0.0.1, preprocessing/fillna:0.0.1, io/read_data:0.0.1, io/write_data:0.0.1, preprocessing/feature_calculate:0.0.1, io/identity:0.0.1, model/model_export:0.0.1]
"

You can see that your psi component (data_prep/psi:0.0.4) is not in secretflow-lite-anolis8:1.6.0b0.

For this error, which script did you run? Please share it. Did you run the task with the officially provided script?

magic-hya commented 2 months ago

I ran it with the officially provided script, following the Kuscia 0.8.0 docs, executing scripts/user/create_example_job.sh

aokaokd commented 2 months ago
magic-hya commented 2 months ago
set -e

GREEN='\033[0;32m'
NC='\033[0m'
SUB_HOST_REGEXP="^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"

USAGE="$(basename "$0") [JOB_EXAMPLE] [JOB_NAME]
JOB_EXAMPLE:
    PSI                 run psi with default-data-source (default).
    NSJAIL_PSI          run psi via nsjail. Set env 'export ALLOW_PRIVILEGED=true' before deployment.
"
JOB_EXAMPLE=$1
JOB_NAME=$2

if [[ ${JOB_EXAMPLE} == "" ]]; then
  JOB_EXAMPLE="PSI"
fi

if [[ ${JOB_EXAMPLE} != "PSI" && ${JOB_EXAMPLE} != "PSI_WITH_DP" && ${JOB_EXAMPLE} != "NSJAIL_PSI" ]]; then
  printf "invalid arguments: JOB_EXAMPLE=%s\n\n%s" "${JOB_EXAMPLE}" "${USAGE}" >&2
  exit 1
fi

ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd -P)
pushd ${ROOT} || exit

SELF_DOMAIN_ID=${NAMESPACE}
if [[ $SELF_DOMAIN_ID == "" ]] ; then
  echo "can not get self domain id, please check NAMESPACE environment"
  exit 1
fi

INITIATOR=alice
if [[ $SELF_DOMAIN_ID == bob ]]; then
  INITIATOR=bob
fi

if [[ $JOB_NAME == "" ]]; then
  JOB_NAME=secretflow-task-$(date +"%Y%m%d%H%M%S")
fi
if [[ ! $JOB_NAME =~ ${SUB_HOST_REGEXP} ]]; then
  echo "job name should match ${SUB_HOST_REGEXP}"
  exit 1
fi

TASK_INPUT_CONFIG=""
if [[ $JOB_EXAMPLE == "PSI_WITH_DP" ]]; then
  TASK_INPUT_CONFIG=$(jq -c . <"scripts/templates/task_input_config.2pc_balanced_psi_dp.json")
else
  TASK_INPUT_CONFIG=$(jq -c . <"scripts/templates/task_input_config.2pc_balanced_psi.json")
fi
ESCAPE_TASK_INPUT_CONFIG=$(echo $TASK_INPUT_CONFIG | sed "s~[\]~\\\&~g")

APP_IMAGE=""
case ${JOB_EXAMPLE} in
"PSI")
  APP_IMAGE="secretflow-image"
  ;;
"PSI_WITH_DP")
  APP_IMAGE="secretflow-image"
  ;;
"NSJAIL_PSI")
  APP_IMAGE="secretflow-nsjail-image"
  ;;
esac
echo -e "With JOB_EXAMPLE=${JOB_EXAMPLE}, job via APP_IMAGE=${APP_IMAGE} creating ..."

template=$(sed "s~{{.JOB_NAME}}~${JOB_NAME}~g;s~{{.TASK_INPUT_CONFIG}}~${ESCAPE_TASK_INPUT_CONFIG}~g;s~{{.Initiator}}~${INITIATOR}~g;s~{{.APP_IMAGE}}~${APP_IMAGE}~g" <"scripts/templates/job.2pc_balanced_psi.yaml")

echo "$template" | kubectl apply -f -

echo -e "${GREEN}Job '$JOB_NAME' created successfully. You can use the following command to display job status:
  kubectl get kj -n cross-domain${NC}"

popd || exit
zimu-yuxi commented 2 months ago

In the kuscia container, check whether version in scripts/templates/task_input_config.2pc_balanced_psi_dp.json is 0.0.4; if so, change it to 0.0.5.
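
A sketch of how that check and fix could be done inside the kuscia container; the grep/sed below assume GNU tools and that the template stores the field as "version": "0.0.4". For the plain PSI example, the non-dp template task_input_config.2pc_balanced_psi.json is the one actually rendered by create_example_job.sh:

# show the component version referenced by the job template
grep -n '"version"' scripts/templates/task_input_config.2pc_balanced_psi.json
# bump 0.0.4 to 0.0.5 so it matches the psi component shipped in secretflow 1.6.0b0
sed -i 's/"version": *"0.0.4"/"version": "0.0.5"/' scripts/templates/task_input_config.2pc_balanced_psi.json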

magic-hya commented 2 months ago

Now I've run into another problem. Because the data has to be local data, I configured runp, and runp uses the kuscia-secretflow image. lite-alice now fails on startup. Kuscia is the latest 0.9.0, kuscia-secretflow is latest, and the private keys were all generated with Kuscia 0.7.0b0. The error log is below:

[root@k8s-master73 kuscia]# kubectl logs kuscia-lite-alice-548c56bd7d-pdj5v -n lite-alice --all-containers
2024-06-25 18:20:41.651 INFO modules/modules.go:204 Start to init all secret backends ...
2024-06-25 18:20:41.651 WARN modules/modules.go:211 Init all secret backend but no provider found, creating default mem type
2024-06-25 18:20:41.651 INFO modules/modules.go:216 Finish Initializing all secret backends
2024-06-25 18:20:41.652 INFO tls/crypt.go:313 Generate cert with key, subject[alice]
2024-06-25 18:20:41.655 INFO tls/crypt.go:313 Generate cert with key, subject[alice]
2024-06-25 18:20:41.658 INFO modules/coredns.go:180 Start preparing coredns resolv.conf, root dir /home/kuscia/
2024-06-25 18:20:41.658 INFO modules/coredns.go:203 Finish preparing coredns resolv.conf
.:53
2024-06-25 18:20:41.659 INFO modules/coredns.go:169 coredns is ready
2024-06-25 18:20:41.660 INFO xds/xds.go:154 Management server listening on 10001
2024-06-25 18:20:41.660 INFO supervisor/supervisor.go:72 [envoy] start and watch subprocess
2024-06-25 18:20:41.660 INFO supervisor/supervisor.go:79 [envoy] try to start new process
2024-06-25 18:20:41.672 INFO commands/root.go:60 Start xds success
2024-06-25 18:20:41.672 INFO xds/cluster_config.go:275 disable keep-alive for cluster:service-masterproxy
2024-06-25 18:20:41.672 INFO xds/xds.go:426 Add cluster:service-masterproxy
2024-06-25 18:20:41.672 INFO clusters/master.go:69 add Master cluster:masterproxy
2024-06-25 18:20:42.660 INFO modules/transport.go:124 transport is ready
2024-06-25 18:20:42.661 INFO modules/envoy.go:183 Envoy is ready
2024-06-25 18:20:43.679 INFO clusters/master.go:309 Get master gateway namespace: kuscia-system
2024-06-25 18:20:44.725 INFO xds/cluster_config.go:275 enable keep-alive for cluster:service-masterproxy
2024-06-25 18:20:44.725 INFO xds/xds.go:426 Add cluster:service-masterproxy
2024-06-25 18:20:45.747 INFO commands/root.go:178 Check MasterProxy ready
2024-06-25 18:20:45.747 INFO commands/root.go:94 Add master clusters success
2024-06-25 18:20:45.747 INFO xds/xds.go:426 Add cluster:service-transport
2024-06-25 18:20:45.747 INFO clusters/interconn.go:70 Add Transport Cluster success
2024-06-25 18:20:45.747 INFO commands/root.go:106 Add interconn clusters success
2024-06-25 18:20:45.749 INFO controller/gateway.go:91 Starting Gateway controller
2024-06-25 18:20:45.749 INFO controller/endpoints.go:122 Waiting for informer caches to sync
2024-06-25 18:20:45.749 INFO nlog/nlog.go:77 I0625 18:20:45.749974       7 shared_informer.go:270] Waiting for caches to sync for endpoints
2024-06-25 18:20:45.750 INFO commands/root.go:166 Gateway running
2024-06-25 18:20:45.750 INFO controller/domain_route.go:184 Starting DomainRoute controller
2024-06-25 18:20:45.750 INFO modules/domainroute.go:136 domainroute is ready
2024-06-25 18:20:45.750 INFO controller/domain_route.go:187 Waiting for informer caches to sync
2024-06-25 18:20:45.750 INFO nlog/nlog.go:77 I0625 18:20:45.750978       7 shared_informer.go:270] Waiting for caches to sync for pod
2024-06-25 18:20:45.751 INFO nlog/nlog.go:77 I0625 18:20:45.751006       7 shared_informer.go:270] Waiting for caches to sync for endpoints
2024-06-25 18:20:45.751 INFO commands/root.go:47 Run root command, Namespace=alice
2024-06-25 18:20:45.751 INFO plugin/plugin.go:71 Init plugin hook:cert-issuance succeed
2024-06-25 18:20:45.751 INFO plugin/plugin.go:71 Init plugin hook:config-render succeed
2024-06-25 18:20:45.758 INFO node/capacity_manager.go:51 Capacity Manager, cfg:&{4 4Gi 500 100Gi}, rootDir: /home/kuscia/, localCapacity:true
2024-06-25 18:20:45.759 INFO framework/node_controller.go:165 Configure node kuscia-lite-alice-548c56bd7d-pdj5v
2024-06-25 18:20:45.777 INFO node/generic_node.go:70 Configure generic node "kuscia-lite-alice-548c56bd7d-pdj5v" successfully
2024-06-25 18:20:45.825 INFO framework/node_controller.go:567 Created new lease, name=kuscia-lite-alice-548c56bd7d-pdj5v
2024-06-25 18:20:45.825 INFO framework/node_controller.go:202 Node controller started
2024-06-25 18:20:45.826 INFO process/process.go:67 Process runtime initialized
2024-06-25 18:20:45.826 INFO kuberuntime/kuberuntime_manager.go:190 Container runtime initialized, containerRuntime=runp, version=6994ca0, apiVersion=6994ca0
2024-06-25 18:20:45.826 INFO source/apiserver.go:82 Start running apiserver source
2024-06-25 18:20:45.850 INFO nlog/nlog.go:77 I0625 18:20:45.850555       7 shared_informer.go:277] Caches are synced for endpoints
2024-06-25 18:20:45.850 INFO controller/endpoints.go:127 Starting endpoints Controller
2024-06-25 18:20:45.850 INFO controller/domain_route.go:194 Starting workers
2024-06-25 18:20:45.850 INFO controller/domain_route.go:199 Started workers
2024-06-25 18:20:45.850 INFO controller/domain_route.go:287 DomainRoute alice/alice-kuscia-system starts handshake, the last revision is 0
2024-06-25 18:20:45.851 INFO nlog/nlog.go:77 I0625 18:20:45.851106       7 shared_informer.go:277] Caches are synced for pod
2024-06-25 18:20:45.851 INFO coredns/controller.go:52 Starting pod controller, namespace: alice
2024-06-25 18:20:45.851 INFO nlog/nlog.go:77 I0625 18:20:45.851261       7 shared_informer.go:277] Caches are synced for endpoints
2024-06-25 18:20:45.851 INFO coredns/controller.go:42 Starting endpoint controller, namespace: alice
2024-06-25 18:20:45.851 INFO nlog/nlog.go:77 E0625 18:20:45.851335       7 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 474 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x513a3a0?, 0xc0025d63a8})
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0027cace0?})
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/runtime/runtime.go:49 +0x75
panic({0x513a3a0, 0xc0025d63a8})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).sourceInitiateHandShake(0xc0025b0200, 0xc001e40a80)
        /root/project/pkg/gateway/controller/handshake.go:319 +0x1b0f
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).syncHandler.func1(0xc001e40a80, {0xc002620dc0?, 0x19?}, 0x4bad6a0?)
        /root/project/pkg/gateway/controller/domain_route.go:288 +0xd4
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).syncHandler(0xc0025b0200, {0x5d3db80, 0xc000228120}, {0xc002620dc0, 0x19})
        /root/project/pkg/gateway/controller/domain_route.go:289 +0x50e
github.com/secretflow/kuscia/pkg/utils/queue.HandleQueueItem.func1({0x4bad6a0?, 0xc0027cace0})
        /root/project/pkg/utils/queue/queue.go:106 +0x227
github.com/secretflow/kuscia/pkg/utils/queue.HandleQueueItem({0x5d3db80, 0xc000228120}, {0x53ed29b, 0x12}, {0x5d568e0, 0xc00055b140}, 0xc002213e58, 0x10)
        /root/project/pkg/utils/queue/queue.go:131 +0x1b1
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).runWorker(0xc0025b0200)
        /root/project/pkg/gateway/controller/domain_route.go:243 +0x6f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x5d08f00, 0xc002746a50}, 0x1, 0xc000fa60c0)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:92 +0x25
created by github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).Run
        /root/project/pkg/gateway/controller/domain_route.go:196 +0x2fe
(the same "Observed a panic: runtime error: index out of range [0] with length 0" stack trace is logged several more times; repeated traces omitted)
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 474 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0027cace0?})
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x513a3a0, 0xc0025d63a8})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).sourceInitiateHandShake(0xc0025b0200, 0xc001e40a80)
        /root/project/pkg/gateway/controller/handshake.go:319 +0x1b0f
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).syncHandler.func1(0xc001e40a80, {0xc002620dc0?, 0x19?}, 0x4bad6a0?)
        /root/project/pkg/gateway/controller/domain_route.go:288 +0xd4
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).syncHandler(0xc0025b0200, {0x5d3db80, 0xc000228120}, {0xc002620dc0, 0x19})
        /root/project/pkg/gateway/controller/domain_route.go:289 +0x50e
github.com/secretflow/kuscia/pkg/utils/queue.HandleQueueItem.func1({0x4bad6a0?, 0xc0027cace0})
        /root/project/pkg/utils/queue/queue.go:106 +0x227
github.com/secretflow/kuscia/pkg/utils/queue.HandleQueueItem({0x5d3db80, 0xc000228120}, {0x53ed29b, 0x12}, {0x5d568e0, 0xc00055b140}, 0xc002213e58, 0x10)
        /root/project/pkg/utils/queue/queue.go:131 +0x1b1
github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).runWorker(0xc0025b0200)
        /root/project/pkg/gateway/controller/domain_route.go:243 +0x6f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x5d08f00, 0xc002746a50}, 0x1, 0xc000fa60c0)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
        /root/gopath/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:92 +0x25
created by github.com/secretflow/kuscia/pkg/gateway/controller.(*DomainRouteController).Run
        /root/project/pkg/gateway/controller/domain_route.go:196 +0x2fe
aokaokd commented 2 months ago

This looks like an array index out of range. I suspect it's a version issue; Kuscia 0.9.0 has only just been released. You can use Kuscia 0.8.0 together with SecretFlow 1.6.1b0. The version mapping is listed here: versions

magic-hya commented 2 months ago

Is there a registry inside China that has synced the images? I cannot pull secretflow/anolis8-python:3.10.13.

aokaokd commented 2 months ago

The Alibaba Cloud registry should have what you need; give it a try.

magic-hya commented 2 months ago

Building on an Alibaba Cloud overseas machine fails:

#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/secretflow/anolis8-python/manifests/sha256:9d5a45684ebe73c55c9b59dcd38512fc44e5374faca670562d28da4d1304cafb: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Even with a registry mirror accelerator, the image still cannot be downloaded.

aokaokd commented 2 months ago

Regarding your error:

Building on an Alibaba Cloud overseas machine fails:

#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/secretflow/anolis8-python/manifests/sha256:9d5a45684ebe73c55c9b59dcd38512fc44e5374faca670562d28da4d1304cafb: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Even with a registry mirror accelerator, the image still cannot be downloaded.

In fact there is no secretflow/anolis8-python image; you can refer to these image addresses instead:

secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.8.0b0
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-anolis8
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/serving-anolis8:0.3.1b0
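
A sketch of pulling from that Aliyun registry directly and retagging locally; the tags below are the ones listed above, so pick whichever versions match your deployment:

docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.8.0b0
docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-anolis8
# optional: retag to the short name a local build script may expect
docker tag secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.8.0b0 secretflow/kuscia:0.8.0b0
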
magic-hya commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn

The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists.

The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key
zimu-yuxi commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn

The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists.

The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key

Please provide the image versions.

magic-hya commented 2 months ago

master image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.8.0b0
lite image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow:0.8.0b0

aokaokd commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn

The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists.

The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key

Question 1:

If TLS is set at deployment time, the token for accessing the Kuscia API is generated.

Question 2:

Currently the Kuscia API client certificates (kusciaApiServer.crt, kusciaApiServer.key, etc.) are not generated by default; you need to run this script

PlanetAndMars commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists. The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key

Question 1:

If TLS is set at deployment time, the token for accessing the Kuscia API is generated.

Question 2:

Currently the Kuscia API client certificates (kusciaApiServer.crt, kusciaApiServer.key, etc.) are not generated by default; you need to run this script

If protocol: NOTLS is selected in configmap.yaml at deployment time, can the Kuscia API be called over plain HTTP, ignoring the certificates for now?

aokaokd commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists. The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key

Question 1:

If TLS is set at deployment time, the token for accessing the Kuscia API is generated.

Question 2:

Currently the Kuscia API client certificates (kusciaApiServer.crt, kusciaApiServer.key, etc.) are not generated by default; you need to run this script

If protocol: NOTLS is selected in configmap.yaml at deployment time, can the Kuscia API be called over plain HTTP, ignoring the certificates for now?

Yes, you can do that.
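
For illustration, with protocol: NOTLS the earlier domaindatagrant/create call can be sent over plain HTTP simply by dropping the certificate and token options; the endpoint, port and payload below are the ones already used in this thread (no token header is assumed, since the token is only generated when TLS is enabled):

curl -X POST 'http://127.0.0.1:8082/api/v1/domaindatagrant/create' \
     --header 'Content-Type: application/json' \
     -d '{ "grant_domain": "bob",
           "description": {"domaindatagrant":"alice-bob"},
           "domain_id": "alice",
           "domaindata_id": "alice-table"
     }'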

magic-hya commented 2 months ago

I generated the certificates on the master node with the script. The next step is preparing the data: I need to create local data for both parties, but the example creates OSS data and the guide has no instructions for creating local data.

To get the whole flow working, I'm currently stuck at creating the data in runp mode and launching the job.

aokaokd commented 2 months ago

You can first create a DomainDataSource, then a DomainData, then a DomainDataGrant to bind the data owner. See the official docs: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/reference/concepts/domaindatagrant_cn
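
As a rough sketch of that sequence via the Kuscia API, assuming the default local datasource and illustrative file/column names; the exact request fields should be verified against the DomainData reference for your version before use:

# 1. register alice's local CSV as a DomainData in the default datasource (field names per my reading of the docs)
curl -X POST 'http://127.0.0.1:8082/api/v1/domaindata/create' \
     --header 'Content-Type: application/json' \
     -d '{ "domain_id": "alice",
           "domaindata_id": "alice-table",
           "datasource_id": "default-data-source",
           "name": "alice.csv",
           "type": "table",
           "relative_uri": "alice.csv",
           "columns": [{"name": "id1", "type": "str"}, {"name": "age", "type": "float"}]
     }'
# 2. repeat for bob's table, then bind ownership with the domaindatagrant/create call shown earlier in this thread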

magic-hya commented 2 months ago

The official docs mention a certificate directory, but when I open it there are no such certificates. Do these certificates need to be generated by myself? Is there a generation script? Reference doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.8.0b0/tutorial/run_sf_job_with_api_cn The Kuscia API uses mutual HTTPS, so you need to set up mutual TLS in your client library. In the centralized networking mode, the certificate files should be under /home/kuscia/var/certs/ on the ${USER}-kuscia-master node: kusciaapi-server.key, kusciaapi-server.crt. Neither of these nor the token file exists. The directory contains only 4 files:

ca.crt
ca.key
domain.crt
domain.key

Question 1:

If TLS is set at deployment time, the token for accessing the Kuscia API is generated.

Question 2:

Currently the Kuscia API client certificates (kusciaApiServer.crt, kusciaApiServer.key, etc.) are not generated by default; you need to run this script

If protocol: NOTLS is selected in configmap.yaml at deployment time, can the Kuscia API be called over plain HTTP, ignoring the certificates for now?

Yes, you can do that.

There is only a script for the client side; for the server certificates, do I just modify that script?

magic-hya commented 2 months ago

After enabling TLS on master, the certificates were generated automatically under that directory, but deploying the alice node now fails and the pod cannot start. The error is as follows:

2024-07-11 18:34:53.660 INFO modules/coredns.go:169 coredns is ready
2024-07-11 18:34:53.662 INFO supervisor/supervisor.go:72 [envoy] start and watch subprocess
2024-07-11 18:34:53.662 INFO supervisor/supervisor.go:79 [envoy] try to start new process
2024-07-11 18:34:53.821 INFO xds/xds.go:160 Management server listening on 10001
2024-07-11 18:34:53.828 INFO commands/root.go:62 Start xds success
2024-07-11 18:34:53.830 ERROR modules/domainroute.go:135 [PROBE] failed to probe master endpoint http://kuscia-master.kuscia-master.svc.cluster.local:1080, detail-> sending request error: Get "http://kuscia-master.kuscia-master.svc.cluster.local:1080": EOF
2024-07-11 18:34:53.830 ERROR modules/domainroute.go:140 domain route wait ready failed with error: context canceled
2024-07-11 18:34:53.830 ERROR modules/transport.go:123 context canceled
2024-07-11 18:34:53.830 ERROR modules/envoy.go:204 context canceled
2024-07-11 18:34:53.830 FATAL http/server.go:88 Transport server has been canceled
aokaokd commented 2 months ago

It looks like the master node cannot be reached. Try modifying masterEndpoint in the configmap of your lite node; remember that after changing the configmap you need to start a new pod.
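
A sketch of the change to try, assuming the master gateway now serves HTTPS on the same port after TLS was enabled (the configmap name and label below are illustrative, not the actual resource names):

# edit the lite configmap and switch the scheme of masterEndpoint
kubectl edit configmap kuscia-lite-alice-cm -n lite-alice      # illustrative configmap name
#   masterEndpoint: http://kuscia-master.kuscia-master.svc.cluster.local:1080
#   -> masterEndpoint: https://kuscia-master.kuscia-master.svc.cluster.local:1080
# recreate the lite pod so it picks up the new config
kubectl delete pod -n lite-alice -l app=kuscia-lite-alice      # illustrative label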

magic-hya commented 2 months ago

In the lite config I still have masterEndpoint: http://kuscia-master.kuscia-master.svc.cluster.local:1080. After switching the master to TLS it can no longer be reached; I don't know what else needs to be configured.