secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Task stays Pending when running the test job in K8s peer-to-peer RunP mode #371

Open PlanetAndMars opened 1 month ago

PlanetAndMars commented 1 month ago

Issue Type

Others

Search for existing issues similar to yours

No

Kuscia Version

latest

Link to Relevant Documentation

No response

Question Details

I start the pod with the deployment.yaml below. The image tag is latest.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuscia-autonomy-alice
  namespace: autonomy-alice
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kuscia-autonomy-alice
  template:
    metadata:
      labels:
        app: kuscia-autonomy-alice
    spec:
      containers:
        - command:
            - tini
            - --
            - kuscia
            - start
            - -c
            - etc/conf/kuscia.yaml
          env:
            - name: REGISTRY_ENDPOINT
              value: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow
            - name: NAMESPACE
              value: alice
            - name: TZ
              value: Asia/Shanghai
          image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow:latest
          imagePullPolicy: Always
          name: alice
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /home/kuscia/var/tmp
              name: kuscia-var-tmp
            - mountPath: /home/kuscia/etc/conf/kuscia.yaml
              name: kuscia-config
              subPath: kuscia.yaml
          workingDir: /home/kuscia
      automountServiceAccountToken: false
      volumes:
        - emptyDir: {}
          name: kuscia-var-tmp
        - configMap:
            defaultMode: 420
            name: kuscia-autonomy-alice-cm
          name: kuscia-config

After entering the pod, I create the AppImage with the AppImage.yaml below. The image tag is 1.7.0b0:

apiVersion: kuscia.secretflow/v1alpha1
kind: AppImage
metadata:
  name: secretflow-image
spec:
  configTemplates:
    task-config.conf: |
      {
        "task_id": "{{.TASK_ID}}",
        "task_input_config": "{{.TASK_INPUT_CONFIG}}",
        "task_cluster_def": "{{.TASK_CLUSTER_DEFINE}}",
        "allocated_ports": "{{.ALLOCATED_PORTS}}"
      }
  deployTemplates:
  - name: secretflow
    replicas: 1
    spec:
      containers:
      - args:
        - -c
        - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
        command:
        - sh
        configVolumeMounts:
        - mountPath: /root/kuscia/task-config.conf
          subPath: task-config.conf
        name: secretflow
        ports:
        - name: spu
          port: 20000
          protocol: GRPC
          scope: Cluster
        - name: fed
          port: 20001
          protocol: GRPC
          scope: Cluster
        - name: global
          port: 20002
          protocol: GRPC
          scope: Domain
        - name: node-manager
          port: 20003
          protocol: GRPC
          scope: Local
        - name: object-manager
          port: 20004
          protocol: GRPC
          scope: Local
        - name: client-server
          port: 20005
          protocol: GRPC
          scope: Local
        workingDir: /root
      restartPolicy: Never
  image:
    id: abc
    name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
    sign: abc
    tag: 1.7.0b0

I run the script scripts/user/create_example_job.sh to start the test job, and the task stays Pending.

Task details:

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-07-09T06:16:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: kuscia
    kuscia.secretflow/job-id: secretflow-task-20240709141636
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: single-psi
  name: secretflow-task-20240709141636-single-psi
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: secretflow-task-20240709141636
    uid: aaa2b5b8-c4c4-4020-88c7-22223ec8df4f
  resourceVersion: "5547"
  uid: 6078ddb9-96bb-48f4-bee9-c4ec0316ce49
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
    \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"preprocessing","name":"psi","version":"0.0.1","attr_paths":["input/receiver_input/key","input/sender_input/key","protocol","precheck_input","bucket_size","curve_type"],"attrs":[{"ss":["id1"]},{"ss":["id2"]},{"s":"ECDH_PSI_2PC"},{"b":true},{"i64":"1048576"},{"s":"CURVE_FOURQ"}]},"sf_input_ids":["alice-table","bob-table"],"sf_output_ids":["psi-output"],"sf_output_uris":["psi-output.csv"]}'
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      secretflow-task-20240709141636-single-psi-0/client-server: 24276
      secretflow-task-20240709141636-single-psi-0/fed: 24272
      secretflow-task-20240709141636-single-psi-0/global: 24273
      secretflow-task-20240709141636-single-psi-0/node-manager: 24274
      secretflow-task-20240709141636-single-psi-0/object-manager: 24275
      secretflow-task-20240709141636-single-psi-0/spu: 24277
  - domainID: bob
    namedPort:
      secretflow-task-20240709141636-single-psi-0/client-server: 31964
      secretflow-task-20240709141636-single-psi-0/fed: 31966
      secretflow-task-20240709141636-single-psi-0/global: 31967
      secretflow-task-20240709141636-single-psi-0/node-manager: 31968
      secretflow-task-20240709141636-single-psi-0/object-manager: 31963
      secretflow-task-20240709141636-single-psi-0/spu: 31965
  conditions:
  - lastTransitionTime: "2024-07-09T06:16:37Z"
    status: "True"
    type: ResourceCreated
  lastReconcileTime: "2024-07-09T06:26:41Z"
  phase: Pending
  podStatuses:
    alice/secretflow-task-20240709141636-single-psi-0:
      createTime: "2024-07-09T06:16:37Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: alice
      nodeName: kuscia-autonomy-alice-66cfbb85b-65kdf
      podName: secretflow-task-20240709141636-single-psi-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-07-09T06:16:40Z"
    bob/secretflow-task-20240709141636-single-psi-0:
      createTime: "2024-07-09T06:16:38Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: bob
      podName: secretflow-task-20240709141636-single-psi-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-07-09T06:16:40Z"
  serviceStatuses:
    alice/secretflow-task-20240709141636-single-psi-0-fed:
      createTime: "2024-07-09T06:16:38Z"
      namespace: alice
      portName: fed
      portNumber: 24272
      readyTime: "2024-07-09T06:16:41Z"
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-fed
    alice/secretflow-task-20240709141636-single-psi-0-global:
      createTime: "2024-07-09T06:16:38Z"
      namespace: alice
      portName: global
      portNumber: 24273
      readyTime: "2024-07-09T06:16:41Z"
      scope: Domain
      serviceName: secretflow-task-20240709141636-single-psi-0-global
    alice/secretflow-task-20240709141636-single-psi-0-spu:
      createTime: "2024-07-09T06:16:37Z"
      namespace: alice
      portName: spu
      portNumber: 24277
      readyTime: "2024-07-09T06:16:41Z"
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-spu
    bob/secretflow-task-20240709141636-single-psi-0-fed:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: fed
      portNumber: 31966
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-fed
    bob/secretflow-task-20240709141636-single-psi-0-global:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: global
      portNumber: 31967
      scope: Domain
      serviceName: secretflow-task-20240709141636-single-psi-0-global
    bob/secretflow-task-20240709141636-single-psi-0-spu:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: spu
      portNumber: 31965
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-spu
  startTime: "2024-07-09T06:16:37Z"

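The error says the image cannot be found, and switching the image to 1.6.0b0 produces the same error. However, the 1.7.0b0 image does exist on the host. [image]

[image]

Any help would be appreciated.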
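The task details above can be reduced to the few fields that matter for this report. The following is a minimal sketch, not part of Kuscia: it assumes the KusciaTask has been loaded into a Python dict, and it relies only on the field names visible in the output above (status.podStatuses, podPhase, reason, message). The function name and the abbreviated sample message are made up for illustration.

```python
import re

# Extracts the image reference from the waiting message. The quotes around
# the image may or may not be preceded by a literal backslash, depending on
# how the message was serialized, so the backslash is optional here.
IMAGE_RE = re.compile(r'Failed to inspect image \\?"([^"\\]+)\\?"')


def summarize_pending_pods(task):
    """Return one summary dict per pod that is neither Running nor Succeeded."""
    summaries = []
    pod_statuses = task.get("status", {}).get("podStatuses", {})
    for key, pod in sorted(pod_statuses.items()):
        if pod.get("podPhase") in ("Running", "Succeeded"):
            continue
        match = IMAGE_RE.search(pod.get("message", ""))
        summaries.append({
            "pod": key,
            "phase": pod.get("podPhase"),
            "reason": pod.get("reason"),
            "image": match.group(1) if match else None,
        })
    return summaries


# Sample input trimmed from the task details above (message abbreviated).
SAMPLE_TASK = {
    "status": {
        "phase": "Pending",
        "podStatuses": {
            "alice/secretflow-task-20240709141636-single-psi-0": {
                "podPhase": "Pending",
                "reason": "ImageInspectError",
                "message": (
                    'container[secretflow] waiting state reason: "ImageInspectError", '
                    'message: "Failed to inspect image '
                    '\\"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/'
                    'secretflow-lite-anolis8:1.7.0b0\\": ... not exist in local repository"'
                ),
            },
        },
    }
}

for row in summarize_pending_pods(SAMPLE_TASK):
    print(row)
```

For a real task, the dict could come from the task object exported with kubectl's `-o json` output option and parsed with `json.load`. The summary makes it easy to see that every party is blocked on the same image reference.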

aokaokd commented 1 month ago

Could you retry with the new version and see?

PlanetAndMars commented 1 month ago

> Could you retry with the new version and see?

I am already using the latest version. What do you mean by "the new version"?

aokaokd commented 1 month ago

OK. It looks like the problem occurs when Kuscia pulls the secretflow image. We will check on our side.

aokaokd commented 1 month ago

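The Alibaba Cloud image registry was unstable just now. This error occurs because the image was not pulled from the remote registry (secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow) to the local repository. The registry should have recovered, so it should work now.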

PlanetAndMars commented 1 month ago

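> The Alibaba Cloud image registry was unstable just now. This error occurs because the image was not pulled from the remote registry (secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow) to the local repository. The registry should have recovered, so it should work now.

I just tried again and still get the same error. [image]

The combination of 0.8.0b0 and 1.6.0b0 works fine, but latest combined with either 1.7.0b0 or 1.6.0b0 does not.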

aokaokd commented 1 month ago

After deploying, could you go into the pod and run the k8s apply manually?

PlanetAndMars commented 1 month ago

> After deploying, could you go into the pod and run the k8s apply manually?

Do you mean kubectl apply -f AppImage.yaml? Yes, I did run that.

magic-hya commented 1 month ago

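I ran into this problem too. I configured a private registry, but the image is still looked up locally and I get the same error. How was this resolved?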

zimu-yuxi commented 1 month ago

1. Private registry: this configuration is not available yet.
2. K8s RunP does not pull remote images. Please confirm that your deployment uses the kuscia-secretflow image.

magic-hya commented 1 month ago

It is RunP. If remote images cannot be pulled, how can the AppImage's image be brought in?

zimu-yuxi commented 1 month ago

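> It is RunP. If remote images cannot be pulled, how can the AppImage's image be brought in?

The sf image is already packaged into the kuscia-secretflow image when it is built. So please confirm that the deployment uses the kuscia-secretflow image.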
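If RunP only uses images that are already present locally, the reference built from the AppImage (spec.image.name plus spec.image.tag) has to match one of them exactly. The following is a rough sketch of that check under this assumption, not Kuscia's actual lookup logic: the helper names are made up, and the list of local references is a hypothetical placeholder because the thread does not say which tag is bundled.

```python
def app_image_ref(app_image):
    """Build the 'name:tag' reference from an AppImage's spec.image section."""
    image = app_image["spec"]["image"]
    return f'{image["name"]}:{image["tag"]}'


def is_available_locally(app_image, local_refs):
    """Return True if the AppImage's reference is among the local references."""
    return app_image_ref(app_image) in set(local_refs)


# The image section of the AppImage posted in this issue.
APP_IMAGE = {
    "spec": {
        "image": {
            "name": "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8",
            "tag": "1.7.0b0",
        }
    }
}

# Hypothetical placeholder: the real bundled tag is not stated in the thread.
HYPOTHETICAL_LOCAL_REFS = [
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:0.0.0-example",
]

print(app_image_ref(APP_IMAGE))
print(is_available_locally(APP_IMAGE, HYPOTHETICAL_LOCAL_REFS))
```

Under this assumption, a tag mismatch alone would be enough to produce a "not exist in local repository" message, even when the same tag is present on the host outside the pod.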

magic-hya commented 1 month ago

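[image] This image is kuscia-secretflow. But don't we still have to specify an AppImage image afterwards to run the task? Right now, when the task runs, it tries to pull the image specified in the AppImage.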

github-actions[bot] commented 1 week ago

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.