secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Error when AppImage pulls the image remotely in a kuscia v0.11.0b0 deployment #411

Closed magic-hya closed 4 weeks ago

magic-hya commented 2 months ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

CentOS Linux 7

Kuscia Version

kuscia v0.11.0b0, centralized deployment on k8s

Deployment

k8s

deployment Version

k8s 1.22.2

App Running type

secretflow

App Running version

secretflow/secretflow-lite-anolis8:1.7.0b0

Configuration file used to run kuscia.

Partial configuration from lite-alice.yaml:

    runtime: runp
    # Agent image configuration
    image:
      pullPolicy: remote
      defaultRegistry: "harbor"
      registries:
        - name: "harbor"
          endpoint: "harbor.com/secretflow"
          username: "admin"
          password: "Harbor12345"

### What happened and what you expected to happen

```shell
# Ran the example job script
scripts/user/create_example_job.sh
# The task hangs
# The logs show that the image cannot be pulled
```

Kuscia log output.

The task hangs:

$ kubectl get kt
NAME                                        STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240830112925-single-psi   2d23h                        2m22s               Pending

The pod reports an image error:

$ kubectl get pod -n alice
NAME                                          READY   STATUS              RESTARTS   AGE
secretflow-task-20240830112925-single-psi-0   0/1     ImageInspectError   0          2d23h

The logs show an access error:

$ kubectl logs secretflow-task-20240830112925-single-psi-0 -n alice
Error from server: Get "https://192.168.30.158:10250/containerLogs/alice/secretflow-task-20240830112925-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.30.158:10250, code 502: 502 Bad Gateway

describe shows that the image does not exist in the local repository, so remote pull mode did not take effect:

$ kubectl describe kt secretflow-task-20240830112925-single-psi -n alice
Message:      container[secretflow] waiting state reason: "ImageInspectError", message: "Failed to inspect image \"harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0\": failed to get image \"harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0\" manifest, detail-> image \"harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0\" not exist in local repository"
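
383004576 commented 2 months ago

Hello. Please first check whether the image can be pulled directly on the local machine. Also, enter the kuscia pod and provide the AppImage configuration for sf (SecretFlow):

kubectl get Appimage
kubectl get Appimage xxx -oyaml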

magic-hya commented 2 months ago

Pulling the image on the host:

$ docker pull harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0
1.7.0b0: Pulling from secretflow/secretflow-lite-anolis8
Digest: sha256:9c2ea53baf6f252d31cc7fc46cbd878b85321d3edc1009637ac96a37088fd8a2
Status: Image is up to date for harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0
harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0

Entering the kuscia master node:

$ kubectl exec -it kuscia-master-55bffb8764-l7nn5 -n kuscia -- bash
$ kubectl get Appimage
NAME               AGE
secretflow-image   3d22h

AppImage details:

$ kubectl get Appimage secretflow-image -oyaml
apiVersion: kuscia.secretflow/v1alpha1
kind: AppImage
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"AppImage","metadata":{"annotations":{},"name":"secretflow-image"},"spec":{"configTemplates":{"task-config.conf":"{\n  \"task_id\": \"{{.TASK_ID}}\",\n  \"task_input_config\": \"{{.TASK_INPUT_CONFIG}}\",\n  \"task_cluster_def\": \"{{.TASK_CLUSTER_DEFINE}}\",\n  \"allocated_ports\": \"{{.ALLOCATED_PORTS}}\"\n}\n"},"deployTemplates":[{"name":"secretflow","replicas":1,"spec":{"containers":[{"args":["-c","python -m secretflow.kuscia.entry ./kuscia/task-config.conf"],"command":["sh"],"configVolumeMounts":[{"mountPath":"/root/kuscia/task-config.conf","subPath":"task-config.conf"}],"name":"secretflow","ports":[{"name":"spu","port":20000,"protocol":"GRPC","scope":"Cluster"},{"name":"fed","port":20001,"protocol":"GRPC","scope":"Cluster"},{"name":"global","port":20002,"protocol":"GRPC","scope":"Domain"},{"name":"node-manager","port":20003,"protocol":"GRPC","scope":"Local"},{"name":"object-manager","port":20004,"protocol":"GRPC","scope":"Local"},{"name":"client-server","port":20005,"protocol":"GRPC","scope":"Local"}],"workingDir":"/root"}],"restartPolicy":"Never"}}],"image":{"id":"abc","name":"harbor.com/secretflow/secretflow-lite-anolis8","sign":"abc","tag":"1.7.0b0"}}}
  creationTimestamp: "2024-08-30T03:21:20Z"
  generation: 1
  name: secretflow-image
  resourceVersion: "229735"
  uid: 690b08d6-2064-4feb-8107-94005cbfe166
spec:
  configTemplates:
    task-config.conf: |
      {
        "task_id": "{{.TASK_ID}}",
        "task_input_config": "{{.TASK_INPUT_CONFIG}}",
        "task_cluster_def": "{{.TASK_CLUSTER_DEFINE}}",
        "allocated_ports": "{{.ALLOCATED_PORTS}}"
      }
  deployTemplates:
  - name: secretflow
    replicas: 1
    spec:
      containers:
      - args:
        - -c
        - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
        command:
        - sh
        configVolumeMounts:
        - mountPath: /root/kuscia/task-config.conf
          subPath: task-config.conf
        name: secretflow
        ports:
        - name: spu
          port: 20000
          protocol: GRPC
          scope: Cluster
        - name: fed
          port: 20001
          protocol: GRPC
          scope: Cluster
        - name: global
          port: 20002
          protocol: GRPC
          scope: Domain
        - name: node-manager
          port: 20003
          protocol: GRPC
          scope: Local
        - name: object-manager
          port: 20004
          protocol: GRPC
          scope: Local
        - name: client-server
          port: 20005
          protocol: GRPC
          scope: Local
        workingDir: /root
      restartPolicy: Never
  image:
    id: abc
    name: harbor.com/secretflow/secretflow-lite-anolis8
    sign: abc
    tag: 1.7.0b0

zimu-yuxi commented 2 months ago

Remove the pullPolicy: remote setting, restart the container, and try again.

magic-hya commented 2 months ago

After removing the setting:

    # Agent image configuration
    image:
      defaultRegistry: "harbor"
      registries:
        - name: "harbor"
          endpoint: "harbor.com/secretflow"
          username: "admin"
          password: "Harbor12345"

Re-applied the configuration:

kubectl apply -f configmap_lite_alice.yaml
kubectl apply -f configmap_lite_bob.yaml

Deleted the existing pods:

kubectl delete pod kuscia-lite-alice-7ffc99c87d-6d96b -n kuscia
kubectl delete pod kuscia-lite-bob-78c7d58487-5gkxj -n kuscia
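
Deleting pods by name works, but the generated suffixes change on every restart. A minimal sketch of an alternative follows, assuming the lite nodes are managed by Deployments named kuscia-lite-alice and kuscia-lite-bob (inferred from the pod names above, not confirmed in this thread). `deployment_of_pod` and `restart_cmd` are hypothetical helper names; the script only prints the `kubectl rollout restart` commands so they can be reviewed before running.

```shell
#!/bin/sh
# Sketch only. Assumes each lite pod belongs to a Deployment whose name is the
# pod name minus the two generated suffixes (ReplicaSet hash and pod ID).

# Derive "kuscia-lite-alice" from "kuscia-lite-alice-7ffc99c87d-6d96b".
deployment_of_pod() {
  without_pod_id=${1%-*}                 # drop the trailing pod ID, e.g. "-6d96b"
  printf '%s\n' "${without_pod_id%-*}"   # drop the ReplicaSet hash, e.g. "-7ffc99c87d"
}

# Print (rather than run) the restart command for the pod's Deployment.
restart_cmd() {
  printf 'kubectl rollout restart deployment/%s -n %s\n' "$(deployment_of_pod "$1")" "$2"
}

restart_cmd kuscia-lite-alice-7ffc99c87d-6d96b kuscia
restart_cmd kuscia-lite-bob-78c7d58487-5gkxj kuscia
```

If the printed commands look right, they can be piped to `sh`. The rollout restart picks up the re-applied ConfigMap the same way the manual pod deletion does.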

After launching a task, the error persists:

$ kubectl logs secretflow-task-20240905164634-single-psi-0 -n alice
Error from server: Get "https://192.168.30.173:10250/containerLogs/alice/secretflow-task-20240905164634-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.30.173:10250, code 502: 502 Bad Gateway

zimu-yuxi commented 2 months ago

Inside the kuscia container, try: kuscia image pull harbor.com/secretflow/secretflow-lite-anolis8 --creds admin: Harbor12345

magic-hya commented 2 months ago

$ kuscia image pull harbor.com/secretflow/secretflow-lite-anolis8 --creds admin: Harbor12345
Error: unknown flag: --creds
unknown flag: --creds

It looks like the command is wrong.

zimu-yuxi commented 2 months ago

The format may be wrong. Use this form as a reference: kuscia image pull --creds username:password image:tag

magic-hya commented 2 months ago

The command does not seem to have a --creds flag:

$ kuscia image pull --creds admin:Harbor12345 harbor.com/secretflow/kuscia-secretflow:v1
Error: unknown flag: --creds
unknown flag: --creds
$ kuscia image pull --help
Manage images

Usage:
  kuscia image [command]

Available Commands:
  builtin     Load a built-in image
  load        Load an image from a tar archive or STDIN

Flags:
  -h, --help           help for image
      --store string   kuscia image storage directory (default "/root/.kuscia/var/images")

Use "kuscia image [command] --help" for more information about a command.
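
The help output above lists only `builtin` and `load`, so this build has no `pull` subcommand. A possible workaround, not confirmed anywhere in this thread, is to stream an image the host has already pulled into the pod through `kuscia image load`. The sketch below assumes that `load` reads a tar stream from STDIN when no file is given (based only on its one-line description above) and uses a placeholder pod name. `load_cmd` is a hypothetical helper that prints the pipeline instead of executing it.

```shell
#!/bin/sh
# Sketch only. Assumes `kuscia image load` accepts a tar stream on STDIN;
# the thread shows only the description "Load an image from a tar archive or STDIN".

IMAGE="harbor.com/secretflow/secretflow-lite-anolis8:1.7.0b0"
POD="${POD:-kuscia-lite-alice-xxxxx}"   # placeholder; substitute the current lite pod name
NAMESPACE="kuscia"

# Print the pipeline: export the image on the host, import it inside the pod.
load_cmd() {
  printf 'docker save %s | kubectl exec -i %s -n %s -- kuscia image load\n' "$1" "$2" "$3"
}

load_cmd "$IMAGE" "$POD" "$NAMESPACE"
```

Check `kuscia image load --help` inside the pod before running the printed pipeline, since the exact flags of `load` are not shown in this thread.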

zimu-yuxi commented 2 months ago

Run kuscia -v to check the kuscia version.

magic-hya commented 2 months ago

I am using the RunP deployment mode, with this image:

secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow

$ kuscia -v
kuscia version 6994ca0

yushiqie commented 2 months ago

Please use the correct kuscia 0.11.0b0 image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.11.0b0
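
The two image references in this exchange differ in both repository and tag. As a small illustration, the sketch below extracts the tag from an image reference and falls back to `latest` when none is given, which is Docker's default for untagged references. Whether that default explains the `6994ca0` version string is not established in this thread. `image_tag` and `list_pod_images` are hypothetical helper names, and digest references (`@sha256:...`) are not handled.

```shell
#!/bin/sh
# Hypothetical helper: report the tag an image reference resolves to.
image_tag() {
  name=${1##*/}                    # last path component, so a registry port is not read as a tag
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   printf 'latest\n' ;;      # Docker's default when no tag is given
  esac
}

image_tag "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow"
image_tag "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.11.0b0"

# Defined but not called here: list the image each pod in the kuscia
# namespace was started from, to confirm which one is actually deployed.
list_pod_images() {
  kubectl get pods -n kuscia \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
}
```

Calling `list_pod_images` against the cluster shows whether the lite pods run a pinned tag or an untagged reference.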

magic-hya commented 2 months ago

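I am using RunP mode, and this is the image the official documentation specifies:

The following uses physical-machine and K8s environments as examples to describe the RunP-based deployment process.

Deploying on a physical machine
For the complete procedure, see [Deploying a centralized cluster across multiple machines](https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/Docker_deployment_kuscia/deploy_master_lite_cn) and [Deploying a peer-to-peer cluster across multiple machines](https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/Docker_deployment_kuscia/deploy_p2p_cn).

Deploying with RunP differs in the following way:

Use the kuscia-secretflow image.
export KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow

github-actions[bot] commented 1 month ago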

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.