secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0
73 stars 52 forks

Kuscia deployed on k8s in centralized mode: test job stays in Running #344

Open magic-hya opened 4 months ago

magic-hya commented 4 months ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

CentOS Linux 7

Kuscia Version

latest

Deployment

k8s

deployment Version

k8s 1.22.2

App Running type

secretflow

App Running version

latest

Configuration file used to run kuscia.

Deployed following the official configuration for centralized Kuscia on k8s.

What happened and what you expected to happen.

## The job is submitted successfully
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# scripts/user/create_example_job.sh
/home/kuscia /home/kuscia
With JOB_EXAMPLE=PSI, job via APP_IMAGE=secretflow-image creating ...
kusciajob.kuscia.secretflow/secretflow-task-20240620111611 created
Job 'secretflow-task-20240620111611' created successfully. You can use the following command to display job status:
  kubectl get kj -n cross-domain

## The job stays in the Running phase
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl get kj -n cross-domain
NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240620111611   12s                          11s                 Running

Kuscia log output.

## Fetching logs from the alice and bob pods returns errors
[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl logs secretflow-task-20240620111611-single-psi-0 -n alice
Error from server: Get "https://192.168.62.11:10250/containerLogs/alice/secretflow-task-20240620111611-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.62.11:10250, code 502: 502 Bad Gateway

[root@kuscia-master-76c5b5bc7b-s84k8 kuscia]# kubectl logs secretflow-task-20240620111611-single-psi-0 -n bob
Error from server: Get "https://192.168.62.12:10250/containerLogs/bob/secretflow-task-20240620111611-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.62.12:10250, code 502: 502 Bad Gateway

magic-hya commented 3 months ago

For the demo flow, I set the protocol to NOTLS and Kuscia deployed successfully. I then entered the master pod and ran the scripts to create the data:

scripts/deploy/create_domaindata_alice_table.sh alice

curl -X POST 'http://127.0.0.1:8082/api/v1/domaindatagrant/create' \
     --header 'Content-Type: application/json' \
     -d '{ "grant_domain": "bob",
           "description": {"domaindatagrant":"alice-bob"},
           "domain_id": "alice",
           "domaindata_id": "alice-table"
     }'

scripts/deploy/create_domaindata_bob_table.sh bob

curl -X POST 'http://127.0.0.1:8082/api/v1/domaindatagrant/create' \
     --header 'Content-Type: application/json' \
     -d '{ "grant_domain": "alice",
           "description": {"domaindatagrant":"bob-alice"},
           "domain_id": "bob",
           "domaindata_id": "bob-table"
     }'

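Running the create-job command then fails. The command below is taken from the documentation I consulted and the reference script source: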

curl -k -X POST 'https://localhost:8082/api/v1/job/create' \
 --header 'Content-Type: application/json' \
 -d '{
  "job_id": "job-alice-bob-001",
  "initiator": "alice",
  "max_parallelism": 2,
  "tasks": [
    {
      "task_id": "job-psi",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-psi",
      "dependencies": [],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"psi\",\"version\":\"0.0.5\",\"attr_paths\":[\"protocol\",\"sort_result\",\"allow_duplicate_keys\",\"allow_duplicate_keys/yes/join_type\",\"allow_duplicate_keys/yes/join_type/left_join/left_side\",\"input/receiver_input/key\",\"input/sender_input/key\"],\"attrs\":[{\"s\":\"PROTOCOL_ECDH\"},{\"b\":true},{\"s\":\"yes\"},{\"s\":\"left_join\"},{\"ss\":[\"alice\"]},{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]}]},\"sf_input_ids\":[\"alice-table\",\"bob-table\"],\"sf_output_ids\":[\"psi-output\"],\"sf_output_uris\":[\"psi-output.csv\"]}",
      "priority": 100
    },
    {
      "task_id": "job-split",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-split",
      "dependencies": [
        "job-psi"
      ],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"train_test_split\",\"version\":\"0.0.1\",\"attr_paths\":[\"train_size\",\"test_size\",\"random_state\",\"shuffle\"],\"attrs\":[{\"f\":0.75},{\"f\":0.25},{\"i64\":1234},{\"b\":true}]},\"sf_output_uris\":[\"train-dataset.csv\",\"test-dataset.csv\"],\"sf_output_ids\":[\"train-dataset\",\"test-dataset\"],\"sf_input_ids\":[\"psi-output\"]}",
      "priority": 100
    }
  ]
}'

It fails with the following error:

curl: (35) OpenSSL/3.0.12: error:0A00010B:SSL routines::wrong version number

I don't know how to resolve this.

aokaokd commented 3 months ago

With NOTLS, just call the API over plain http.
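
A minimal sketch of that advice, for illustration only: the helper names api_scheme and job_create_url are hypothetical, and treating every protocol other than NOTLS as https is an assumption. The host, port, and path are the ones used earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the Kuscia protocol setting to a URL scheme.
# NOTLS means plain http; any other value is assumed here to need https.
api_scheme() {
  case "$1" in
    NOTLS) echo "http" ;;
    *)     echo "https" ;;
  esac
}

# Build the job-creation URL (host, port, and path as posted in this thread).
job_create_url() {
  echo "$(api_scheme "$1")://localhost:8082/api/v1/job/create"
}

# Print the URL for a NOTLS deployment; pass it to curl with the same
# headers and JSON body as in the failing command, without the -k flag.
job_create_url NOTLS
```

Sending an https request to a port that serves plain http is a typical way to get the "wrong version number" error from OpenSSL, which matches the curl output reported above.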

magic-hya commented 3 months ago

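About the image registry policy: the documentation says pullPolicy is not supported, so how is the image referenced by an AppImage obtained? I configured a private registry as shown below, but the error in the logs shows it is still using local: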

"Failed to inspect image \"harbor.com/secretflow/secretflow-lite-anolis8:1.6.1b0\":
        failed to get image \"harbor.com/secretflow/secretflow-lite-anolis8:1.6.1b0\"
        manifest, detail-> image \"harbor.com/secretflow/secretflow-lite-anolis8:1.6.1b0\"
        not exist in local repository"'

Image configuration in the alice and bob ConfigMaps:

    image:
      pullPolicy: remote
      defaultRegistry: "harbor"
      registries:
        - name: "harbor"
          endpoint: "harbor.com/secretflow"
          username: "admin"
          password: "Harbor12345"

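When set to local, images must be imported into Kuscia manually; if an image has not been imported, the task fails to start. Because local mode does not pull remote images, it is more secure but less convenient, and users can choose according to their business scenario.

How do I manually import an image into Kuscia?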

magic-hya commented 3 months ago

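The image import script is outdated: in a k8s deployment it cannot generate the correct container name. After I modified the script, running docker exec -it "${KUSCIA_MASTER_CONTAINER_NAME}" kubectl apply -f "${APP_IMAGE_TEMP_FILE}" || exit 1 still fails with the error: The connection to the server localhost:8080 was refused - did you specify the right host or port? This is where I am currently stuck.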
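
One common cause of the localhost:8080 message is that kubectl finds no kubeconfig and falls back to its default server address. Whether that is the cause here is not confirmed in this thread. The sketch below uses a hypothetical helper, has_kubeconfig, that checks only the KUBECONFIG variable and the default ~/.kube/config path; it does not cover in-cluster service account configuration or an explicit --kubeconfig flag.

```shell
#!/usr/bin/env bash
# Sketch: report whether a kubeconfig file is visible to kubectl.
# Only KUBECONFIG and the default ~/.kube/config path are checked.
has_kubeconfig() {
  local entry
  local IFS=':'
  # KUBECONFIG may hold a colon-separated list of files.
  for entry in ${KUBECONFIG:-}; do
    [ -f "$entry" ] && return 0
  done
  [ -f "${HOME}/.kube/config" ]
}

if has_kubeconfig; then
  echo "kubeconfig found"
else
  echo "no kubeconfig found: kubectl may fall back to localhost:8080"
fi
```

Running this in the same shell or container where the kubectl apply call fails shows whether a missing kubeconfig is a plausible explanation.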

zimu-yuxi commented 3 months ago

Is this k8s runp mode?

magic-hya commented 3 months ago

Yes. I also left a comment under another issue about the same problem.

aokaokd commented 3 months ago

Hi, which script are you using?

magic-hya commented 3 months ago

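I followed this document. The register_app_image.sh script does not work for Kuscia deployed on k8s. I modified it based on the script on the main branch, removing the -u option and adding the -c option, and then ran into this problem.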

aokaokd commented 3 months ago

The documentation on the official site is somewhat outdated. We will check this internally.

aokaokd commented 3 months ago

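This was adjusted in the latest release. Please check your Kuscia image version and use the script files that match it. Also, do not use the latest image; specify an explicit version number.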

magic-hya commented 3 months ago

Then I will test with the latest version 9.

Chrisdehe commented 2 months ago

@magic-hya Hey, did testing with the latest version succeed? Are there any other usability problems at the moment? Suggestions are welcome.

magic-hya commented 2 months ago

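@magic-hya Hey, did testing with the latest version succeed? Are there any other usability problems at the moment? Suggestions are welcome.

Hi. To meet project needs, I have for now migrated our docker environment onto k8s to cover the project's basic functionality. I will test the latest version later, probably within 1-2 weeks.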

magic-hya commented 2 months ago

Deployed with Kuscia v0.11.0b0. An error occurs when the AppImage image is pulled remotely:

[root@kuscia-master-55bffb8764-l7nn5 kuscia]# kubectl logs secretflow-task-20240830112925-single-psi-0 -n bob
Error from server: Get "https://192.168.170.47:10250/containerLogs/bob/secretflow-task-20240830112925-single-psi-0/secretflow": proxy error from 0.0.0.0:6443 while dialing 192.168.170.47:10250, code 502: 502 Bad Gateway

Does this version still not support pullPolicy: remote?