secretflow / kuscia

Kuscia(Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0
73 stars, 53 forks

The start_secretpad.sh script only supports the case where the master, alice, and bob nodes are all on one server #240

Closed linushio closed 8 months ago

linushio commented 8 months ago

# copy kuscia api lite:alice client certs
copy_kuscia_api_lite_client_certs ${ALICE_DOMAIN} ${volume_path}
# copy kuscia api lite:bob client certs
copy_kuscia_api_lite_client_certs ${BOB_DOMAIN} ${volume_path}

The certificates are fetched from the current server:

function copy_kuscia_api_lite_client_certs() {
  local domain_id=$1
  local volume_path=$2
  local IMAGE=$SECRETPAD_IMAGE
  local domain_ctr=${CTR_PREFIX}-lite-${domain_id}
  # generate client certs
  docker exec -it ${domain_ctr} sh scripts/deploy/init_kusciaapi_client_certs.sh
  # copy result
  tmp_path=${volume_path}/temps/certs/${domain_id}
  mkdir -p ${tmp_path}
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/ca.crt ${tmp_path}/ca.crt
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/kusciaapi-client.crt ${tmp_path}/client.crt
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/kusciaapi-client.key ${tmp_path}/client.pem
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/token ${tmp_path}/token
  docker run -d --rm --name ${CTR_PREFIX}-dummy --volume=${volume_path}/secretpad/config/certs:/tmp/temp $IMAGE tail -f /dev/null >/dev/null 2>&1
  docker cp -a ${tmp_path} ${CTR_PREFIX}-dummy:/tmp/temp/
  docker rm -f ${CTR_PREFIX}-dummy >/dev/null 2>&1
  rm -rf ${volume_path}/temp
  log "copy kuscia api client lite :${domain_id} certs to web server container done"
}

linushio commented 8 months ago

After modifying the script to fetch the certificates from the other servers, secretpad deployed successfully, but the PSI algorithm reports an error. Here is the container log from the alice node:

":"b2010715-6bbf-4001-90e6-04014fc4ef20"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-02T05:58:00Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-02T05:58:00Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://bd550f8449efb75a8b893aef3f0aad9774626959f5216af00343e6d0aebba8b3","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://bd550f8449efb75a8b893aef3f0aad9774626959f5216af00343e6d0aebba8b3","exitCode":1,"finishedAt":"2024-03-02T05:58:00Z","message":"WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.\n2024-03-02 05:57:56,466|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='jtef-wfpdsawa-node-3-0-global.alice.svc', ray_node_manager_port=26768, ray_object_manager_port=26769, ray_client_server_port=26770, ray_worker_ports=[], ray_gcs_port=26767)\n2024-03-02 05:57:56,466|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at jtef-wfpdsawa-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=jtef-wfpdsawa-node-3-0-global.alice.svc --port=26767 --node-manager-port=26768 --object-manager-port=26769 --ray-client-server-port=26770\n2024-03-02 05:58:00,032|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-02
05:57:57,002\tINFO usage_lib.py:490 -- Usage stats collection is disabled.\n2024-03-02 05:57:57,002\tINFO scripts.py:702 -- Local node IP: jtef-wfpdsawa-node-3-0-global.alice.svc\n2024-03-02 05:57:59,898\tSUCC scripts.py:739 -- --------------------\n2024-03-02 05:57:59,898\tSUCC scripts.py:740 -- Ray runtime started.\n2024-03-02 05:57:59,898\tSUCC scripts.py:741 -- --------------------\n2024-03-02 05:57:59,898\tINFO scripts.py:743 -- Next steps\n2024-03-02 05:57:59,899\tINFO scripts.py:744 -- To connect to this Ray runtime from another node, run\n2024-03-02 05:57:59,899\tINFO scripts.py:747 -- ray start --address='jtef-wfpdsawa-node-3-0-global.alice.svc:26767'\n2024-03-02 05:57:59,899\tINFO scripts.py:763 -- Alternatively, use the following Python code:\n2024-03-02 05:57:59,899\tINFO scripts.py:765 -- import ray\n2024-03-02 05:57:59,899\tINFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='jtef-wfpdsawa-node-3-0-global.alice.svc')\n2024-03-02 05:57:59,899\tINFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to\n2024-03-02 05:57:59,899\tINFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following\n2024-03-02 05:57:59,899\tINFO scripts.py:789 -- Python code:\n2024-03-02 05:57:59,899\tINFO scripts.py:791 -- import ray\n2024-03-02 05:57:59,899\tINFO scripts.py:792 -- ray.init(address='ray://\u003chead_node_ip_address\u003e:26770')\n2024-03-02 05:57:59,899\tINFO scripts.py:801 -- To see the status of the cluster, use\n2024-03-02 05:57:59,899\tINFO scripts.py:802 -- ray status\n2024-03-02 05:57:59,899\tINFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.\n2024-03-02 05:57:59,899\tINFO scripts.py:820 -- To terminate the Ray runtime, run\n2024-03-02 05:57:59,899\tINFO scripts.py:821 -- ray stop\n\n2024-03-02 05:58:00,033|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at 
jtef-wfpdsawa-node-3-0-global.alice.svc.\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.8/runpy.py\", line 194, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/local/lib/python3.8/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 294, in \u003cmodule\u003e\n main()\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 261, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 92, in preprocess_sf_node_eval_param\n comp_def = get_comp_def(param.domain, param.name, param.version)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/component/entry.py\", line 104, in get_comp_def\n assert key in COMP_MAP\nAssertionError\n","reason":"Error","startedAt":"2024-03-02T05:57:54Z"}}}]}}

On the alice and bob servers, I added the following script to each (shown here for alice):

docker exec -it root-kuscia-lite-alice sh scripts/deploy/init_kusciaapi_client_certs.sh
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/kusciaapi-client.crt ./certs/client.crt
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/kusciaapi-client.key ./certs/client.pem
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/token ./certs/token
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/ca.crt ./certs/ca.crt
sudo scp ./certs/* app@192.168.50.192:/home/app/project/secretpad/temps/certs/alice
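The per-host steps above could be parameterized by domain, so the same snippet serves alice and bob. The following is only a sketch: the function name `cert_collect_plan` and its dry-run style (it prints the commands instead of running them) are assumptions for illustration, not part of start_secretpad.sh. The container name and certificate paths are taken from the commands quoted above. The `mkdir -p ./certs` line is added here, and `-it` and `sudo` are dropped on the assumption that the steps run non-interactively with sufficient permissions.

```shell
#!/bin/sh
# Hypothetical helper (not part of start_secretpad.sh): print the commands that
# collect the Kuscia API client certs for one lite domain on its own host and
# copy them to the SecretPad host. Nothing is executed; review, then pipe to sh.
cert_collect_plan() {
  domain_id=$1   # e.g. alice or bob
  dest=$2        # scp target, e.g. user@host:/path/to/secretpad/temps/certs/alice
  ctr="root-kuscia-lite-${domain_id}"      # container name as used in the thread
  cert_root="/home/kuscia/var/certs"       # cert directory as used in the thread

  echo "docker exec ${ctr} sh scripts/deploy/init_kusciaapi_client_certs.sh"
  echo "mkdir -p ./certs"
  echo "docker cp ${ctr}:${cert_root}/ca.crt ./certs/ca.crt"
  echo "docker cp ${ctr}:${cert_root}/kusciaapi-client.crt ./certs/client.crt"
  echo "docker cp ${ctr}:${cert_root}/kusciaapi-client.key ./certs/client.pem"
  echo "docker cp ${ctr}:${cert_root}/token ./certs/token"
  echo "scp ./certs/* ${dest}"
}
```

On the bob host, for example, `cert_collect_plan bob app@192.168.50.192:/home/app/project/secretpad/temps/certs/bob` prints the plan; appending `| sh` would run it.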

linushio commented 8 months ago

Below is the master container log. Does this mean that alice's certificate does not match?

2024-03-02 15:05:23.603 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:23.605 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:24.622 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:24.622 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:25.641 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:25.641 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:26.661 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:26.661 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:27.680 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:27.680 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
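Because the same two errors repeat about once per second, it can help to collapse the log into distinct messages with counts before reading it. This is a minimal sketch; the function name `summarize_errors` is hypothetical, and it assumes one log entry per line in the `date time LEVEL file:line message` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a kuscia log on stdin and print each distinct
# ERROR message with its number of occurrences, most frequent first.
summarize_errors() {
  grep ' ERROR ' \
    | sed -E 's/^[0-9-]+ [0-9:.]+ ERROR //' \
    | sort | uniq -c | sort -rn
}
```

Usage would look like `summarize_errors < kuscia.log`, where `kuscia.log` stands for a copy of the master container's log file.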

Chrisdehe commented 8 months ago

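Judging from the error, the component is missing some configuration. Which task are you currently running? You can also check the kuscia FAQ to see whether a matching issue is listed there.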

linushio commented 8 months ago

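> Judging from the error, the component is missing some configuration. Which task are you currently running? You can also check the kuscia FAQ to see whether a matching issue is listed there.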

(image) The error occurs when running PSI in secretpad. The kuscia deployment is version 0.5.0b0 in centralized mode, and secretpad is latest.

linushio commented 8 months ago

Part of the master error log:

2024-03-04 11:12:39.118 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:12:40.138 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:12:40.139 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:12:41.160 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:12:41.160 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:24.563 ERROR service/domaindata_grant.go:125 Query DomainDataGrant failed, error:domaindatagrants.kuscia.secretflow "alice-table-bob" not found
2024-03-04 11:13:24.637 ERROR service/domaindata_grant.go:125 Query DomainDataGrant failed, error:domaindatagrants.kuscia.secretflow "bob-table-alice" not found
2024-03-04 11:13:46.250 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:46.250 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:47.267 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:47.267 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:48.288 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:48.288 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:49.307 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:49.307 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:50.326 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:50.326 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:50.459 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on
kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again 2024-03-04 11:13:52.227 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again 2024-03-04 11:13:53.388 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again 2024-03-04 11:13:53.534 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again 2024-03-04 11:13:59.315 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again 2024-03-04 11:14:55.455 ERROR controller/regitser_node.go:256 public not match 2024-03-04 11:14:55.455 ERROR controller/regitser_node.go:222 domain alice register failed(token match error) 2024-03-04 11:14:56.475 ERROR controller/regitser_node.go:256 public not match 2024-03-04 11:14:56.475 ERROR controller/regitser_node.go:222 domain alice register failed(token match error) 
2024-03-04 11:14:57.491 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:14:57.491 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:14:58.510 ERROR controller/regitser_node.go:256 public not match

The alice error log:

2024-03-04 09:58:55.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:10.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:25.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:40.556 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers.
reset reason: connection failure, transport failure reason: delayed connect error: 111 2024-03-04 11:13:58.990 INFO status/status_manager.go:625 Patch status for pod "snha-wocixsij-node-3-0_alice(a4e041b3-40d6-4335-9f54-15b4bc478c9b)", patch={"metadata":{"uid":"a4e041b3-40d6-4335-9f54-15b4bc478c9b"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T03:13:58Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T03:13:58Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://0061342aa79e83b37615af31472e1243a275babcca5fd817cf93bf9de3461871","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://0061342aa79e83b37615af31472e1243a275babcca5fd817cf93bf9de3461871","exitCode":1,"finishedAt":"2024-03-04T03:13:58Z","message":"WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.\n2024-03-04 03:13:55,390|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='snha-wocixsij-node-3-0-global.alice.svc', ray_node_manager_port=24086, ray_object_manager_port=24087, ray_client_server_port=24088, ray_worker_ports=[], ray_gcs_port=24091)\n2024-03-04 03:13:55,390|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at snha-wocixsij-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 
--node-ip-address=snha-wocixsij-node-3-0-global.alice.svc --port=24091 --node-manager-port=24086 --object-manager-port=24087 --ray-client-server-port=24088\n2024-03-04 03:13:57,854|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 03:13:55,924\tINFO usage_lib.py:490 -- Usage stats collection is disabled.\n2024-03-04 03:13:55,924\tINFO scripts.py:702 -- Local node IP: snha-wocixsij-node-3-0-global.alice.svc\n2024-03-04 03:13:57,695\tSUCC scripts.py:739 -- --------------------\n2024-03-04 03:13:57,695\tSUCC scripts.py:740 -- Ray runtime started.\n2024-03-04 03:13:57,695\tSUCC scripts.py:741 -- --------------------\n2024-03-04 03:13:57,695\tINFO scripts.py:743 -- Next steps\n2024-03-04 03:13:57,695\tINFO scripts.py:744 -- To connect to this Ray runtime from another node, run\n2024-03-04 03:13:57,695\tINFO scripts.py:747 -- ray start --address='snha-wocixsij-node-3-0-global.alice.svc:24091'\n2024-03-04 03:13:57,695\tINFO scripts.py:763 -- Alternatively, use the following Python code:\n2024-03-04 03:13:57,695\tINFO scripts.py:765 -- import ray\n2024-03-04 03:13:57,695\tINFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='snha-wocixsij-node-3-0-global.alice.svc')\n2024-03-04 03:13:57,695\tINFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to\n2024-03-04 03:13:57,695\tINFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following\n2024-03-04 03:13:57,695\tINFO scripts.py:789 -- Python code:\n2024-03-04 03:13:57,695\tINFO scripts.py:791 -- import ray\n2024-03-04 03:13:57,695\tINFO scripts.py:792 -- ray.init(address='ray://\u003chead_node_ip_address\u003e:24088')\n2024-03-04 03:13:57,695\tINFO scripts.py:801 -- To see the status of the cluster, use\n2024-03-04 03:13:57,695\tINFO scripts.py:802 -- ray status\n2024-03-04 03:13:57,695\tINFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.\n2024-03-04 03:13:57,695\tINFO scripts.py:820 -- 
To terminate the Ray runtime, run\n2024-03-04 03:13:57,695\tINFO scripts.py:821 -- ray stop\n\n2024-03-04 03:13:57,854|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at snha-wocixsij-node-3-0-global.alice.svc.\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.8/runpy.py\", line 194, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/local/lib/python3.8/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 294, in \u003cmodule\u003e\n main()\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 261, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 92, in preprocess_sf_node_eval_param\n comp_def = get_comp_def(param.domain, param.name, param.version)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/component/entry.py\", line 104, in get_comp_def\n assert key in COMP_MAP\nAssertionError\n","reason":"Error","startedAt":"2024-03-04T03:13:53Z"}}}]}}

The bob error log:

[root@root-kuscia-lite-bob kuscia]# cat /home/kuscia/var/logs/kuscia.log | grep -i error
2024-03-02 15:26:55.797 ERROR controller/handshake.go:792 Handshake to master fail, return error:invalid public key, key is empty
2024-03-02 15:27:40.843 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http
2024-03-02 18:17:08.242 ERROR xds/xds.go:448
unknown cluster: bob-to-alice-http 2024-03-04 09:32:18.871 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http 2024-03-04 09:59:42.637 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http 2024-03-04 11:14:00.905 INFO status/status_manager.go:625 Patch status for pod "snha-wocixsij-node-3-0_bob(34ae7669-f978-42af-99a4-c0b15b1d9678)", patch={"metadata":{"uid":"34ae7669-f978-42af-99a4-c0b15b1d9678"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T03:14:00Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T03:14:00Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://dc21842c202f1d40f516d59e6a1c2e3f730c2ea46a51686e55481c13c109172a","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://dc21842c202f1d40f516d59e6a1c2e3f730c2ea46a51686e55481c13c109172a","exitCode":15,"finishedAt":"2024-03-04T03:13:59Z","message":"WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.\n2024-03-04 03:13:57,245|bob|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='snha-wocixsij-node-3-0-global.bob.svc', ray_node_manager_port=21394, ray_object_manager_port=21395, ray_client_server_port=21390, ray_worker_ports=[], ray_gcs_port=21393)\n2024-03-04 03:13:57,246|bob|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at snha-wocixsij-node-3-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=40 ray start --head 
--include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=snha-wocixsij-node-3-0-global.bob.svc --port=21393 --node-manager-port=21394 --object-manager-port=21395 --ray-client-server-port=21390\n","reason":"Error","startedAt":"2024-03-04T03:13:53Z"}}}],"phase":"Failed","podIP":null,"podIPs":null}}

linushio commented 8 months ago

The test job at the bottom of the deployment documentation runs successfully: (image)

Chrisdehe commented 8 months ago

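The error messages indicate that the connection failed because of a problem with bob's certificate. Please check the certificate; you can re-issue it.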

linushio commented 8 months ago

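> The error messages indicate that the connection failed because of a problem with bob's certificate. Please check the certificate; you can re-issue it.

Which step exactly does this refer to? Also, how can the secretpad problem described above be resolved? I have redeployed several times and still hit the same problem. I followed the documented procedure strictly, and nothing abnormal happened along the way.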

gshilei commented 8 months ago

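> The error messages indicate that the connection failed because of a problem with bob's certificate. Please check the certificate; you can re-issue it.

> Which step exactly does this refer to? Also, how can the secretpad problem described above be resolved? I have redeployed several times and still hit the same problem. I followed the documented procedure strictly, and nothing abnormal happened along the way.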

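OK. The authorization between nodes can be checked with the command kubectl get cdr. If the last column, READY, is True for a row in the output, the authorization for that route is fine.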
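The READY check can also be scripted so that only problem routes are printed. A minimal sketch, assuming the first line of the `kubectl get cdr` output is a header and READY is the last column as described; the function name `not_ready_routes` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get cdr` output on stdin and print the
# name (first column) of every route whose last column (READY) is not "True".
# The header line is skipped; empty output means every route is ready.
not_ready_routes() {
  awk 'NR > 1 && $NF != "True" { print $1 }'
}
```

Usage would be `kubectl get cdr | not_ready_routes`, run wherever `kubectl get cdr` itself is run.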

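Also, the example Job runs successfully on your side, but the Job dispatched from the platform fails, is that right? If so, please go to the component engine log directory and paste the entire log. See: https://www.secretflow.org.cn/zh-CN/docs/kuscia/main/deployment/logdescription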

An example path: /home/kuscia/var/stdout/pods/alice_xxxx/xxx/*.log
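To find which engine log files actually contain a Python traceback, the directory can be searched rather than opened file by file. This is a sketch only: the function name `logs_with_traceback` is hypothetical, and the log root passed to it is assumed to be the pods directory from the example path above.

```shell
#!/bin/sh
# Hypothetical helper: print every *.log file under the given directory that
# contains a Python traceback header.
logs_with_traceback() {
  log_root=$1   # e.g. /home/kuscia/var/stdout/pods
  find "$log_root" -type f -name '*.log' \
    -exec grep -l 'Traceback (most recent call last)' {} +
}
```

For example, `logs_with_traceback /home/kuscia/var/stdout/pods` inside the lite container would list the files worth pasting in full.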

linushio commented 8 months ago

The authorization is normal:

alice-kuscia-system   alice   kuscia-system                    Token   True
bob-kuscia-system     bob     kuscia-system                    Token   True
alice-bob             alice   bob             192.168.50.100   Token   True
bob-alice             bob     alice           192.168.50.158   Token   True

The logs are as follows.

alice log:

2024-03-04T14:16:45.333370086+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-03-04T14:16:45.6195683+08:00 stdout F 2024-03-04 06:16:45,619|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='kcnp-uhkwkbhr-node-3-0-global.alice.svc', ray_node_manager_port=29827, ray_object_manager_port=29828, ray_client_server_port=29829, ray_worker_ports=[], ray_gcs_port=29826)
2024-03-04T14:16:45.619598292+08:00 stdout F 2024-03-04 06:16:45,619|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at kcnp-uhkwkbhr-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=kcnp-uhkwkbhr-node-3-0-global.alice.svc --port=29826 --node-manager-port=29827 --object-manager-port=29828 --ray-client-server-port=29829
2024-03-04T14:16:48.084653285+08:00 stdout F 2024-03-04 06:16:48,084|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 06:16:46,156 INFO usage_lib.py:490 -- Usage stats collection is disabled.
2024-03-04T14:16:48.084684076+08:00 stdout F 2024-03-04 06:16:46,156 INFO scripts.py:702 -- Local node IP: kcnp-uhkwkbhr-node-3-0-global.alice.svc
2024-03-04T14:16:48.084688078+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:739 -- --------------------
2024-03-04T14:16:48.084691071+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:740 -- Ray runtime started.
2024-03-04T14:16:48.08469391+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:741 -- --------------------
2024-03-04T14:16:48.084696986+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:743 -- Next steps
2024-03-04T14:16:48.084699972+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2024-03-04T14:16:48.084703229+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:747 -- ray start --address='kcnp-uhkwkbhr-node-3-0-global.alice.svc:29826'
2024-03-04T14:16:48.084706877+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:763 -- Alternatively, use the following Python code:
2024-03-04T14:16:48.084709854+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:765 -- import ray
2024-03-04T14:16:48.084713106+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='kcnp-uhkwkbhr-node-3-0-global.alice.svc')
2024-03-04T14:16:48.084715932+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2024-03-04T14:16:48.084718765+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2024-03-04T14:16:48.084721641+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:789 -- Python code:
2024-03-04T14:16:48.084724476+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:791 -- import ray
2024-03-04T14:16:48.084727489+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:29829')
2024-03-04T14:16:48.084730449+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:801 -- To see the status of the cluster, use
2024-03-04T14:16:48.084763215+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:802 -- ray status
2024-03-04T14:16:48.084766432+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.
2024-03-04T14:16:48.084769279+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:820 -- To terminate the Ray runtime, run
2024-03-04T14:16:48.084772154+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:821 -- ray stop
2024-03-04T14:16:48.084774734+08:00 stdout F
2024-03-04T14:16:48.08490735+08:00 stdout F 2024-03-04 06:16:48,084|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at kcnp-uhkwkbhr-node-3-0-global.alice.svc.
2024-03-04T14:16:48.087344183+08:00 stderr F Traceback (most recent call last):
2024-03-04T14:16:48.08736423+08:00 stderr F File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
2024-03-04T14:16:48.087367488+08:00 stderr F return _run_code(code, main_globals, None,
2024-03-04T14:16:48.087370572+08:00 stderr F File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
2024-03-04T14:16:48.087373865+08:00 stderr F exec(code, run_globals)
2024-03-04T14:16:48.087384296+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py", line 294, in <module>
2024-03-04T14:16:48.08738733+08:00 stderr F main()
2024-03-04T14:16:48.087390484+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
2024-03-04T14:16:48.087393276+08:00 stderr F return self.main(*args, **kwargs)
2024-03-04T14:16:48.087396156+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1078, in main
2024-03-04T14:16:48.087399148+08:00 stderr F rv = self.invoke(ctx)
2024-03-04T14:16:48.087404862+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
2024-03-04T14:16:48.087412144+08:00 stderr F return ctx.invoke(self.callback, **ctx.params)
2024-03-04T14:16:48.08741521+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
2024-03-04T14:16:48.087418133+08:00 stderr F return __callback(*args, **kwargs)
2024-03-04T14:16:48.087421002+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py", line 261, in main
2024-03-04T14:16:48.087423939+08:00 stderr F sf_node_eval_param = preprocess_sf_node_eval_param(
2024-03-04T14:16:48.087427476+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py", line 92, in preprocess_sf_node_eval_param
2024-03-04T14:16:48.087430311+08:00 stderr F comp_def = get_comp_def(param.domain, param.name, param.version)
2024-03-04T14:16:48.087433222+08:00 stderr F File "/usr/local/lib/python3.8/site-packages/secretflow/component/entry.py", line 104, in get_comp_def
2024-03-04T14:16:48.087436083+08:00 stderr F assert key in COMP_MAP
2024-03-04T14:16:48.087438998+08:00 stderr F AssertionError

bob log:

2024-03-04T14:16:45.37312329+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-03-04T14:16:45.674617021+08:00 stdout F 2024-03-04 06:16:45,674|bob|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='kcnp-uhkwkbhr-node-3-0-global.bob.svc', ray_node_manager_port=26683, ray_object_manager_port=26684, ray_client_server_port=26685, ray_worker_ports=[], ray_gcs_port=26682)
2024-03-04T14:16:45.674642408+08:00 stdout F 2024-03-04 06:16:45,674|bob|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at kcnp-uhkwkbhr-node-3-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=40 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=kcnp-uhkwkbhr-node-3-0-global.bob.svc --port=26682 --node-manager-port=26683 --object-manager-port=26684 --ray-client-server-port=26685

linushio commented 8 months ago

Also, I found that although the two example jobs launched from the command line have status Succeeded, their logs contain errors as well:

NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240304135549   25m         25m              25m                 Succeeded
secretflow-task-20240304140911   12m         12m              12m                 Succeeded

alice log (bob's side has no error):

2024-03-04T14:09:16.996436339+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-03-04T14:09:17.284753351+08:00 stdout F 2024-03-04 06:09:17,284|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='secretflow-task-20240304140911-single-psi-0-global.alice.svc', ray_node_manager_port=30728, ray_object_manager_port=30729, ray_client_server_port=30730, ray_worker_ports=[], ray_gcs_port=30727)
2024-03-04T14:09:17.284773787+08:00 stdout F 2024-03-04 06:09:17,284|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at secretflow-task-20240304140911-single-psi-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=secretflow-task-20240304140911-single-psi-0-global.alice.svc --port=30727 --node-manager-port=30728 --object-manager-port=30729 --ray-client-server-port=30730
2024-03-04T14:09:19.748214493+08:00 stdout F 2024-03-04 06:09:19,747|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 06:09:17,823 INFO usage_lib.py:490 -- Usage stats collection is disabled.
2024-03-04T14:09:19.748230881+08:00 stdout F 2024-03-04 06:09:17,823 INFO scripts.py:702 -- Local node IP: secretflow-task-20240304140911-single-psi-0-global.alice.svc
2024-03-04T14:09:19.748277722+08:00 stdout F 2024-03-04 06:09:19,590 SUCC scripts.py:739 -- --------------------
2024-03-04T14:09:19.748281432+08:00 stdout F 2024-03-04 06:09:19,590 SUCC scripts.py:740 -- Ray runtime started.
2024-03-04T14:09:19.748284355+08:00 stdout F 2024-03-04 06:09:19,591 SUCC scripts.py:741 -- --------------------
2024-03-04T14:09:19.748290291+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:743 -- Next steps
2024-03-04T14:09:19.748293478+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2024-03-04T14:09:19.748296645+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:747 -- ray start --address='secretflow-task-20240304140911-single-psi-0-global.alice.svc:30727'
2024-03-04T14:09:19.748299942+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:763 -- Alternatively, use the following Python code:
2024-03-04T14:09:19.748303057+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:765 -- import ray
2024-03-04T14:09:19.748306008+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='secretflow-task-20240304140911-single-psi-0-global.alice.svc')
2024-03-04T14:09:19.748310277+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2024-03-04T14:09:19.748313127+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2024-03-04T14:09:19.748316011+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:789 -- Python code:
2024-03-04T14:09:19.748321392+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:791 -- import ray
2024-03-04T14:09:19.748324395+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:30730')
2024-03-04T14:09:19.748327387+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:801 -- To see the status of the cluster, use
2024-03-04T14:09:19.748330436+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:802 -- ray status
2024-03-04T14:09:19.748333371+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:812 -- If connection fails, check
your firewall settings and network configuration. 2024-03-04T14:09:19.748336289+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:820 -- To terminate the Ray runtime, run 2024-03-04T14:09:19.748339709+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:821 -- ray stop 2024-03-04T14:09:19.748342396+08:00 stdout F 2024-03-04T14:09:19.748346587+08:00 stdout F 2024-03-04 06:09:19,748|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at secretflow-task-20240304140911-single-psi-0-global.alice.svc. 2024-03-04T14:09:19.756409381+08:00 stdout F 2024-03-04 06:09:19,756|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment. 2024-03-04T14:09:19.761398447+08:00 stdout F 2024-03-04 06:09:19,761|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment. 2024-03-04T14:09:19.761892317+08:00 stdout F 2024-03-04 06:09:19,761|alice|WARNING|secretflow|entry.py:comp_eval:116| 2024-03-04T14:09:19.761900392+08:00 stdout F -- 2024-03-04T14:09:19.761903323+08:00 stdout F param 2024-03-04T14:09:19.761905626+08:00 stdout F 2024-03-04T14:09:19.761909059+08:00 stdout F domain: "preprocessing" 2024-03-04T14:09:19.761912105+08:00 stdout F name: "psi" 2024-03-04T14:09:19.76191508+08:00 stdout F version: "0.0.1" 2024-03-04T14:09:19.761918205+08:00 stdout F attr_paths: "input/receiver_input/key" 2024-03-04T14:09:19.761920784+08:00 stdout F attr_paths: "input/sender_input/key" 2024-03-04T14:09:19.761939728+08:00 stdout F attr_paths: "protocol" 2024-03-04T14:09:19.761942826+08:00 stdout F attr_paths: "precheck_input" 2024-03-04T14:09:19.761945312+08:00 stdout F attr_paths: "bucket_size" 2024-03-04T14:09:19.761954955+08:00 stdout F attr_paths: "curve_type" 2024-03-04T14:09:19.761957665+08:00 stdout F attrs { 2024-03-04T14:09:19.761960351+08:00 stdout F ss: "id1" 
2024-03-04T14:09:19.761962925+08:00 stdout F } 2024-03-04T14:09:19.761965518+08:00 stdout F attrs { 2024-03-04T14:09:19.761968555+08:00 stdout F ss: "id2" 2024-03-04T14:09:19.761971086+08:00 stdout F } 2024-03-04T14:09:19.761973862+08:00 stdout F attrs { 2024-03-04T14:09:19.761976637+08:00 stdout F s: "ECDH_PSI_2PC" 2024-03-04T14:09:19.761979374+08:00 stdout F } 2024-03-04T14:09:19.761982173+08:00 stdout F attrs { 2024-03-04T14:09:19.761985151+08:00 stdout F b: true 2024-03-04T14:09:19.761987883+08:00 stdout F } 2024-03-04T14:09:19.7619906+08:00 stdout F attrs { 2024-03-04T14:09:19.761993532+08:00 stdout F i64: 1048576 2024-03-04T14:09:19.761996433+08:00 stdout F } 2024-03-04T14:09:19.761999361+08:00 stdout F attrs { 2024-03-04T14:09:19.762001971+08:00 stdout F s: "CURVE_FOURQ" 2024-03-04T14:09:19.762004761+08:00 stdout F } 2024-03-04T14:09:19.762028598+08:00 stdout F inputs { 2024-03-04T14:09:19.762031745+08:00 stdout F name: "alice.csv" 2024-03-04T14:09:19.762034747+08:00 stdout F type: "sf.table.individual" 2024-03-04T14:09:19.762037657+08:00 stdout F meta { 2024-03-04T14:09:19.762040672+08:00 stdout F type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable" 2024-03-04T14:09:19.762046432+08:00 stdout F value: "\n\335\003\022\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\020\377\377\377\377\377\377\377\377\377\001" 
2024-03-04T14:09:19.762051807+08:00 stdout F } 2024-03-04T14:09:19.762054788+08:00 stdout F data_refs { 2024-03-04T14:09:19.7620577+08:00 stdout F uri: "alice.csv" 2024-03-04T14:09:19.762060771+08:00 stdout F party: "alice" 2024-03-04T14:09:19.762063764+08:00 stdout F format: "csv" 2024-03-04T14:09:19.762066759+08:00 stdout F } 2024-03-04T14:09:19.762069683+08:00 stdout F } 2024-03-04T14:09:19.762072709+08:00 stdout F inputs { 2024-03-04T14:09:19.762079232+08:00 stdout F name: "bob.csv" 2024-03-04T14:09:19.762103441+08:00 stdout F type: "sf.table.individual" 2024-03-04T14:09:19.762106588+08:00 stdout F meta { 2024-03-04T14:09:19.762109666+08:00 stdout F type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable" 2024-03-04T14:09:19.76211401+08:00 stdout F value: "\n\227\003\022\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\003int\020\377\377\377\377\377\377\377\377\377\001" 2024-03-04T14:09:19.762116941+08:00 stdout F } 2024-03-04T14:09:19.762119869+08:00 stdout F data_refs { 2024-03-04T14:09:19.762122874+08:00 stdout F uri: "bob.csv" 2024-03-04T14:09:19.762125728+08:00 stdout F party: "bob" 2024-03-04T14:09:19.762128665+08:00 stdout F format: "csv" 2024-03-04T14:09:19.762131625+08:00 stdout F } 2024-03-04T14:09:19.762134535+08:00 stdout F } 2024-03-04T14:09:19.762137382+08:00 stdout F output_uris: "psi-output.csv" 2024-03-04T14:09:19.762140036+08:00 stdout F 2024-03-04T14:09:19.76214269+08:00 stdout F -- 2024-03-04T14:09:19.762145072+08:00 stdout F 
2024-03-04T14:09:19.762153426+08:00 stdout F 2024-03-04 06:09:19,761|alice|WARNING|secretflow|entry.py:comp_eval:117| 2024-03-04T14:09:19.762156348+08:00 stdout F -- 2024-03-04T14:09:19.762158982+08:00 stdout F storage_config 2024-03-04T14:09:19.762161272+08:00 stdout F 2024-03-04T14:09:19.762163971+08:00 stdout F type: "local_fs" 2024-03-04T14:09:19.762166648+08:00 stdout F local_fs { 2024-03-04T14:09:19.762169373+08:00 stdout F wd: "/home/kuscia/var/storage/data" 2024-03-04T14:09:19.762174971+08:00 stdout F } 2024-03-04T14:09:19.762177475+08:00 stdout F 2024-03-04T14:09:19.762180243+08:00 stdout F -- 2024-03-04T14:09:19.762182357+08:00 stdout F 2024-03-04T14:09:19.762184763+08:00 stdout F 2024-03-04 06:09:19,761|alice|WARNING|secretflow|entry.py:comp_eval:118| 2024-03-04T14:09:19.762187237+08:00 stdout F -- 2024-03-04T14:09:19.762189658+08:00 stdout F cluster_config 2024-03-04T14:09:19.762191819+08:00 stdout F 2024-03-04T14:09:19.762194432+08:00 stdout F desc { 2024-03-04T14:09:19.762197043+08:00 stdout F parties: "alice" 2024-03-04T14:09:19.762199605+08:00 stdout F parties: "bob" 2024-03-04T14:09:19.762202179+08:00 stdout F devices { 2024-03-04T14:09:19.762204574+08:00 stdout F name: "spu" 2024-03-04T14:09:19.762207067+08:00 stdout F type: "spu" 2024-03-04T14:09:19.762209565+08:00 stdout F parties: "alice" 2024-03-04T14:09:19.76221212+08:00 stdout F parties: "bob" 2024-03-04T14:09:19.76221533+08:00 stdout F config: "{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}" 2024-03-04T14:09:19.762217957+08:00 stdout F } 2024-03-04T14:09:19.762220584+08:00 stdout F devices { 2024-03-04T14:09:19.762223103+08:00 stdout F name: "heu" 2024-03-04T14:09:19.762225404+08:00 stdout F type: "heu" 2024-03-04T14:09:19.762227642+08:00 stdout F parties: "alice" 
2024-03-04T14:09:19.762229997+08:00 stdout F parties: "bob" 2024-03-04T14:09:19.762232467+08:00 stdout F config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}" 2024-03-04T14:09:19.762234845+08:00 stdout F } 2024-03-04T14:09:19.762237448+08:00 stdout F ray_fed_config { 2024-03-04T14:09:19.762239979+08:00 stdout F cross_silo_comm_backend: "brpc_link" 2024-03-04T14:09:19.762242637+08:00 stdout F } 2024-03-04T14:09:19.762244983+08:00 stdout F } 2024-03-04T14:09:19.762252215+08:00 stdout F public_config { 2024-03-04T14:09:19.762254703+08:00 stdout F ray_fed_config { 2024-03-04T14:09:19.762257325+08:00 stdout F parties: "alice" 2024-03-04T14:09:19.762259865+08:00 stdout F parties: "bob" 2024-03-04T14:09:19.762262405+08:00 stdout F addresses: "0.0.0.0:30726" 2024-03-04T14:09:19.762264962+08:00 stdout F addresses: "secretflow-task-20240304140911-single-psi-0-fed.bob.svc:80" 2024-03-04T14:09:19.762267511+08:00 stdout F } 2024-03-04T14:09:19.762270296+08:00 stdout F spu_configs { 2024-03-04T14:09:19.762273098+08:00 stdout F name: "spu" 2024-03-04T14:09:19.762275528+08:00 stdout F parties: "alice" 2024-03-04T14:09:19.762278355+08:00 stdout F parties: "bob" 2024-03-04T14:09:19.76228117+08:00 stdout F addresses: "0.0.0.0:30731" 2024-03-04T14:09:19.76228398+08:00 stdout F addresses: "http://secretflow-task-20240304140911-single-psi-0-spu.bob.svc:80" 2024-03-04T14:09:19.762286697+08:00 stdout F } 2024-03-04T14:09:19.762289449+08:00 stdout F } 2024-03-04T14:09:19.762292103+08:00 stdout F private_config { 2024-03-04T14:09:19.762294742+08:00 stdout F self_party: "alice" 2024-03-04T14:09:19.762297499+08:00 stdout F ray_head_addr: "secretflow-task-20240304140911-single-psi-0-global.alice.svc:30727" 2024-03-04T14:09:19.762300319+08:00 stdout F } 2024-03-04T14:09:19.762302969+08:00 stdout F 2024-03-04T14:09:19.762305837+08:00 stdout F -- 2024-03-04T14:09:19.76230829+08:00 stdout F 2024-03-04T14:09:19.762947396+08:00 stdout F 2024-03-04 
06:09:19,762|alice|WARNING|secretflow|driver.py:init:432| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment. 2024-03-04T14:09:19.764042611+08:00 stderr F 2024-03-04 06:09:19,763 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: secretflow-task-20240304140911-single-psi-0-global.alice.svc:30727... 2024-03-04T14:09:19.77112593+08:00 stderr F [2024-03-04 06:09:19,771 I 7 7] global_state_accessor.cc:357: This node has an IP address of 10.88.0.3, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container. 2024-03-04T14:09:19.772829614+08:00 stderr F 2024-03-04 06:09:19,772 INFO worker.py:1538 -- Connected to Ray cluster. 2024-03-04T14:09:19.800321672+08:00 stderr F 2024-03-04 06:09:19 INFO api.py:147 [alice] -- Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '0.0.0.0:30726', 'bob': 'http://secretflow-task-20240304140911-single-psi-0-fed.bob.svc:80'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}} 2024-03-04T14:09:19.804385665+08:00 stderr F 2024-03-04 06:09:19 INFO cleanup.py:58 [alice] -- Start check sending thread. 2024-03-04T14:09:19.807289239+08:00 stderr F 2024-03-04 06:09:19 INFO cleanup.py:67 [alice] -- Start check sending monitor thread. 
2024-03-04T14:09:19.807378875+08:00 stderr F 2024-03-04 06:09:19 DEBUG barriers.py:389 [alice] -- Starting ReceiverProxyActor with options: {'max_concurrency': 1, 'max_task_retries': 3, 'max_restarts': 1} 2024-03-04T14:09:23.712916081+08:00 stderr F ^[[2m^[[33m(raylet)^[[0m [2024-03-04 06:09:19,475 I 424 424] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:23.712943622+08:00 stderr F ^[[2m^[[36m(pid=gcs_server)^[[0m [2024-03-04 06:09:17,833 I 148 148] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:23.712949516+08:00 stderr F ^[[2m^[[33m(raylet)^[[0m [2024-03-04 06:09:20,343 I 663 663] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:23.71296946+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m I0304 06:09:20.679015 663 external/com_github_brpc_brpc/src/brpc/server.cpp:1127] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=30726. 2024-03-04T14:09:23.712975319+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m I0304 06:09:20.680732 663 external/com_github_brpc_brpc/src/brpc/server.cpp:1130] Check out http://secretflow-task-20240304140911-single-psi-0:30726 in web browser. 2024-03-04T14:09:23.713000899+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m I0304 06:09:22.391448 831 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240304.060922.663/id.db and ./rpc_data/rpcz/20240304.060922.663/time.db 2024-03-04T14:09:23.713007788+08:00 stderr F 2024-03-04 06:09:23 INFO barriers.py:406 [alice] -- Succeeded to create receiver proxy actor. 2024-03-04T14:09:23.713031078+08:00 stderr F 2024-03-04 06:09:23 INFO barriers.py:438 [alice] -- Try ping ['bob'] at 0 attemp, up to 3600 attemps. 
2024-03-04T14:09:23.719445357+08:00 stderr F 2024-03-04 06:09:23 WARNING psi.py:180 [alice] -- {'cluster_def': {'nodes': [{'party': 'alice', 'address': '0.0.0.0:30731', 'listen_address': ''}, {'party': 'bob', 'address': 'http://secretflow-task-20240304140911-single-psi-0-spu.bob.svc:80', 'listen_address': ''}], 'runtime_config': {'protocol': 1, 'field': 2}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}} 2024-03-04T14:09:23.783323282+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:23 DEBUG barriers.py:345 [alice] -- Sending send data to seq_id ping of bob from ping without credentials. 2024-03-04T14:09:23.783343427+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:23 DEBUG barriers.py:356 [alice] -- Succeeded to send send data to seq_id ping of bob from ping. Response is True 2024-03-04T14:09:23.783350676+08:00 stderr F 2024-03-04 06:09:23 DEBUG fed_actor.py:76 [alice] -- Actor method call: psi_csv, num_returns: 1 2024-03-04T14:09:26.310701498+08:00 stderr F ^[[2m^[[33m(raylet)^[[0m [2024-03-04 06:09:24,310 I 835 835] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:26.310717152+08:00 stderr F 2024-03-04 06:09:26 DEBUG pyu.py:105 [alice] -- PYU remote function: <function barrier.. at 0x7f7fe4b84af0>, num_returns=None, args len: 0, kwargs len: 0. 2024-03-04T14:09:26.311014621+08:00 stderr F 2024-03-04 06:09:26 DEBUG pyu.py:105 [alice] -- PYU remote function: <function barrier.. at 0x7f7fe4b84790>, num_returns=None, args len: 0, kwargs len: 0. 2024-03-04T14:09:27.924944095+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG barriers.py:345 [alice] -- Sending send data to seq_id 7 of bob from 5#0 without credentials. 
2024-03-04T14:09:27.924967511+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG barriers.py:356 [alice] -- Succeeded to send send data to seq_id 7 of bob from 5#0. Response is True 2024-03-04T14:09:27.924974551+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG link.py:91 [alice] -- Getting data for 7 from 6#0 of bob 2024-03-04T14:09:27.924980021+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG link.py:104 [alice] -- Received data for ping from ping. 2024-03-04T14:09:27.924985955+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG link.py:104 [alice] -- Received data for 7 from 6#0. 2024-03-04T14:09:27.924991405+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:26 DEBUG link.py:110 [alice] -- Getted data for 7 from 6#0 of bob. 2024-03-04T14:09:27.9249969+08:00 stderr F ^[[2m^[[33m(raylet)^[[0m [2024-03-04 06:09:26,867 I 982 982] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:27.925002554+08:00 stderr F ^[[2m^[[33m(raylet)^[[0m [2024-03-04 06:09:26,918 I 981 981] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-03-04T14:09:27.92500782+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:27 DEBUG barriers.py:345 [alice] -- Sending send data to seq_id 10 of bob from 8#0 without credentials. 2024-03-04T14:09:27.92503368+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:27 DEBUG barriers.py:356 [alice] -- Succeeded to send send data to seq_id 10 of bob from 8#0. 
Response is True 2024-03-04T14:09:27.925039234+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:27 DEBUG link.py:91 [alice] -- Getting data for 10 from 9#0 of bob 2024-03-04T14:09:27.925044989+08:00 stderr F 2024-03-04 06:09:27 INFO cleanup.py:77 [alice] -- Notify check sending thread to exit. 2024-03-04T14:09:27.977195594+08:00 stderr F 2024-03-04 06:09:27 INFO cleanup.py:106 [alice] -- Check sending thread was exited. 2024-03-04T14:09:27.977411357+08:00 stderr F 2024-03-04 06:09:27 INFO api.py:219 [alice] -- Shutdowned rayfed. 2024-03-04T14:09:28.88474285+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:27 DEBUG link.py:104 [alice] -- Received data for 10 from 9#0. 2024-03-04T14:09:28.884755382+08:00 stderr F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m 2024-03-04 06:09:27 DEBUG link.py:110 [alice] -- Getted data for 10 from 9#0 of bob. 2024-03-04T14:09:28.884757885+08:00 stderr F 2024-03-04 06:09:28 WARNING entry.py:125 [alice] -- 2024-03-04T14:09:28.884759794+08:00 stderr F -- 2024-03-04T14:09:28.884761715+08:00 stderr F res 2024-03-04T14:09:28.884763424+08:00 stderr F 2024-03-04T14:09:28.884765201+08:00 stderr F outputs { 2024-03-04T14:09:28.884767361+08:00 stderr F name: "psi-output.csv" 2024-03-04T14:09:28.884769836+08:00 stderr F type: "sf.table.vertical_table" 2024-03-04T14:09:28.884771603+08:00 stderr F system_info { 2024-03-04T14:09:28.884773377+08:00 stderr F } 2024-03-04T14:09:28.884775267+08:00 stderr F meta { 2024-03-04T14:09:28.884777301+08:00 stderr F type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" 2024-03-04T14:09:28.884790816+08:00 stderr F value: 
"\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\003int\020\244M" 2024-03-04T14:09:28.884796281+08:00 stderr F } 2024-03-04T14:09:28.884798094+08:00 stderr F data_refs { 2024-03-04T14:09:28.884800083+08:00 stderr F uri: "psi-output.csv" 2024-03-04T14:09:28.884801893+08:00 stderr F party: "alice" 2024-03-04T14:09:28.884803565+08:00 stderr F format: "csv" 2024-03-04T14:09:28.884805446+08:00 stderr F } 2024-03-04T14:09:28.884807233+08:00 stderr F data_refs { 2024-03-04T14:09:28.884809113+08:00 stderr F uri: "psi-output.csv" 2024-03-04T14:09:28.884810923+08:00 stderr F party: "bob" 2024-03-04T14:09:28.884812732+08:00 stderr F format: "csv" 2024-03-04T14:09:28.884814556+08:00 stderr F } 2024-03-04T14:09:28.884816305+08:00 stderr F } 2024-03-04T14:09:28.884817884+08:00 stderr F 
2024-03-04T14:09:28.884819651+08:00 stderr F -- 2024-03-04T14:09:28.884827044+08:00 stderr F 2024-03-04T14:09:28.915774931+08:00 stderr F 2024-03-04 06:09:28 INFO entry.py:288 [alice] -- Succeeded to run component. 2024-03-04T14:09:28.916718876+08:00 stderr F 2024-03-04 06:09:28 INFO cleanup.py:77 [alice] -- Notify check sending thread to exit. 2024-03-04T14:09:28.917581925+08:00 stdout F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m [2024-03-04 06:09:20.684] [info] [default_brpc_retry_policy.cc:29] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[7d7272454703484b];[content-length]:[145];[kuscia-error-message]:[Domain alice.root-kuscia-lite-alice<--Domain bob.root-kuscia-lite-bob<--192.168.50.100 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[7d7272454703484b];[x-envoy-upstream-service-time]:[2];[date]:[Mon, 04 Mar 2024 06:09:20 GMT];[server]:[envoy];', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. 
reset reason: connection failure, transport failure reason: delayed connect error: 111' 2024-03-04T14:09:28.917585338+08:00 stdout F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m [2024-03-04 06:09:20.684] [info] [default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry 2024-03-04T14:09:28.917587789+08:00 stdout F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m [2024-03-04 06:09:21.686] [info] [default_brpc_retry_policy.cc:29] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[e9a9be85241c6897];[content-length]:[145];[kuscia-error-message]:[Domain alice.root-kuscia-lite-alice<--Domain bob.root-kuscia-lite-bob<--192.168.50.100 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[e9a9be85241c6897];[x-envoy-upstream-service-time]:[1];[date]:[Mon, 04 Mar 2024 06:09:21 GMT];[server]:[envoy];', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111 [R1][E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111' 2024-03-04T14:09:28.917589603+08:00 stdout F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m [2024-03-04 06:09:21.686] [info] [default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry 2024-03-04T14:09:28.917592989+08:00 stdout F ^[[2m^[[36m(SenderReceiverProxyActor pid=663)^[[0m [2024-03-04 06:09:22.686] [info] [default_brpc_retry_policy.cc:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111 [R1][E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. 
reset reason: connection failure, transport failure reason: delayed connect error: 111 [R2][E1008]Reached timeout=2000ms @172.18.0.2:80' 2024-03-04T14:09:28.917595008+08:00 stdout F ^[[2m^[[36m(SPURuntime pid=835)^[[0m 2024-03-04 06:09:25.226 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[669b78f756cc9736];[content-length]:[145];[kuscia-error-message]:[Domain alice.root-kuscia-lite-alice<--Domain bob.root-kuscia-lite-bob<--192.168.50.100 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[669b78f756cc9736];[x-envoy-upstream-service-time]:[2];[date]:[Mon, 04 Mar 2024 06:09:25 GMT];[server]:[envoy];', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111' 2024-03-04T14:09:28.917596863+08:00 stdout F ^[[2m^[[36m(SPURuntime pid=835)^[[0m 2024-03-04 06:09:25.226 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry 2024-03-04T14:09:28.917598712+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.235 [info] [bucket_psi.cc:Init:315] bucket size set to 1048576 2024-03-04T14:09:28.917603996+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.235 [info] [bucket_psi.cc:CheckInput:229] Begin sanity check for input file: /home/kuscia/var/storage/data/alice.csv, precheck_switch:true 2024-03-04T14:09:28.917611188+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.245 [info] [csv_checker.cc:CsvChecker:121] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/home/kuscia/var/storage/data --stable selected-keys.1709532566236242913 | LC_ALL=C uniq -d > duplicate-keys.1709532566236242913 2024-03-04T14:09:28.917613385+08:00 stdout F 
^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.282 [info] [bucket_psi.cc:CheckInput:246] End sanity check for input file: /home/kuscia/var/storage/data/alice.csv, size=9892 2024-03-04T14:09:28.917615677+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.282 [info] [bucket_psi.cc:Run:183] Skip doing psi, because dataset has been aligned! 2024-03-04T14:09:28.917617622+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.282 [info] [bucket_psi.cc:ProduceOutput:267] Begin post filtering, indices.size=9892, should_sort=false 2024-03-04T14:09:28.917619562+08:00 stdout F ^[[2m^[[36m(SPURuntime(device_id=None, party=alice) pid=835)^[[0m 2024-03-04 06:09:26.295 [info] [bucket_psi.cc:ProduceOutput:305] End post filtering, in=/home/kuscia/var/storage/data/alice.csv, out=/home/kuscia/var/storage/data/psi-output.csv
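
The 503 entries in the log above are written at [info] level by default_brpc_retry_policy.cc and report "delayed connect error: 111" (errno 111 is ECONNREFUSED on Linux). The same log later shows the ping to bob succeeding and ends with "Succeeded to run component", so these look like startup connection retries rather than a task failure. A minimal sketch for separating such retry notices from fatal errors, assuming each line follows the "<timestamp> <stdout|stderr> <F|P> <message>" layout seen here; the summarize helper and the abbreviated sample lines are illustrative only:

```python
import re

# Container log layout seen in this thread:
# "<RFC3339 timestamp> <stdout|stderr> <F|P> <message>"
CRI_LINE = re.compile(r"^(\S+) (stdout|stderr) ([FP]) (.*)$")


def summarize(lines):
    """Count HTTP 503 retry notices and Python-level fatal errors."""
    summary = {"http_503": 0, "fatal": 0}
    for line in lines:
        match = CRI_LINE.match(line)
        if not match:
            continue  # skip anything that is not a container log line
        message = match.group(4)
        if "http status code '503'" in message:
            summary["http_503"] += 1
        if message.startswith("Traceback") or message.strip().endswith("AssertionError"):
            summary["fatal"] += 1
    return summary


# Abbreviated lines taken from the log above (ANSI prefixes and long tails removed).
sample = [
    "2024-03-04T14:09:28.917581925+08:00 stdout F [2024-03-04 06:09:20.684] [info] "
    "[default_brpc_retry_policy.cc:29] cntl ErrorCode '1010', http status code '503'",
    "2024-03-04T14:09:28.917585338+08:00 stdout F [2024-03-04 06:09:20.684] [info] "
    "[default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry",
    "2024-03-04T14:09:28.915774931+08:00 stderr F 2024-03-04 06:09:28 INFO "
    "entry.py:288 [alice] -- Succeeded to run component.",
]

print(summarize(sample))
```

On this sample the sketch counts one 503 retry notice and no fatal error, which matches the Succeeded phase reported for the job.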

gshilei commented 8 months ago

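@linushio (screenshot)

The error log above shows that a component required by the task issued by the platform cannot be found in the secretflow engine image. This is most likely caused by a version mismatch.

Did you follow this document when deploying the platform? I recommend deploying kuscia with the kuscia version that matches the platform image: https://www.secretflow.org.cn/zh-CN/docs/secretpad/latest/zgnd8oqo5chsqhzm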
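
To illustrate the failure mode, here is a minimal sketch of a component lookup keyed by (domain, name, version). The traceback in this thread only shows get_comp_def(param.domain, param.name, param.version) ending in "assert key in COMP_MAP"; the registry contents, the key format, and the "0.0.2" version below are assumptions made for illustration, not secretflow's actual code.

```python
# Hypothetical stand-in for the engine's component registry. The real
# COMP_MAP and its key format are not shown in this thread.
COMP_MAP = {
    ("preprocessing", "psi", "0.0.1"): "psi component definition",
}


def get_comp_def(domain, name, version):
    """Return the registered definition or fail with a descriptive error."""
    key = (domain, name, version)
    if key not in COMP_MAP:
        raise LookupError(
            f"component {domain}/{name}:{version} is not registered in this engine image"
        )
    return COMP_MAP[key]


# A version the engine knows about resolves normally.
print(get_comp_def("preprocessing", "psi", "0.0.1"))

# A version requested by a newer platform (hypothetical "0.0.2") does not.
try:
    get_comp_def("preprocessing", "psi", "0.0.2")
except LookupError as exc:
    print(exc)
```

A bare assert, as in the traceback, reports only AssertionError; raising an error that names the requested component makes a platform/engine version mismatch easier to spot.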

linushio commented 8 months ago

I followed this document: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.5.0b0/deployment/deploy_master_lite_cn#id5

Chrisdehe commented 8 months ago

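@linushio Anyway, this has been narrowed down to a version compatibility issue. I suggest keeping kuscia on its current version for now and updating secretpad to the newest release. (Your "latest" image was the newest at the time you downloaded it, so it needs to be updated now.)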

linushio commented 8 months ago

kuscia is on 0.5.0b0, and secretpad is on the latest version.

Chrisdehe commented 8 months ago

Please check the output of both docker images and docker ps.

linushio commented 8 months ago

107030ae40e0 secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad:latest "/bin/sh -c 'java ${…" 2 hours ago Up 2 hours 80/tcp, 9001/tcp, 0.0.0.0:8088->8080/tcp, :::8088->8080/tcp root-kuscia-secretpad
6efa3d5e8f4a secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0 "tini -- bin/kuscia …" 2 hours ago Up 2 hours 0.0.0.0:18080->1080/tcp, :::18080->1080/tcp, 0.0.0.0:18082->8082/tcp, :::18082->8082/tcp, 0.0.0.0:13083->8083/tcp, :::13083->8083/tcp root-kuscia-master
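
When comparing versions from docker ps output like the above, it helps to pull the tag out of each image reference. The sketch below does that for the two references listed; split_image_ref is a hypothetical helper, and the remark about "latest" only reflects the general Docker behavior that this tag points at whatever was newest when the image was pulled.

```python
def split_image_ref(ref):
    """Split 'registry/repo/name:tag' into (name, tag); tag defaults to 'latest'."""
    last_segment = ref.rsplit("/", 1)[-1]  # drop registry and namespace
    name, _, tag = last_segment.partition(":")
    return name, tag or "latest"


# Image references copied from the docker ps output above.
images = [
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad:latest",
    "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0",
]

for ref in images:
    name, tag = split_image_ref(ref)
    note = " (floating tag, fixed at pull time)" if tag == "latest" else ""
    print(f"{name}: {tag}{note}")
```

Here secretpad runs a floating "latest" tag while kuscia is pinned to 0.5.0b0, so the two may not correspond to the same release.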

Chrisdehe commented 8 months ago

Please try updating kuscia to the latest version.

gshilei commented 8 months ago

In the master container, you can check the sf image version with this command: kubectl get appimage secretflow-image -o yaml | grep image -A 3

linushio commented 8 months ago

I did update kuscia earlier, but then an exception occurs at the data table step: docker exec -it ${USER}-kuscia-lite-alice curl https://127.0.0.1:8070/api/v1/datamesh/domaindatagrant/create -X POST -H 'content-type: application/json' -d '{"author":"alice","domaindata_id":"alice-table","grant_domain":"bob"}' --cacert var/certs/ca.crt --cert var/certs/ca.crt --key var/certs/ca.key

linushio commented 8 months ago

Sorry, the -A value above prints too little output. Please change it to -A 10: kubectl get appimage secretflow-image -o yaml | grep image -A 10

Is the kuscia image you are using now secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest?

linushio commented 8 months ago
 {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"AppImage","metadata":{"annotations":{},"name":"secretflow-image"},"spec":{"configTemplates":{"task-config.conf":"{\n  \"task_id\": \"{{.TASK_ID}}\",\n  \"task_input_config\": \"{{.TASK_INPUT_CONFIG}}\",\n  \"task_cluster_def\": \"{{.TASK_CLUSTER_DEFINE}}\",\n  \"allocated_ports\": \"{{.ALLOCATED_PORTS}}\"\n}\n"},"deployTemplates":[{"name":"secretflow","replicas":1,"spec":{"containers":[{"args":["-c","python -m secretflow.kuscia.entry ./kuscia/task-config.conf"],"command":["sh"],"configVolumeMounts":[{"mountPath":"/root/kuscia/task-config.conf","subPath":"task-config.conf"}],"name":"secretflow","ports":[{"name":"spu","port":20000,"protocol":"GRPC","scope":"Cluster"},{"name":"fed","port":20001,"protocol":"GRPC","scope":"Cluster"},{"name":"global","port":20002,"protocol":"GRPC","scope":"Domain"},{"name":"node-manager","port":20003,"protocol":"GRPC","scope":"Local"},{"name":"object-manager","port":20004,"protocol":"GRPC","scope":"Local"},{"name":"client-server","port":20005,"protocol":"GRPC","scope":"Local"}],"workingDir":"/root"}],"restartPolicy":"Never"}}],"image":{"id":"abc","name":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8","sign":"abc","tag":"1.3.0.dev20231120"}}}

  creationTimestamp: "2024-03-04T05:51:55Z"
  generation: 1
  name: secretflow-image
  resourceVersion: "322"
  uid: 7c47ce32-c302-4e71-9ee3-e30fb7c55785
spec:

  image:
    id: abc
    name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
    sign: abc

[root@txl-kuscia-master templates]# kubectl get appimage secretflow-image -o yaml | grep image -A 10
{"apiVersion":"kuscia.secretflow/v1alpha1","kind":"AppImage","metadata":{"annotations":{},"name":"secretflow-image"},"spec":{"configTemplates":{"task-config.conf":"{\n \"task_id\": \"{{.TASK_ID}}\",\n \"task_input_config\": \"{{.TASK_INPUT_CONFIG}}\",\n \"task_cluster_def\": \"{{.TASK_CLUSTER_DEFINE}}\",\n \"allocated_ports\": \"{{.ALLOCATED_PORTS}}\"\n}\n"},"deployTemplates":[{"name":"secretflow","replicas":1,"spec":{"containers":[{"args":["-c","python -m secretflow.kuscia.entry ./kuscia/task-config.conf"],"command":["sh"],"configVolumeMounts":[{"mountPath":"/root/kuscia/task-config.conf","subPath":"task-config.conf"}],"name":"secretflow","ports":[{"name":"spu","port":20000,"protocol":"GRPC","scope":"Cluster"},{"name":"fed","port":20001,"protocol":"GRPC","scope":"Cluster"},{"name":"global","port":20002,"protocol":"GRPC","scope":"Domain"},{"name":"node-manager","port":20003,"protocol":"GRPC","scope":"Local"},{"name":"object-manager","port":20004,"protocol":"GRPC","scope":"Local"},{"name":"client-server","port":20005,"protocol":"GRPC","scope":"Local"}],"workingDir":"/root"}],"restartPolicy":"Never"}}],"image":{"id":"abc","name":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8","sign":"abc","tag":"1.3.0.dev20231120"}}}
  creationTimestamp: "2024-03-04T05:51:55Z"
  generation: 1
  name: secretflow-image
  resourceVersion: "322"
  uid: 7c47ce32-c302-4e71-9ee3-e30fb7c55785
spec:
  configTemplates:
    task-config.conf: |
      {
        "task_id": "{{.TASK_ID}}",
        "task_input_config": "{{.TASK_INPUT_CONFIG}}",
        "task_cluster_def": "{{.TASK_CLUSTER_DEFINE}}",
        "allocated_ports": "{{.ALLOCATED_PORTS}}"

  image:
    id: abc
    name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
    sign: abc
    tag: 1.3.0.dev20231120

linushio commented 8 months ago

I am currently using kuscia:0.5.0b0 and secretpad:latest.

Chrisdehe commented 8 months ago

The built-in sf version is too old. The latest secretpad is compatible with sf version secretflow/secretflow-lite-anolis8:1.4.0.dev24011601. Please update the sf version.

linushio commented 8 months ago

Is it enough to update this part of the deploy.sh script? I will redeploy shortly and see. [image]

gshilei commented 8 months ago

The problem is that the secretflow version bundled with kuscia:0.5.0b0, secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120, does not match the latest version of the secretpad platform, so the secretflow version needs to be upgraded.

There are two approaches that may work:

  1. Use the latest kuscia image, secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest
  2. Upgrade secretflow to secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0

How about trying the secretflow upgrade first:

  1. docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0
  2. docker save secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0 -o sf-130b0.tar
  3. docker cp sf-130b0.tar $USER-kuscia-lite-alice:/home/kuscia
     docker cp sf-130b0.tar $USER-kuscia-lite-bob:/home/kuscia
  4. docker exec $USER-kuscia-lite-alice ctr -a=/home/kuscia/containerd/run/containerd.sock -n=k8s.io images import sf-130b0.tar
  5. docker exec $USER-kuscia-lite-bob ctr -a=/home/kuscia/containerd/run/containerd.sock -n=k8s.io images import sf-130b0.tar
  6. Log in to the master container and change the sf image version in the appimage to 1.3.0b0: kubectl edit appimage secretflow-image
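
Steps 1 to 5 above can be sketched as one bash function that loops over both lite nodes. This is only an illustration: `run`, `upgrade_sf_image`, and the `DRY_RUN` switch are hypothetical names that are not part of kuscia, and the sketch assumes the containers are named `${USER}-kuscia-lite-alice` and `${USER}-kuscia-lite-bob` as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-5 for both lite nodes. With DRY_RUN=1 the commands are
# only printed instead of executed (a hypothetical convenience for review).
SF_IMAGE="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0"
SF_TAR="sf-130b0.tar"
CTR_SOCK="/home/kuscia/containerd/run/containerd.sock"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

upgrade_sf_image() {
  local domain node
  run docker pull "$SF_IMAGE" || return 1
  run docker save -o "$SF_TAR" "$SF_IMAGE" || return 1
  for domain in alice bob; do
    node="${USER}-kuscia-lite-${domain}"
    # Copy the archive into the node, then import it into its containerd.
    run docker cp "$SF_TAR" "${node}:/home/kuscia" || return 1
    run docker exec "$node" ctr -a="$CTR_SOCK" -n=k8s.io images import "$SF_TAR" || return 1
  done
}
```

Running `DRY_RUN=1 upgrade_sf_image` prints the six commands for review. Step 6 (`kubectl edit appimage secretflow-image` in the master container) is interactive and is left out of the sketch.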
linushio commented 8 months ago

After switching to 1.3.0b0, the local jobs all failed. I will try latest now.

gshilei commented 8 months ago

Try this image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601

haha-zwx-ooo commented 8 months ago

The latest secretpad version matches:
kuscia: secretflow/kuscia:0.5.0b0
secretflow: secretflow/secretflow-lite-anolis8:1.4.0.dev24011601

linushio commented 8 months ago

After the change, the local test job fails to run:
2024-03-04 17:36:16.796 INFO status/status_manager.go:625 Patch status for pod "secretflow-task-20240304173606-single-psi-0_alice(a682e318-d21c-4646-a3a8-a405ff18fbe3)", patch={"metadata":{"uid":"a682e318-d21c-4646-a3a8-a405ff18fbe3"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T09:36:16Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T09:36:16Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://0ee0029ee57045ecd18d1e364fda690d685f8f7406abdf980518450cc36fe28b","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601","imageID":"sha256:86495c6fde3238e2c17702340900ddf429b5dd76b79976fffab82d56de38efea","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://0ee0029ee57045ecd18d1e364fda690d685f8f7406abdf980518450cc36fe28b","exitCode":1,"finishedAt":"2024-03-04T09:36:15Z","message":" name: \"min_frequency\"\n desc: \"Specifies the minimum frequency below which a category will be considered infrequent, [0, 1), 0 disable\"\n type: AT_FLOAT\n atomic {\n is_optional: true\n default_value {\n }\n lower_bound_enabled: true\n lower_bound {\n }\n lower_bound_inclusive: true\n upper_bound_enabled: true\n upper_bound {\n f: 1.0\n }\n }\n }\n attrs {\n name: \"report_rules\"\n desc: \"Whether to report rule details\"\n type: AT_BOOL\n atomic {\n is_optional: true\n default_value {\n b: true\n }\n }\n }\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"features\"\n desc: \"Features to encode.\"\n }\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n outputs {\n name: 
\"out_rules\"\n desc: \"onehot rule\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"report\"\n desc: \"report rules details if report_rules is true\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"substitution\"\n desc: \"unified substitution component\"\n version: \"0.0.2\"\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"input_rules\"\n desc: \"Input preprocessing rules\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"vert_bin_substitution\"\n desc: \"Substitute datasets\' value by bin substitution rules.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Vertical partitioning dataset to be substituted.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"bin_rule\"\n desc: \"Input bin substitution rule.\"\n types: \"sf.rule.binning\"\n }\n outputs {\n name: \"output_data\"\n desc: \"Output vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"groupby_statistics\"\n desc: \"Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.\"\n version: \"0.0.3\"\n attrs {\n name: \"aggregation_config\"\n desc: \"input groupby aggregation config\"\n type: AT_CUSTOM_PROTOBUF\n custom_protobuf_cls: \"groupby_aggregation_config_pb2.GroupbyAggregationConfig\"\n }\n attrs {\n name: \"max_group_size\"\n desc: \"The maximum number of groups allowed\"\n type: AT_INT\n atomic {\n is_optional: true\n default_value {\n i64: 10000\n }\n lower_bound_enabled: true\n lower_bound {\n }\n upper_bound_enabled: true\n upper_bound {\n i64: 10001\n }\n }\n }\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n attrs {\n name: 
\"by\"\n desc: \"by what columns should we group the values\"\n col_min_cnt_inclusive: 1\n col_max_cnt_inclusive: 4\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output groupby statistics report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_pearsonr\"\n desc: \"Calculate Pearson\'s product-moment correlation coefficient for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate correlation coefficient with. If empty, all features will be used\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Pearson\'s product-moment correlation coefficient report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_vif\"\n desc: \"Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate VIF with. If empty, all features will be used.\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Variance Inflation Factor(VIF) report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"table_statistics\"\n desc: \"Get a table of statistics,\nincluding each column\'s\n1. datatype\n2. total_count\n3. count\n4. count_na\n5. na_ratio\n6. min\n7. max\n8. mean\n9. var\n10. std\n11. sem\n12. skewness\n13. kurtosis\n14. q1\n15. q2\n16. q3\n17. moment_2\n18. moment_3\n19. moment_4\n20. 
central_moment_2\n21. central_moment_3\n22. central_moment_4\n23. sum\n24. sum_2\n25. sum_3\n26. sum_4\n- moment_2 means E[X^2].\n- central_moment_2 means E[(X - mean(X))^2].\n- sum_2 means sum(X^2).\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n }\n outputs {\n name: \"report\"\n desc: \"Output table statistics report.\"\n types: \"sf.report\"\n }\n}\n\n","reason":"Error","startedAt":"2024-03-04T09:36:09Z"}}}]}}
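
A status patch like the one above is hard to scan by eye. As a rough sketch, the container image and exit code can be pulled out of such a line with `grep` and `sed`, assuming the log line is piped in on stdin; `pod_failure_summary` is a hypothetical helper name, not a kuscia tool.

```shell
#!/usr/bin/env bash
# Read log lines on stdin and print each "image" and "exitCode" field found
# in the JSON status patch, one per line, with the double quotes stripped.
# The pattern '"image":"' does not match the neighbouring "imageID" key.
pod_failure_summary() {
  grep -oE '"image":"[^"]*"|"exitCode":[0-9]+' | sed 's/"//g'
}
```

For example, `docker logs ${USER}-kuscia-lite-alice 2>&1 | grep 'Patch status for pod' | pod_failure_summary` would be one way to use it, on the assumption that these lines appear in the container's standard output.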

linushio commented 8 months ago

I noticed that deploy.sh also contains two kuscia image addresses. Should they be changed to 0.5.0b0 as well?

if [[ ${KUSCIA_IMAGE} == "" ]]; then
  KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest
fi
log "KUSCIA_IMAGE=${KUSCIA_IMAGE}"

if [[ "$SECRETFLOW_IMAGE" == "" ]]; then
  SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
fi
log "SECRETFLOW_IMAGE=${SECRETFLOW_IMAGE}"

SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow"

gshilei commented 8 months ago

Yes, the kuscia version is wrong. It needs to be changed to 0.5.0b0:

KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0

SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
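
Because the deploy.sh snippet quoted above assigns its defaults only when the variables are empty, exporting both variables before running the script should have the same effect as editing the file. Below is a minimal sketch of that default-if-empty pattern, written with the `${VAR:-default}` expansion in place of the `if` blocks. `resolve_image` and `REGISTRY` are hypothetical names that do not come from deploy.sh.

```shell
#!/usr/bin/env bash
REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow"

# Print the caller-supplied image if it is non-empty, otherwise the default.
resolve_image() {
  local supplied="$1" default="$2"
  printf '%s\n' "${supplied:-$default}"
}

# Values already exported in the environment win over the defaults.
KUSCIA_IMAGE="$(resolve_image "${KUSCIA_IMAGE:-}" "${REGISTRY}/kuscia:0.5.0b0")"
SECRETFLOW_IMAGE="$(resolve_image "${SECRETFLOW_IMAGE:-}" "${REGISTRY}/secretflow-lite-anolis8:1.4.0.dev24011601")"
echo "KUSCIA_IMAGE=${KUSCIA_IMAGE}"
echo "SECRETFLOW_IMAGE=${SECRETFLOW_IMAGE}"
```

With this pattern the pinned versions live in one place, and a one-off override needs only an `export` before the script runs.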

linushio commented 8 months ago

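After deploying the latest version and following the documentation, I ran into several problems:

  1. When downloading the datasets, the path given in the documentation does not contain the alice.csv/bob.csv datasets in the latest image.
  2. When creating the grant for the test data table, the token shown in the example documentation does not exist. After I removed the token header, the command ran with no error and no output.
  3. Running the example job reports an error.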
linushio commented 8 months ago

Does SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow" need to be changed as well?

gshilei commented 8 months ago

No, SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow" does not need to be changed.

linushio commented 8 months ago

The local test job still fails:
2024-03-04 17:49:23.283 INFO status/status_manager.go:625 Patch status for pod "secretflow-task-20240304174913-single-psi-0_alice(48587eff-432b-472e-b5e2-af04eded6fb6)", patch={"metadata":{"uid":"48587eff-432b-472e-b5e2-af04eded6fb6"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T09:49:23Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T09:49:23Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://56814abe86a308117f76fdbf28ae9abf975f80b719b700f11a39ecdeb6d31035","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601","imageID":"sha256:86495c6fde3238e2c17702340900ddf429b5dd76b79976fffab82d56de38efea","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://56814abe86a308117f76fdbf28ae9abf975f80b719b700f11a39ecdeb6d31035","exitCode":1,"finishedAt":"2024-03-04T09:49:22Z","message":" name: \"min_frequency\"\n desc: \"Specifies the minimum frequency below which a category will be considered infrequent, [0, 1), 0 disable\"\n type: AT_FLOAT\n atomic {\n is_optional: true\n default_value {\n }\n lower_bound_enabled: true\n lower_bound {\n }\n lower_bound_inclusive: true\n upper_bound_enabled: true\n upper_bound {\n f: 1.0\n }\n }\n }\n attrs {\n name: \"report_rules\"\n desc: \"Whether to report rule details\"\n type: AT_BOOL\n atomic {\n is_optional: true\n default_value {\n b: true\n }\n }\n }\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"features\"\n desc: \"Features to encode.\"\n }\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n outputs {\n name: 
\"out_rules\"\n desc: \"onehot rule\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"report\"\n desc: \"report rules details if report_rules is true\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"substitution\"\n desc: \"unified substitution component\"\n version: \"0.0.2\"\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"input_rules\"\n desc: \"Input preprocessing rules\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"vert_bin_substitution\"\n desc: \"Substitute datasets\' value by bin substitution rules.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Vertical partitioning dataset to be substituted.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"bin_rule\"\n desc: \"Input bin substitution rule.\"\n types: \"sf.rule.binning\"\n }\n outputs {\n name: \"output_data\"\n desc: \"Output vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"groupby_statistics\"\n desc: \"Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.\"\n version: \"0.0.3\"\n attrs {\n name: \"aggregation_config\"\n desc: \"input groupby aggregation config\"\n type: AT_CUSTOM_PROTOBUF\n custom_protobuf_cls: \"groupby_aggregation_config_pb2.GroupbyAggregationConfig\"\n }\n attrs {\n name: \"max_group_size\"\n desc: \"The maximum number of groups allowed\"\n type: AT_INT\n atomic {\n is_optional: true\n default_value {\n i64: 10000\n }\n lower_bound_enabled: true\n lower_bound {\n }\n upper_bound_enabled: true\n upper_bound {\n i64: 10001\n }\n }\n }\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n attrs {\n name: 
\"by\"\n desc: \"by what columns should we group the values\"\n col_min_cnt_inclusive: 1\n col_max_cnt_inclusive: 4\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output groupby statistics report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_pearsonr\"\n desc: \"Calculate Pearson\'s product-moment correlation coefficient for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate correlation coefficient with. If empty, all features will be used\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Pearson\'s product-moment correlation coefficient report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_vif\"\n desc: \"Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate VIF with. If empty, all features will be used.\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Variance Inflation Factor(VIF) report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"table_statistics\"\n desc: \"Get a table of statistics,\nincluding each column\'s\n1. datatype\n2. total_count\n3. count\n4. count_na\n5. na_ratio\n6. min\n7. max\n8. mean\n9. var\n10. std\n11. sem\n12. skewness\n13. kurtosis\n14. q1\n15. q2\n16. q3\n17. moment_2\n18. moment_3\n19. moment_4\n20. 
central_moment_2\n21. central_moment_3\n22. central_moment_4\n23. sum\n24. sum_2\n25. sum_3\n26. sum_4\n- moment_2 means E[X^2].\n- central_moment_2 means E[(X - mean(X))^2].\n- sum_2 means sum(X^2).\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n }\n outputs {\n name: \"report\"\n desc: \"Output table statistics report.\"\n types: \"sf.report\"\n }\n}\n\n","reason":"Error","startedAt":"2024-03-04T09:49:17Z"}}}]}}

linushio commented 8 months ago

if [[ ${KUSCIA_IMAGE} == "" ]]; then
  KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0
fi
log "KUSCIA_IMAGE=${KUSCIA_IMAGE}"

if [[ "$SECRETFLOW_IMAGE" == "" ]]; then
  SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601

gshilei commented 8 months ago

Was the job that produced the error above submitted from the platform?

linushio commented 8 months ago

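Yes: docker exec -it ${USER}-kuscia-master scripts/user/create_example_job.sh

I have tried many versions so far. This job only runs through when kuscia is changed to 0.5.0b0 and everything else is left untouched, but running PSI on the secretpad platform still fails. The only change is at this step:
export KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0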

gshilei commented 8 months ago

Because the sf version was changed, the example bundled with kuscia may fail, so the task needs to be submitted from the platform.

  1. When submitting the task, check that the version in the appimage is secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601.
  2. If the task fails, follow the tutorial above and paste the engine log, so we can see whether it is the same problem as before.
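
The appimage check can be sketched as a small pre-flight function to run inside the master container before submitting a task. It assumes kubectl's JSONPath output and the `.spec.image.name` and `.spec.image.tag` fields that appear in the AppImage JSON earlier in this thread. `check_sf_appimage` and `EXPECTED_SF_IMAGE` are hypothetical names.

```shell
#!/usr/bin/env bash
EXPECTED_SF_IMAGE="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601"

# Returns 0 if the AppImage points at the expected image, 1 on a mismatch,
# and 2 if kubectl itself fails.
check_sf_appimage() {
  local actual
  actual="$(kubectl get appimage secretflow-image \
    -o jsonpath='{.spec.image.name}{":"}{.spec.image.tag}')" || return 2
  if [ "$actual" = "$EXPECTED_SF_IMAGE" ]; then
    echo "OK: $actual"
  else
    echo "MISMATCH: appimage has $actual, expected $EXPECTED_SF_IMAGE" >&2
    return 1
  fi
}
```

A mismatch here would suggest fixing the appimage (for example with `kubectl edit appimage secretflow-image`) before looking at engine logs.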
linushio commented 8 months ago

Thanks! secretpad is working now. [image]