Closed linushio closed 8 months ago
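After modifying the script to fetch the certificates from the other servers, I deployed SecretPad successfully, but the PSI algorithm reports an error. Here is the container log from the alice node: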
":"b2010715-6bbf-4001-90e6-04014fc4ef20"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-02T05:58:00Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-02T05:58:00Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://bd550f8449efb75a8b893aef3f0aad9774626959f5216af00343e6d0aebba8b3","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://bd550f8449efb75a8b893aef3f0aad9774626959f5216af00343e6d0aebba8b3","exitCode":1,"finishedAt":"2024-03-02T05:58:00Z","message":"WARNING:root:Since the GPL-licensed package unidecode
is not installed, using Python's unicodedata
package which yields worse results.\n2024-03-02 05:57:56,466|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='jtef-wfpdsawa-node-3-0-global.alice.svc', ray_node_manager_port=26768, ray_object_manager_port=26769, ray_client_server_port=26770, ray_worker_ports=[], ray_gcs_port=26767)\n2024-03-02 05:57:56,466|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at jtef-wfpdsawa-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=jtef-wfpdsawa-node-3-0-global.alice.svc --port=26767 --node-manager-port=26768 --object-manager-port=26769 --ray-client-server-port=26770\n2024-03-02 05:58:00,032|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-02 05:57:57,002\tINFO usage_lib.py:490 -- Usage stats collection is disabled.\n2024-03-02 05:57:57,002\tINFO scripts.py:702 -- Local node IP: jtef-wfpdsawa-node-3-0-global.alice.svc\n2024-03-02 05:57:59,898\tSUCC scripts.py:739 -- --------------------\n2024-03-02 05:57:59,898\tSUCC scripts.py:740 -- Ray runtime started.\n2024-03-02 05:57:59,898\tSUCC scripts.py:741 -- --------------------\n2024-03-02 05:57:59,898\tINFO scripts.py:743 -- Next steps\n2024-03-02 05:57:59,899\tINFO scripts.py:744 -- To connect to this Ray runtime from another node, run\n2024-03-02 05:57:59,899\tINFO scripts.py:747 -- ray start --address='jtef-wfpdsawa-node-3-0-global.alice.svc:26767'\n2024-03-02 05:57:59,899\tINFO scripts.py:763 -- Alternatively, use the following Python code:\n2024-03-02 05:57:59,899\tINFO scripts.py:765 -- import ray\n2024-03-02 05:57:59,899\tINFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='jtef-wfpdsawa-node-3-0-global.alice.svc')\n2024-03-02 05:57:59,899\tINFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to\n2024-03-02 05:57:59,899\tINFO 
scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following\n2024-03-02 05:57:59,899\tINFO scripts.py:789 -- Python code:\n2024-03-02 05:57:59,899\tINFO scripts.py:791 -- import ray\n2024-03-02 05:57:59,899\tINFO scripts.py:792 -- ray.init(address='ray://\u003chead_node_ip_address\u003e:26770')\n2024-03-02 05:57:59,899\tINFO scripts.py:801 -- To see the status of the cluster, use\n2024-03-02 05:57:59,899\tINFO scripts.py:802 -- ray status\n2024-03-02 05:57:59,899\tINFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.\n2024-03-02 05:57:59,899\tINFO scripts.py:820 -- To terminate the Ray runtime, run\n2024-03-02 05:57:59,899\tINFO scripts.py:821 -- ray stop\n\n2024-03-02 05:58:00,033|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at jtef-wfpdsawa-node-3-0-global.alice.svc.\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.8/runpy.py\", line 194, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/local/lib/python3.8/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 294, in \u003cmodule\u003e\n main()\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1157, in call\n return self.main(args, kwargs)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, ctx.params)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 783, in invoke\n return __callback(args, **kwargs)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 261, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 92, in 
preprocess_sf_node_eval_param\n comp_def = get_comp_def(param.domain, param.name, param.version)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/component/entry.py\", line 104, in get_comp_def\n assert key in COMP_MAP\nAssertionError\n","reason":"Error","startedAt":"2024-03-02T05:57:54Z"}}}]}}
I added the following script on the alice and bob servers respectively:
docker exec -it root-kuscia-lite-alice sh scripts/deploy/init_kusciaapi_client_certs.sh
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/kusciaapi-client.crt ./certs/client.crt
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/kusciaapi-client.key ./certs/client.pem
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/token ./certs/token
docker cp root-kuscia-lite-alice:/home/kuscia/var/certs/ca.crt ./certs/ca.crt
sudo scp ./certs/* app@192.168.50.192:/home/app/project/secretpad/temps/certs/alice
Below is the master container log. Does this mean that alice's certificate does not match?
2024-03-02 15:05:23.603 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:23.605 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:24.622 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:24.622 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:25.641 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:25.641 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:26.661 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:26.661 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-02 15:05:27.680 ERROR controller/regitser_node.go:256 public not match
2024-03-02 15:05:27.680 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
Judging from the error, the component is missing some configuration. Which task are you currently running? You can also check the Kuscia FAQ for a matching problem.
The error occurs when running PSI from SecretPad. Kuscia is deployed as version 0.5.0b0 in centralized mode, and SecretPad is latest.
Excerpt from the master error log:
2024-03-04 11:12:39.118 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:12:40.138 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:12:40.139 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:12:41.160 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:12:41.160 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:24.563 ERROR service/domaindata_grant.go:125 Query DomainDataGrant failed, error:domaindatagrants.kuscia.secretflow "alice-table-bob" not found
2024-03-04 11:13:24.637 ERROR service/domaindata_grant.go:125 Query DomainDataGrant failed, error:domaindatagrants.kuscia.secretflow "bob-table-alice" not found
2024-03-04 11:13:46.250 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:46.250 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:47.267 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:47.267 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:48.288 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:48.288 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:49.307 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:49.307 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:50.326 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:50.326 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:13:50.459 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again
2024-03-04 11:13:52.227 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again
2024-03-04 11:13:53.388 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again
2024-03-04 11:13:53.534 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again
2024-03-04 11:13:59.315 WARN resources/kusciatask.go:76 Failed to update kuscia task "snha-wocixsij-node-3" status since the resource version changed, skip updating it, last updating error: Operation cannot be fulfilled on kusciatasks.kuscia.secretflow "snha-wocixsij-node-3": the object has been modified; please apply your changes to the latest version and try again
2024-03-04 11:14:55.455 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:14:55.455 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:14:56.475 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:14:56.475 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:14:57.491 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:14:57.491 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:14:58.510 ERROR controller/regitser_node.go:256 public not match
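The master log above repeats the same two errors about once per second, which makes the rarer entries (such as the DomainDataGrant lookups) easy to miss. A minimal Python sketch for collapsing such a log into one count per source location follows. It assumes the `timestamp LEVEL file.go:line message` layout seen in the excerpt; `summarize_log` is a hypothetical helper name, not part of Kuscia.

```python
import re
from collections import Counter

# Assumed line layout, copied from the excerpt above:
#   2024-03-04 11:12:40.138 ERROR controller/regitser_node.go:256 public not match
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ "
    r"(?P<level>[A-Z]+) (?P<source>\S+:\d+) (?P<message>.*)$"
)

def summarize_log(text, levels=("ERROR", "WARN")):
    """Count log entries per (level, source location); hypothetical helper."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            counts[(match.group("level"), match.group("source"))] += 1
    return counts

sample = """\
2024-03-04 11:12:40.138 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:12:40.139 ERROR controller/regitser_node.go:222 domain alice register failed(token match error)
2024-03-04 11:12:41.160 ERROR controller/regitser_node.go:256 public not match
2024-03-04 11:13:24.563 ERROR service/domaindata_grant.go:125 Query DomainDataGrant failed, error:domaindatagrants.kuscia.secretflow "alice-table-bob" not found
"""

# Print the most frequent locations first.
for (level, source), count in summarize_log(sample).most_common():
    print(f"{count:4d}  {level:5s} {source}")
```

Grouping by source location rather than by full message keeps entries together even when the message embeds a changing task or table name.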
The alice error log:
2024-03-04 09:58:55.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:10.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:25.554 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 09:59:40.556 WARN controller/domain_route.go:223 request error, path: kuscia-handshake.bob.svc/handshake, code: 503, message: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
2024-03-04 11:13:58.990 INFO status/status_manager.go:625 Patch status for pod "snha-wocixsij-node-3-0_alice(a4e041b3-40d6-4335-9f54-15b4bc478c9b)", patch={"metadata":{"uid":"a4e041b3-40d6-4335-9f54-15b4bc478c9b"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T03:13:58Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T03:13:58Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://0061342aa79e83b37615af31472e1243a275babcca5fd817cf93bf9de3461871","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://0061342aa79e83b37615af31472e1243a275babcca5fd817cf93bf9de3461871","exitCode":1,"finishedAt":"2024-03-04T03:13:58Z","message":"WARNING:root:Since the GPL-licensed package unidecode
is not installed, using Python's unicodedata
package which yields worse results.\n2024-03-04 03:13:55,390|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='snha-wocixsij-node-3-0-global.alice.svc', ray_node_manager_port=24086, ray_object_manager_port=24087, ray_client_server_port=24088, ray_worker_ports=[], ray_gcs_port=24091)\n2024-03-04 03:13:55,390|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at snha-wocixsij-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=snha-wocixsij-node-3-0-global.alice.svc --port=24091 --node-manager-port=24086 --object-manager-port=24087 --ray-client-server-port=24088\n2024-03-04 03:13:57,854|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 03:13:55,924\tINFO usage_lib.py:490 -- Usage stats collection is disabled.\n2024-03-04 03:13:55,924\tINFO scripts.py:702 -- Local node IP: snha-wocixsij-node-3-0-global.alice.svc\n2024-03-04 03:13:57,695\tSUCC scripts.py:739 -- --------------------\n2024-03-04 03:13:57,695\tSUCC scripts.py:740 -- Ray runtime started.\n2024-03-04 03:13:57,695\tSUCC scripts.py:741 -- --------------------\n2024-03-04 03:13:57,695\tINFO scripts.py:743 -- Next steps\n2024-03-04 03:13:57,695\tINFO scripts.py:744 -- To connect to this Ray runtime from another node, run\n2024-03-04 03:13:57,695\tINFO scripts.py:747 -- ray start --address='snha-wocixsij-node-3-0-global.alice.svc:24091'\n2024-03-04 03:13:57,695\tINFO scripts.py:763 -- Alternatively, use the following Python code:\n2024-03-04 03:13:57,695\tINFO scripts.py:765 -- import ray\n2024-03-04 03:13:57,695\tINFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='snha-wocixsij-node-3-0-global.alice.svc')\n2024-03-04 03:13:57,695\tINFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to\n2024-03-04 03:13:57,695\tINFO 
scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following\n2024-03-04 03:13:57,695\tINFO scripts.py:789 -- Python code:\n2024-03-04 03:13:57,695\tINFO scripts.py:791 -- import ray\n2024-03-04 03:13:57,695\tINFO scripts.py:792 -- ray.init(address='ray://\u003chead_node_ip_address\u003e:24088')\n2024-03-04 03:13:57,695\tINFO scripts.py:801 -- To see the status of the cluster, use\n2024-03-04 03:13:57,695\tINFO scripts.py:802 -- ray status\n2024-03-04 03:13:57,695\tINFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.\n2024-03-04 03:13:57,695\tINFO scripts.py:820 -- To terminate the Ray runtime, run\n2024-03-04 03:13:57,695\tINFO scripts.py:821 -- ray stop\n\n2024-03-04 03:13:57,854|alice|INFO|secretflow|entry.py:start_ray:77| Succeeded to start ray head node at snha-wocixsij-node-3-0-global.alice.svc.\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.8/runpy.py\", line 194, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/local/lib/python3.8/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 294, in \u003cmodule\u003e\n main()\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1157, in call\n return self.main(args, kwargs)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, ctx.params)\n File \"/usr/local/lib/python3.8/site-packages/click/core.py\", line 783, in invoke\n return __callback(args, kwargs)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 261, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File \"/usr/local/lib/python3.8/site-packages/secretflow/kuscia/entry.py\", line 92, in 
preprocess_sf_node_eval_param\n comp_def = get_comp_def(param.domain, param.name, param.version)\n File \"/usr/local/lib/python3.8/site-packages/secretflow/component/entry.py\", line 104, in get_comp_def\n assert key in COMP_MAP\nAssertionError\n","reason":"Error","startedAt":"2024-03-04T03:13:53Z"}}}]}}
The bob error log:
[root@root-kuscia-lite-bob kuscia]# cat /home/kuscia/var/logs/kuscia.log | grep -i error
2024-03-02 15:26:55.797 ERROR controller/handshake.go:792 Handshake to master fail, return error:invalid public key, key is empty
2024-03-02 15:27:40.843 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http
2024-03-02 18:17:08.242 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http
2024-03-04 09:32:18.871 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http
2024-03-04 09:59:42.637 ERROR xds/xds.go:448 unknown cluster: bob-to-alice-http
2024-03-04 11:14:00.905 INFO status/status_manager.go:625 Patch status for pod "snha-wocixsij-node-3-0_bob(34ae7669-f978-42af-99a4-c0b15b1d9678)", patch={"metadata":{"uid":"34ae7669-f978-42af-99a4-c0b15b1d9678"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T03:14:00Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T03:14:00Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://dc21842c202f1d40f516d59e6a1c2e3f730c2ea46a51686e55481c13c109172a","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120","imageID":"sha256:f1c20d8cb5c4c69d3997527e4912e794ba3cd7fa26bfaf6afa1383697c80ea9a","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://dc21842c202f1d40f516d59e6a1c2e3f730c2ea46a51686e55481c13c109172a","exitCode":15,"finishedAt":"2024-03-04T03:13:59Z","message":"WARNING:root:Since the GPL-licensed package unidecode
is not installed, using Python's unicodedata
package which yields worse results.\n2024-03-04 03:13:57,245|bob|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='snha-wocixsij-node-3-0-global.bob.svc', ray_node_manager_port=21394, ray_object_manager_port=21395, ray_client_server_port=21390, ray_worker_ports=[], ray_gcs_port=21393)\n2024-03-04 03:13:57,246|bob|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at snha-wocixsij-node-3-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=40 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=snha-wocixsij-node-3-0-global.bob.svc --port=21393 --node-manager-port=21394 --object-manager-port=21395 --ray-client-server-port=21390\n","reason":"Error","startedAt":"2024-03-04T03:13:53Z"}}}],"phase":"Failed","podIP":null,"podIPs":null}}
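The "Patch status for pod" entries above carry the engine's stderr as an escaped string inside a JSON patch, so the Python traceback is hard to read in place. A rough Python sketch for pulling it out is shown below. It assumes the JSON patch is the last thing on the log line and follows the `status.containerStatuses[].state.terminated` shape visible in the excerpt; `terminated_messages` and the shortened sample line are illustrative only.

```python
import json

def terminated_messages(log_line):
    """Return (container name, exit code, message) for each terminated container.

    Hypothetical helper: assumes everything after "patch=" is one JSON object.
    """
    marker = "patch="
    start = log_line.find(marker)
    if start == -1:
        return []
    patch = json.loads(log_line[start + len(marker):])
    results = []
    for status in patch.get("status", {}).get("containerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated:
            results.append(
                (status.get("name"), terminated.get("exitCode"), terminated.get("message", ""))
            )
    return results

# Shortened, hand-written stand-in for one "Patch status for pod" line.
sample_line = (
    '2024-03-04 11:13:58.990 INFO status/status_manager.go:625 Patch status for pod "example", '
    'patch={"status":{"containerStatuses":[{"name":"secretflow","state":{"terminated":'
    '{"exitCode":1,"message":"Traceback (most recent call last):\\n  ...\\nAssertionError\\n",'
    '"reason":"Error"}}}]}}'
)

for name, exit_code, message in terminated_messages(sample_line):
    print(f"container={name} exit_code={exit_code}")
    print(message)  # the JSON decoder has already turned \n into real newlines
```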
The test job at the end of the deployment documentation runs successfully.
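The error messages show that the connection failed because of a problem with bob's certificate. Please check the certificate; you can apply for it again.
Which step exactly is that? And how do I solve the SecretPad problem mentioned above? I have redeployed several times and still hit the same problem, even though I followed the documented procedure strictly and saw no errors along the way.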
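OK. The authorization between nodes can be checked with the command kubectl get cdr. If the last column, READY, shows True for every entry in the list, the authorization is fine.
Also, the sample Job runs successfully on your side, but the Job dispatched from the platform fails, is that right? If so, please go to the component engine's log directory and paste the complete log. See: https://www.secretflow.org.cn/zh-CN/docs/kuscia/main/deployment/logdescription
Example path: /home/kuscia/var/stdout/pods/alice_xxxx/xxx/*.log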
The authorization is normal:
alice-kuscia-system alice kuscia-system Token True
bob-kuscia-system bob kuscia-system Token True
alice-bob alice bob 192.168.50.100 Token True
bob-alice bob alice 192.168.50.158 Token True
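The READY check on the kubectl get cdr output can also be done mechanically. The following is a small Python sketch, assuming whitespace-separated rows whose last field is the READY value and an optional header row starting with NAME; `not_ready_routes` is a made-up name.

```python
def not_ready_routes(kubectl_output):
    """Return the names of routes whose last column (READY) is not "True"."""
    failing = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and a header row, if one is present (assumption).
        if not fields or fields[0] == "NAME":
            continue
        if fields[-1] != "True":
            failing.append(fields[0])
    return failing

# Rows as pasted in this thread (no header row was included).
rows = """\
alice-kuscia-system alice kuscia-system Token True
bob-kuscia-system bob kuscia-system Token True
alice-bob alice bob 192.168.50.100 Token True
bob-alice bob alice 192.168.50.158 Token True
"""

print(not_ready_routes(rows))  # an empty list means every route is ready
```

Reading the last field instead of a fixed column index tolerates rows with and without a host address, as in the listing above.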
The logs are as follows. alice:
2024-03-04T14:16:45.333370086+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode
is not installed, using Python's unicodedata
package which yields worse results.
2024-03-04T14:16:45.6195683+08:00 stdout F 2024-03-04 06:16:45,619|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='kcnp-uhkwkbhr-node-3-0-global.alice.svc', ray_node_manager_port=29827, ray_object_manager_port=29828, ray_client_server_port=29829, ray_worker_ports=[], ray_gcs_port=29826)
2024-03-04T14:16:45.619598292+08:00 stdout F 2024-03-04 06:16:45,619|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at kcnp-uhkwkbhr-node-3-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=kcnp-uhkwkbhr-node-3-0-global.alice.svc --port=29826 --node-manager-port=29827 --object-manager-port=29828 --ray-client-server-port=29829
2024-03-04T14:16:48.084653285+08:00 stdout F 2024-03-04 06:16:48,084|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 06:16:46,156 INFO usage_lib.py:490 -- Usage stats collection is disabled.
2024-03-04T14:16:48.084684076+08:00 stdout F 2024-03-04 06:16:46,156 INFO scripts.py:702 -- Local node IP: kcnp-uhkwkbhr-node-3-0-global.alice.svc
2024-03-04T14:16:48.084688078+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:739 -- --------------------
2024-03-04T14:16:48.084691071+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:740 -- Ray runtime started.
2024-03-04T14:16:48.08469391+08:00 stdout F 2024-03-04 06:16:47,926 SUCC scripts.py:741 -- --------------------
2024-03-04T14:16:48.084696986+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:743 -- Next steps
2024-03-04T14:16:48.084699972+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2024-03-04T14:16:48.084703229+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:747 -- ray start --address='kcnp-uhkwkbhr-node-3-0-global.alice.svc:29826'
2024-03-04T14:16:48.084706877+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:763 -- Alternatively, use the following Python code:
2024-03-04T14:16:48.084709854+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:765 -- import ray
2024-03-04T14:16:48.084713106+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='kcnp-uhkwkbhr-node-3-0-global.alice.svc')
2024-03-04T14:16:48.084715932+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2024-03-04T14:16:48.084718765+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2024-03-04T14:16:48.084721641+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:789 -- Python code:
2024-03-04T14:16:48.084724476+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:791 -- import ray
2024-03-04T14:16:48.084727489+08:00 stdout F 2024-03-04 06:16:47,926 INFO scripts.py:792 -- ray.init(address='ray://unidecode
is not installed, using Python's unicodedata
package which yields worse results.
2024-03-04T14:16:45.674617021+08:00 stdout F 2024-03-04 06:16:45,674|bob|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='kcnp-uhkwkbhr-node-3-0-global.bob.svc', ray_node_manager_port=26683, ray_object_manager_port=26684, ray_client_server_port=26685, ray_worker_ports=[], ray_gcs_port=26682)
2024-03-04T14:16:45.674642408+08:00 stdout F 2024-03-04 06:16:45,674|bob|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at kcnp-uhkwkbhr-node-3-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=40 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=kcnp-uhkwkbhr-node-3-0-global.bob.svc --port=26682 --node-manager-port=26683 --object-manager-port=26684 --ray-client-server-port=26685
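In addition, I found that although the two sample jobs started from the command line have the status Succeeded, their logs also contain errors:
NAME STARTTIME COMPLETIONTIME LASTRECONCILETIME PHASE
secretflow-task-20240304135549 25m 25m 25m Succeeded
secretflow-task-20240304140911 12m 12m 12m Succeeded
The alice log follows; there are no errors on the bob side.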
2024-03-04T14:09:16.996436339+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode
is not installed, using Python's unicodedata
package which yields worse results.
2024-03-04T14:09:17.284753351+08:00 stdout F 2024-03-04 06:09:17,284|alice|INFO|secretflow|entry.py:start_ray:55| ray_conf: RayConfig(ray_node_ip_address='secretflow-task-20240304140911-single-psi-0-global.alice.svc', ray_node_manager_port=30728, ray_object_manager_port=30729, ray_client_server_port=30730, ray_worker_ports=[], ray_gcs_port=30727)
2024-03-04T14:09:17.284773787+08:00 stdout F 2024-03-04 06:09:17,284|alice|INFO|secretflow|entry.py:start_ray:59| Trying to start ray head node at secretflow-task-20240304140911-single-psi-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=24 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=secretflow-task-20240304140911-single-psi-0-global.alice.svc --port=30727 --node-manager-port=30728 --object-manager-port=30729 --ray-client-server-port=30730
2024-03-04T14:09:19.748214493+08:00 stdout F 2024-03-04 06:09:19,747|alice|INFO|secretflow|entry.py:start_ray:76| 2024-03-04 06:09:17,823 INFO usage_lib.py:490 -- Usage stats collection is disabled.
2024-03-04T14:09:19.748230881+08:00 stdout F 2024-03-04 06:09:17,823 INFO scripts.py:702 -- Local node IP: secretflow-task-20240304140911-single-psi-0-global.alice.svc
2024-03-04T14:09:19.748277722+08:00 stdout F 2024-03-04 06:09:19,590 SUCC scripts.py:739 -- --------------------
2024-03-04T14:09:19.748281432+08:00 stdout F 2024-03-04 06:09:19,590 SUCC scripts.py:740 -- Ray runtime started.
2024-03-04T14:09:19.748284355+08:00 stdout F 2024-03-04 06:09:19,591 SUCC scripts.py:741 -- --------------------
2024-03-04T14:09:19.748290291+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:743 -- Next steps
2024-03-04T14:09:19.748293478+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2024-03-04T14:09:19.748296645+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:747 -- ray start --address='secretflow-task-20240304140911-single-psi-0-global.alice.svc:30727'
2024-03-04T14:09:19.748299942+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:763 -- Alternatively, use the following Python code:
2024-03-04T14:09:19.748303057+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:765 -- import ray
2024-03-04T14:09:19.748306008+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='secretflow-task-20240304140911-single-psi-0-global.alice.svc')
2024-03-04T14:09:19.748310277+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2024-03-04T14:09:19.748313127+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2024-03-04T14:09:19.748316011+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:789 -- Python code:
2024-03-04T14:09:19.748321392+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:791 -- import ray
2024-03-04T14:09:19.748324395+08:00 stdout F 2024-03-04 06:09:19,591 INFO scripts.py:792 -- ray.init(address='ray://
@linushio ,
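The error logs above show that a component required by the task dispatched from the platform cannot be found in the secretflow engine image. This is most likely caused by a version mismatch.
Did you follow this document to deploy the platform? I recommend deploying Kuscia with the Kuscia version that corresponds to the platform image: https://www.secretflow.org.cn/zh-CN/docs/secretpad/latest/zgnd8oqo5chsqhzm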
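The bare AssertionError in the traceback comes from `assert key in COMP_MAP` inside `get_comp_def(param.domain, param.name, param.version)`: the engine looks up the requested component and fails when that exact component is not registered. The Python sketch below only illustrates this failure mode. The registry contents, the tuple key layout, and the error message are assumptions for illustration and are not taken from secretflow's source.

```python
# Hypothetical registry: the real COMP_MAP is built inside secretflow and its
# contents depend on the engine version. These entries are made up.
COMP_MAP = {
    ("example_domain", "example_psi", "0.0.1"): "definition of example_psi 0.0.1",
}

def get_comp_def(domain, name, version):
    """Look up a component; raise a descriptive error instead of a bare assert."""
    key = (domain, name, version)  # key layout is an assumption
    if key not in COMP_MAP:
        known = ", ".join("/".join(k) for k in sorted(COMP_MAP))
        raise LookupError(
            f"component {domain}/{name}:{version} is not registered in this engine image; "
            f"known components: {known}"
        )
    return COMP_MAP[key]

# A version the engine knows about resolves normally.
print(get_comp_def("example_domain", "example_psi", "0.0.1"))

# A version requested by a newer platform but absent from the engine fails.
try:
    get_comp_def("example_domain", "example_psi", "0.0.2")
except LookupError as exc:
    print(exc)
```

With a message like this, a platform and engine mismatch would show up directly in the pod log rather than as an unexplained AssertionError.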
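I used this document: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.5.0b0/deployment/deploy_master_lite_cn#id5
@linushio anyway, we have now narrowed this down to a version compatibility problem. I suggest keeping Kuscia at its current version for now and updating SecretPad to the latest version. (Your latest image was the newest when you downloaded it; by now it needs to be updated.)
Kuscia is 0.5.0b0 and SecretPad is the latest.
Please check the output of both docker images and docker ps.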
107030ae40e0 secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad:latest "/bin/sh -c 'java ${…" 2 hours ago Up 2 hours 80/tcp, 9001/tcp, 0.0.0.0:8088->8080/tcp, :::8088->8080/tcp root-kuscia-secretpad
6efa3d5e8f4a secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0 "tini -- bin/kuscia …" 2 hours ago Up 2 hours 0.0.0.0:18080->1080/tcp, :::18080->1080/tcp, 0.0.0.0:18082->8082/tcp, :::18082->8082/tcp, 0.0.0.0:13083->8083/tcp, :::13083->8083/tcp root-kuscia-master
Please try updating Kuscia to the latest version.
In the master container, you can check the sf image version with this command: kubectl get appimage secretflow-image -o yaml | grep image -A 3
I did update the Kuscia version, but an exception occurs at the data table step:
docker exec -it ${USER}-kuscia-lite-alice curl https://127.0.0.1:8070/api/v1/datamesh/domaindatagrant/create -X POST -H 'content-type: application/json' -d '{"author":"alice","domaindata_id":"alice-table","grant_domain":"bob"}' --cacert var/certs/ca.crt --cert var/certs/ca.crt --key var/certs/ca.key
Sorry, -A 3 above prints too little. Please use -A 10 instead: kubectl get appimage secretflow-image -o yaml | grep image -A 10
Is the Kuscia image you are using now secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest?
{"apiVersion":"kuscia.secretflow/v1alpha1","kind":"AppImage","metadata":{"annotations":{},"name":"secretflow-image"},"spec":{"configTemplates":{"task-config.conf":"{\n \"task_id\": \"{{.TASK_ID}}\",\n \"task_input_config\": \"{{.TASK_INPUT_CONFIG}}\",\n \"task_cluster_def\": \"{{.TASK_CLUSTER_DEFINE}}\",\n \"allocated_ports\": \"{{.ALLOCATED_PORTS}}\"\n}\n"},"deployTemplates":[{"name":"secretflow","replicas":1,"spec":{"containers":[{"args":["-c","python -m secretflow.kuscia.entry ./kuscia/task-config.conf"],"command":["sh"],"configVolumeMounts":[{"mountPath":"/root/kuscia/task-config.conf","subPath":"task-config.conf"}],"name":"secretflow","ports":[{"name":"spu","port":20000,"protocol":"GRPC","scope":"Cluster"},{"name":"fed","port":20001,"protocol":"GRPC","scope":"Cluster"},{"name":"global","port":20002,"protocol":"GRPC","scope":"Domain"},{"name":"node-manager","port":20003,"protocol":"GRPC","scope":"Local"},{"name":"object-manager","port":20004,"protocol":"GRPC","scope":"Local"},{"name":"client-server","port":20005,"protocol":"GRPC","scope":"Local"}],"workingDir":"/root"}],"restartPolicy":"Never"}}],"image":{"id":"abc","name":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8","sign":"abc","tag":"1.3.0.dev20231120"}}}
image:
  id: abc
  name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
  sign: abc
  tag: 1.3.0.dev20231120
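For reference, here is a small sketch that pulls the image name and tag out of that YAML instead of reading the grep output by eye. The function name `appimage_ref` is made up for illustration, and it assumes the `image:` block has the layout shown above (indented `name:` and `tag:` keys under a bare `image:` key).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read AppImage YAML on stdin and print "<name>:<tag>".
# Assumes a bare "image:" key followed by indented "name:" and "tag:" keys.
appimage_ref() {
  awk '
    /^[ \t]*image:[ \t]*$/      { in_image = 1; next }  # start of the image block
    in_image && /^[ \t]*name:/  { name = $2 }           # image repository
    in_image && /^[ \t]*tag:/   { tag = $2 }            # image tag
    END { if (name != "" && tag != "") print name ":" tag }
  '
}
```

Inside the master container this could be used as `kubectl get appimage secretflow-image -o yaml | appimage_ref`.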
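I am currently using kuscia:0.5.0b0 and secretpad:latest.
The built-in sf version is too old. The latest secretpad is compatible with sf version secretflow/secretflow-lite-anolis8:1.4.0.dev24011601, so please update the sf version.
So I just need to update this part of the deploy.sh script? I will redeploy shortly and see.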
I am currently using kuscia:0.5.0b0 and secretpad:latest.
The problem is that the secretflow version bundled with kuscia:0.5.0b0, secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0.dev20231120, does not match the latest version of the secretpad platform, so secretflow needs to be upgraded.
There are two approaches that may work:
- Use the latest kuscia image, secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest
- Upgrade secretflow to secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0
Let's try upgrading the secretflow version first:
- docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0
- docker save secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0 -o sf-130b0.tar
- docker cp sf-130b0.tar $USER-kuscia-lite-alice:/home/kuscia
- docker cp sf-130b0.tar $USER-kuscia-lite-bob:/home/kuscia
- docker exec $USER-kuscia-lite-alice ctr -a=/home/kuscia/containerd/run/containerd.sock -n=k8s.io images import sf-130b0.tar
- docker exec $USER-kuscia-lite-bob ctr -a=/home/kuscia/containerd/run/containerd.sock -n=k8s.io images import sf-130b0.tar
- Log in to the master container and change the sf image version in the appimage to 1.3.0b0: kubectl edit appimage secretflow-image
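The import steps above can be sketched as one loop over both lite nodes. This is only an illustration: the function names `run` and `import_sf_image` and the `DRY_RUN` switch are made up here, while the image, tar name, container names and `ctr` arguments are copied from the list. With `DRY_RUN=1` (the default in this sketch) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of the manual import steps for the lite nodes.
# DRY_RUN=1 prints each command; set DRY_RUN=0 to actually execute them.
SF_IMAGE="${SF_IMAGE:-secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0}"
SF_TAR="${SF_TAR:-sf-130b0.tar}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: import_sf_image alice bob
import_sf_image() {
  local domain node
  run docker pull "$SF_IMAGE"
  run docker save "$SF_IMAGE" -o "$SF_TAR"
  for domain in "$@"; do
    node="${USER}-kuscia-lite-${domain}"
    run docker cp "$SF_TAR" "${node}:/home/kuscia"
    run docker exec "$node" ctr -a=/home/kuscia/containerd/run/containerd.sock -n=k8s.io images import "$SF_TAR"
  done
}
```

The last step, `kubectl edit appimage secretflow-image` in the master container, is interactive and is left out of the sketch.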
After switching to 1.3.0b0, the local jobs all failed. I will try latest now.
Try this image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
The latest secretpad version matches kuscia: secretflow/kuscia:0.5.0b0 and secretflow: secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
After the change, the local test job fails: 2024-03-04 17:36:16.796 INFO status/status_manager.go:625 Patch status for pod "secretflow-task-20240304173606-single-psi-0_alice(a682e318-d21c-4646-a3a8-a405ff18fbe3)", patch={"metadata":{"uid":"a682e318-d21c-4646-a3a8-a405ff18fbe3"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T09:36:16Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T09:36:16Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://0ee0029ee57045ecd18d1e364fda690d685f8f7406abdf980518450cc36fe28b","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601","imageID":"sha256:86495c6fde3238e2c17702340900ddf429b5dd76b79976fffab82d56de38efea","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://0ee0029ee57045ecd18d1e364fda690d685f8f7406abdf980518450cc36fe28b","exitCode":1,"finishedAt":"2024-03-04T09:36:15Z","message":" name: \"min_frequency\"\n desc: \"Specifies the minimum frequency below which a category will be considered infrequent, [0, 1), 0 disable\"\n type: AT_FLOAT\n atomic {\n is_optional: true\n default_value {\n }\n lower_bound_enabled: true\n lower_bound {\n }\n lower_bound_inclusive: true\n upper_bound_enabled: true\n upper_bound {\n f: 1.0\n }\n }\n }\n attrs {\n name: \"report_rules\"\n desc: \"Whether to report rule details\"\n type: AT_BOOL\n atomic {\n is_optional: true\n default_value {\n b: true\n }\n }\n }\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"features\"\n desc: \"Features to encode.\"\n }\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n outputs {\n name: 
\"out_rules\"\n desc: \"onehot rule\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"report\"\n desc: \"report rules details if report_rules is true\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"substitution\"\n desc: \"unified substitution component\"\n version: \"0.0.2\"\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"input_rules\"\n desc: \"Input preprocessing rules\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"vert_bin_substitution\"\n desc: \"Substitute datasets\' value by bin substitution rules.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Vertical partitioning dataset to be substituted.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"bin_rule\"\n desc: \"Input bin substitution rule.\"\n types: \"sf.rule.binning\"\n }\n outputs {\n name: \"output_data\"\n desc: \"Output vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"groupby_statistics\"\n desc: \"Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.\"\n version: \"0.0.3\"\n attrs {\n name: \"aggregation_config\"\n desc: \"input groupby aggregation config\"\n type: AT_CUSTOM_PROTOBUF\n custom_protobuf_cls: \"groupby_aggregation_config_pb2.GroupbyAggregationConfig\"\n }\n attrs {\n name: \"max_group_size\"\n desc: \"The maximum number of groups allowed\"\n type: AT_INT\n atomic {\n is_optional: true\n default_value {\n i64: 10000\n }\n lower_bound_enabled: true\n lower_bound {\n }\n upper_bound_enabled: true\n upper_bound {\n i64: 10001\n }\n }\n }\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n attrs {\n name: 
\"by\"\n desc: \"by what columns should we group the values\"\n col_min_cnt_inclusive: 1\n col_max_cnt_inclusive: 4\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output groupby statistics report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_pearsonr\"\n desc: \"Calculate Pearson\'s product-moment correlation coefficient for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate correlation coefficient with. If empty, all features will be used\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Pearson\'s product-moment correlation coefficient report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_vif\"\n desc: \"Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate VIF with. If empty, all features will be used.\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Variance Inflation Factor(VIF) report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"table_statistics\"\n desc: \"Get a table of statistics,\nincluding each column\'s\n1. datatype\n2. total_count\n3. count\n4. count_na\n5. na_ratio\n6. min\n7. max\n8. mean\n9. var\n10. std\n11. sem\n12. skewness\n13. kurtosis\n14. q1\n15. q2\n16. q3\n17. moment_2\n18. moment_3\n19. moment_4\n20. 
central_moment_2\n21. central_moment_3\n22. central_moment_4\n23. sum\n24. sum_2\n25. sum_3\n26. sum_4\n- moment_2 means E[X^2].\n- central_moment_2 means E[(X - mean(X))^2].\n- sum_2 means sum(X^2).\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n }\n outputs {\n name: \"report\"\n desc: \"Output table statistics report.\"\n types: \"sf.report\"\n }\n}\n\n","reason":"Error","startedAt":"2024-03-04T09:36:09Z"}}}]}}
I noticed that deploy.sh also has two places with the kuscia image address. Do they need to be changed to 0.5.0b0 as well?
if [[ ${KUSCIA_IMAGE} == "" ]]; then
  KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:latest
fi
log "KUSCIA_IMAGE=${KUSCIA_IMAGE}"
if [[ "$SECRETFLOW_IMAGE" == "" ]]; then
  SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
fi
log "SECRETFLOW_IMAGE=${SECRETFLOW_IMAGE}"
SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow"
Yes, the kuscia version is wrong. It needs to be changed to 0.5.0b0:
KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0
SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
Deploying the latest version by following the documentation raises a few problems:
Does SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow" need to be changed too?
SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow" does not need to be changed.
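Since the deploy.sh snippet quoted in this thread only fills in `KUSCIA_IMAGE` and `SECRETFLOW_IMAGE` when they are empty, exporting them before running the script should have the same effect as editing it. Below is a minimal sketch of that "default only if empty" pattern. The function name `resolve_images` is made up for illustration; the variable names and image values are the ones given in this thread.

```shell
#!/usr/bin/env bash
# Mirrors the empty-check defaults quoted from deploy.sh, using ${VAR:-default}.
SF_IMAGE_REGISTRY="secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow"

# Hypothetical helper: keep values set by the caller, fall back to the
# versions recommended in this thread otherwise, then print both.
resolve_images() {
  KUSCIA_IMAGE="${KUSCIA_IMAGE:-${SF_IMAGE_REGISTRY}/kuscia:0.5.0b0}"
  SECRETFLOW_IMAGE="${SECRETFLOW_IMAGE:-${SF_IMAGE_REGISTRY}/secretflow-lite-anolis8:1.4.0.dev24011601}"
  echo "KUSCIA_IMAGE=${KUSCIA_IMAGE}"
  echo "SECRETFLOW_IMAGE=${SECRETFLOW_IMAGE}"
}
```

With nothing exported, `resolve_images` prints the two pinned images; running `export KUSCIA_IMAGE=...` beforehand overrides only that one.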
The local test job still fails: 2024-03-04 17:49:23.283 INFO status/status_manager.go:625 Patch status for pod "secretflow-task-20240304174913-single-psi-0_alice(48587eff-432b-472e-b5e2-af04eded6fb6)", patch={"metadata":{"uid":"48587eff-432b-472e-b5e2-af04eded6fb6"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-03-04T09:49:23Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-03-04T09:49:23Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://56814abe86a308117f76fdbf28ae9abf975f80b719b700f11a39ecdeb6d31035","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601","imageID":"sha256:86495c6fde3238e2c17702340900ddf429b5dd76b79976fffab82d56de38efea","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://56814abe86a308117f76fdbf28ae9abf975f80b719b700f11a39ecdeb6d31035","exitCode":1,"finishedAt":"2024-03-04T09:49:22Z","message":" name: \"min_frequency\"\n desc: \"Specifies the minimum frequency below which a category will be considered infrequent, [0, 1), 0 disable\"\n type: AT_FLOAT\n atomic {\n is_optional: true\n default_value {\n }\n lower_bound_enabled: true\n lower_bound {\n }\n lower_bound_inclusive: true\n upper_bound_enabled: true\n upper_bound {\n f: 1.0\n }\n }\n }\n attrs {\n name: \"report_rules\"\n desc: \"Whether to report rule details\"\n type: AT_BOOL\n atomic {\n is_optional: true\n default_value {\n b: true\n }\n }\n }\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"features\"\n desc: \"Features to encode.\"\n }\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n outputs {\n name: 
\"out_rules\"\n desc: \"onehot rule\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"report\"\n desc: \"report rules details if report_rules is true\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"substitution\"\n desc: \"unified substitution component\"\n version: \"0.0.2\"\n inputs {\n name: \"input_dataset\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"input_rules\"\n desc: \"Input preprocessing rules\"\n types: \"sf.rule.preprocessing\"\n }\n outputs {\n name: \"output_dataset\"\n desc: \"output_dataset\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"preprocessing\"\n name: \"vert_bin_substitution\"\n desc: \"Substitute datasets\' value by bin substitution rules.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Vertical partitioning dataset to be substituted.\"\n types: \"sf.table.vertical_table\"\n }\n inputs {\n name: \"bin_rule\"\n desc: \"Input bin substitution rule.\"\n types: \"sf.rule.binning\"\n }\n outputs {\n name: \"output_data\"\n desc: \"Output vertical table.\"\n types: \"sf.table.vertical_table\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"groupby_statistics\"\n desc: \"Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.\"\n version: \"0.0.3\"\n attrs {\n name: \"aggregation_config\"\n desc: \"input groupby aggregation config\"\n type: AT_CUSTOM_PROTOBUF\n custom_protobuf_cls: \"groupby_aggregation_config_pb2.GroupbyAggregationConfig\"\n }\n attrs {\n name: \"max_group_size\"\n desc: \"The maximum number of groups allowed\"\n type: AT_INT\n atomic {\n is_optional: true\n default_value {\n i64: 10000\n }\n lower_bound_enabled: true\n lower_bound {\n }\n upper_bound_enabled: true\n upper_bound {\n i64: 10001\n }\n }\n }\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n attrs {\n name: 
\"by\"\n desc: \"by what columns should we group the values\"\n col_min_cnt_inclusive: 1\n col_max_cnt_inclusive: 4\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output groupby statistics report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_pearsonr\"\n desc: \"Calculate Pearson\'s product-moment correlation coefficient for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate correlation coefficient with. If empty, all features will be used\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Pearson\'s product-moment correlation coefficient report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"ss_vif\"\n desc: \"Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples \u0026 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device.\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input vertical table.\"\n types: \"sf.table.vertical_table\"\n attrs {\n name: \"feature_selects\"\n desc: \"Specify which features to calculate VIF with. If empty, all features will be used.\"\n }\n }\n outputs {\n name: \"report\"\n desc: \"Output Variance Inflation Factor(VIF) report.\"\n types: \"sf.report\"\n }\n}\ncomps {\n domain: \"stats\"\n name: \"table_statistics\"\n desc: \"Get a table of statistics,\nincluding each column\'s\n1. datatype\n2. total_count\n3. count\n4. count_na\n5. na_ratio\n6. min\n7. max\n8. mean\n9. var\n10. std\n11. sem\n12. skewness\n13. kurtosis\n14. q1\n15. q2\n16. q3\n17. moment_2\n18. moment_3\n19. moment_4\n20. 
central_moment_2\n21. central_moment_3\n22. central_moment_4\n23. sum\n24. sum_2\n25. sum_3\n26. sum_4\n- moment_2 means E[X^2].\n- central_moment_2 means E[(X - mean(X))^2].\n- sum_2 means sum(X^2).\"\n version: \"0.0.1\"\n inputs {\n name: \"input_data\"\n desc: \"Input table.\"\n types: \"sf.table.vertical_table\"\n types: \"sf.table.individual\"\n }\n outputs {\n name: \"report\"\n desc: \"Output table statistics report.\"\n types: \"sf.report\"\n }\n}\n\n","reason":"Error","startedAt":"2024-03-04T09:49:17Z"}}}]}}
if [[ ${KUSCIA_IMAGE} == "" ]]; then
  KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0
fi
log "KUSCIA_IMAGE=${KUSCIA_IMAGE}"
if [[ "$SECRETFLOW_IMAGE" == "" ]]; then
  SECRETFLOW_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.4.0.dev24011601
fi
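Was the failing job above dispatched from the platform?
Yes: docker exec -it ${USER}-kuscia-master scripts/user/create_example_job.sh
So far I have tried many versions. Only changing kuscia to 0.5.0b0 and leaving everything else untouched lets this job run, but running PSI on the secretpad platform still fails. The only change is at this step: export KUSCIA_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.5.0b0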
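Because the sf version was changed, the example bundled with kuscia may fail, so tasks need to be dispatched from the platform instead.
Thanks! secretpad now runs successfully.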
# copy kuscia api lite:alice client certs
copy_kuscia_api_lite_client_certs ${ALICE_DOMAIN} ${volume_path}
# copy kuscia api lite:bob client certs
copy_kuscia_api_lite_client_certs ${BOB_DOMAIN} ${volume_path}
The certificates are obtained from the current server:
function copy_kuscia_api_lite_client_certs() {
  local domain_id=$1
  local volume_path=$2
  local IMAGE=$SECRETPAD_IMAGE
  local domain_ctr=${CTR_PREFIX}-lite-${domain_id}
  # generate client certs
  docker exec -it ${domain_ctr} sh scripts/deploy/init_kusciaapi_client_certs.sh
  # copy result
  tmp_path=${volume_path}/temps/certs/${domain_id}
  mkdir -p ${tmp_path}
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/ca.crt ${tmp_path}/ca.crt
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/kusciaapi-client.crt ${tmp_path}/client.crt
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/kusciaapi-client.key ${tmp_path}/client.pem
  docker cp ${domain_ctr}:/${CTR_CERT_ROOT}/token ${tmp_path}/token
  docker run -d --rm --name ${CTR_PREFIX}-dummy --volume=${volume_path}/secretpad/config/certs:/tmp/temp $IMAGE tail -f /dev/null >/dev/null 2>&1
  docker cp -a ${tmp_path} ${CTR_PREFIX}-dummy:/tmp/temp/
  docker rm -f ${CTR_PREFIX}-dummy >/dev/null 2>&1
  rm -rf ${volume_path}/temp
  log "copy kuscia api client lite :${domain_id} certs to web server container done"
}
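When the certificates come from another server, it may help to confirm that the four files this function writes are actually present before starting secretpad. The sketch below is not part of deploy.sh: the function name `check_client_certs` is made up, and the directory to check (presumably `${volume_path}/secretpad/config/certs/<domain_id>`, judging from the volume mount above) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical check: confirm the files written by the copy step exist and are non-empty.
# Usage: check_client_certs <cert_dir>
check_client_certs() {
  local dir=$1 f missing=0
  for f in ca.crt client.crt client.pem token; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${dir}/${f}" >&2
      missing=1
    fi
  done
  # Returns 0 when all four files are present, 1 otherwise.
  return "$missing"
}
```

For example, `check_client_certs "${volume_path}/secretpad/config/certs/alice" || echo "alice certs incomplete"`.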