secretflow / kuscia

Kuscia(Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

KKRT/RR22 protocols fail when deployed with Docker #366

Open coderSun20201112 opened 4 months ago

coderSun20201112 commented 4 months ago

Issue Type

Others

Search for existing issues similar to yours

Yes

Kuscia Version

kuscia 0.5.0

Link to Relevant Documentation

No response

Question Details

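I deployed Kuscia from the Docker images and set up a P2P network: one node is the People's Bank (renhang) and the other is a commercial bank (shanghang). The task is a PSI (private set intersection). In testing, the ECDH protocol runs without problems, but KKRT and RR22 fail every time, so I am looking for a solution. The logs are below: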

RR22 log output:
2024-07-05T10:54:04.02101077+08:00 stdout F [2024-07-05 10:54:04.020] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[43b39692bdf27cbe];[content-length]:[95];[kuscia-error-message]:[Domain shanghang.root-kuscia-autonomy-shanghang<--Domain renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[43b39692bdf27cbe];[x-envoy-upstream-service-time]:[74];[date]:[Fri, 05 Jul 2024 02:54:03 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection termination'
2024-07-05T10:54:05.023777727+08:00 stdout F [2024-07-05 10:54:05.023] [info] [channel.cc:352] send request failed and retry, retry_count=2, max_retry=3, interval_ms=3000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[d140d7eeeca679af];[content-length]:[145];[kuscia-error-message]:[Domain shanghang.root-kuscia-autonomy-shanghang<--Domain renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];[x-accel-buffering]:[no];[x-b3-spanid]:[d140d7eeeca679af];[x-envoy-upstream-service-time]:[1];[date]:[Fri, 05 Jul 2024 02:54:04 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111'
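
The two retry lines above are long, and the useful parts (retry counter, HTTP status, the `kuscia-error-message` header, and Envoy's reset reason) are easy to miss. A minimal sketch for pulling those fields out of a line follows. The regular expressions are written only against the two lines quoted here, `SAMPLE` is an abbreviated copy of the second line, and the function name is hypothetical; none of this is part of Kuscia or yacl.

```python
import re

# Patterns derived from the two retry lines quoted in this issue (assumption:
# other yacl/Envoy versions may format these messages differently).
RETRY_RE = re.compile(r"retry_count=(\d+), max_retry=(\d+), interval_ms=(\d+)")
STATUS_RE = re.compile(r"http status code '(\d+)'")
KUSCIA_MSG_RE = re.compile(r"\[kuscia-error-message\]:\[(.*?)\];")
REASON_RE = re.compile(r"reset reason: (.+?)'?$")

# Abbreviated copy of the second retry line above (elided parts shown as "...").
SAMPLE = (
    "[info] [channel.cc:352] send request failed and retry, retry_count=2, "
    "max_retry=3, interval_ms=3000, message=... http status code '503', "
    "response header '[kuscia-error-message]:[Domain "
    "shanghang.root-kuscia-autonomy-shanghang<--Domain "
    "renhang.root-kuscia-autonomy-renhang<--10.2.11.62 return http code 503.];"
    "[server]:[envoy];', error msg '[E1010]HTTP/1.1 503 Service Unavailable: "
    "upstream connect error or disconnect/reset before headers. reset reason: "
    "connection failure, transport failure reason: delayed connect error: 111'"
)


def summarize_retry_line(line: str) -> dict:
    """Return the triage-relevant fields found in one retry log line."""
    line = line.strip()
    summary = {}
    m = RETRY_RE.search(line)
    if m:
        summary["retry_count"] = int(m.group(1))
        summary["max_retry"] = int(m.group(2))
        summary["interval_ms"] = int(m.group(3))
    m = STATUS_RE.search(line)
    if m:
        summary["http_status"] = int(m.group(1))
    m = KUSCIA_MSG_RE.search(line)
    if m:
        summary["kuscia_error"] = m.group(1)
    m = REASON_RE.search(line)
    if m:
        summary["reset_reason"] = m.group(1)
    return summary


if __name__ == "__main__":
    for key, value in summarize_retry_line(SAMPLE).items():
        print(f"{key}: {value}")
```

Read this way, the first retry reports "connection termination" and the second reports a connect failure with error 111 (connection refused on Linux), which suggests the peer's listener went away between the two attempts.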

KKRT log output:
2024-07-05T09:50:32.168913916+08:00 stdout F [2024-07-05 09:50:32.168] [info] [csv_checker.cc:241] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked_duplicates
2024-07-05T09:50:37.792264335+08:00 stdout F [2024-07-05 09:50:37.792] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/685ce8dc-7bb6-4242-bcef-3d55ab800137.psi_checked
2024-07-05T09:50:39.812578523+08:00 stdout F [2024-07-05 09:50:39.812] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
2024-07-05T09:50:39.819929341+08:00 stdout F [2024-07-05 09:50:39.819] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2024-07-05T09:50:39.820391329+08:00 stdout F [2024-07-05 09:50:39.820] [info] [receiver.cc:42] [KkrtPsiReceiver::Init] end
2024-07-05T09:50:39.820403246+08:00 stdout F [2024-07-05 09:50:39.820] [info] [receiver.cc:47] [KkrtPsiReceiver::PreProcess] start
2024-07-05T09:50:39.820478147+08:00 stdout F [2024-07-05 09:50:39.820] [info] [bucket_psi.cc:515] psi protocol=2, rank=0 item_size=10000
2024-07-05T09:50:39.82048501+08:00 stdout F [2024-07-05 09:50:39.820] [info] [bucket_psi.cc:515] psi protocol=2, rank=1 item_size=10000000
2024-07-05T09:50:51.568753187+08:00 stdout F [2024-07-05 09:50:51.568] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/alice_psi.csv.
2024-07-05T09:50:51.569585197+08:00 stdout F [2024-07-05 09:50:51.569] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/alice_psi.csv.
aokaokd commented 4 months ago

Does your data contain duplicate records?

coderSun20201112 commented 4 months ago

> Does your data contain duplicate records?

OK, I will check the data and ask the business department.

coderSun20201112 commented 4 months ago

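> Does your data contain duplicate records?

I generated 1,000 new test records, 560 of which are in the intersection. The "ID card number" values of those 560 records are all distinct, and "ID card number" is the column I use as the intersection key. Even so, the task still fails.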
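
For anyone who wants to verify the "no duplicates" claim on their own input before submitting a task, here is a minimal sketch that counts repeated values in the key column of a CSV file. The KKRT log above shows the PSI library running its own `sort | uniq -d` check, so this only reproduces that idea outside the container. The function name is hypothetical, and the column name `证件号码` is taken from the `keys` field of the task config later in this thread.

```python
import csv
from collections import Counter


def duplicate_keys(lines, key_column: str) -> dict:
    """Return {key_value: count} for every key value that occurs more than once.

    `lines` is any iterable of CSV text lines (an open file works), with a
    header row that contains `key_column`.
    """
    reader = csv.DictReader(lines)
    counts = Counter(row[key_column] for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Tiny in-memory example; the values are made up for illustration.
    sample = ["证件号码,name", "A1,x", "B2,y", "A1,z"]
    print(duplicate_keys(sample, "证件号码"))
```

For a real file, pass an open handle, for example `open(path, newline="", encoding="utf-8-sig")`; the `utf-8-sig` codec is a guess that guards against a byte-order mark in front of the header.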

aokaokd commented 4 months ago

Okay. Is the failure log the same as the one above?

coderSun20201112 commented 4 months ago

> Okay. Is the failure log the same as the one above?

It is the same.

zimu-yuxi commented 4 months ago

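> > Okay. Is the failure log the same as the one above?
>
> It is the same.

Do you have more task logs? Inside the kuscia container, the logs for the failed task ID are under /home/kuscia/var/stdout/.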
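
A minimal sketch of that lookup, meant to be run inside the kuscia container. It assumes only that the task ID appears somewhere in the directory or file names below `/home/kuscia/var/stdout/`; the exact layout under that path is not described in this thread, and the function name is hypothetical.

```python
from pathlib import Path


def find_task_logs(task_id: str, stdout_root: str = "/home/kuscia/var/stdout") -> list:
    """List files under stdout_root whose relative path contains task_id."""
    root = Path(stdout_root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and task_id in str(p.relative_to(root))
    )


if __name__ == "__main__":
    # "yqxxeraj" is the task ID that appears in the logs later in this issue.
    for path in find_task_logs("yqxxeraj"):
        print(path)
```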

coderSun20201112 commented 4 months ago

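> > > Okay. Is the failure log the same as the one above?
> >
> > It is the same.
>
> Do you have more task logs? Inside the kuscia container, the logs for the failed task ID are under /home/kuscia/var/stdout/.

I ran another test with RR22. Here is the log output: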

Pod log:

2024-07-10T18:25:45.296799503+08:00 stdout F [2024-07-10 18:25:45.281] [info] [main.cc:44] SecretFlow PSI Library v0.2.0.dev240123 Copyright 2023 Ant Group Co., Ltd.
2024-07-10T18:25:45.299321156+08:00 stdout F [2024-07-10 18:25:45.299] [info] [main.cc:56] Kuscia task id: yqxxeraj
2024-07-10T18:25:45.317512483+08:00 stderr F I0710 18:25:45.317143 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=54509.
2024-07-10T18:25:45.317571852+08:00 stderr F W0710 18:25:45.317178 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-07-10T18:25:48.547728713+08:00 stderr F I0710 18:25:48.547527 26 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.182548.7/id.db and ./rpc_data/rpcz/20240710.182548.7/time.db
2024-07-10T18:25:51.363771015+08:00 stderr F [978.334] perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
2024-07-10T18:25:51.364221936+08:00 stdout F [2024-07-10 18:25:51.364] [info] [launch.cc:115] PSI config: {"protocol_config":{"protocol":"PROTOCOL_RR22","role":"ROLE_SENDER","ecdh_config":{"curve":"CURVE_FOURQ"},"kkrt_config":{"bucket_size":"1048576"},"rr22_config":{"bucket_size":"1048576"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/learn_440_1980-01-01.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/result/yqxxeraj/"},"keys":["证件号码"],"recovery_config":{"enabled":true,"folder":"/home/kuscia/var/storage/data/tmp/yqxxeraj/"},"left_side":"ROLE_RECEIVER"}
2024-07-10T18:25:51.364241907+08:00 stdout F [2024-07-10 18:25:51.364] [info] [sender.cc:35] [Rr22PsiSender::Init] start
2024-07-10T18:25:51.364248729+08:00 stdout F [2024-07-10 18:25:51.364] [info] [interface.cc:76] [AbstractPsiParty::Init] start
2024-07-10T18:25:51.364255072+08:00 stdout F [2024-07-10 18:25:51.364] [warning] [interface.cc:300] check_hash_digest turns off while recovery is enabled. check_hash_digest is modified to true for robustness.
2024-07-10T18:25:51.371614123+08:00 stdout F [2024-07-10 18:25:51.371] [info] [interface.cc:134] [AbstractPsiParty::Init][Check csv pre-process] start
2024-07-10T18:25:51.379577942+08:00 stdout F [2024-07-10 18:25:51.379] [info] [csv_checker.cc:241] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked_duplicates
2024-07-10T18:25:51.414585957+08:00 stdout F [2024-07-10 18:25:51.414] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked
2024-07-10T18:25:51.428806196+08:00 stdout F [2024-07-10 18:25:51.428] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
2024-07-10T18:25:51.433757927+08:00 stdout F [2024-07-10 18:25:51.433] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2024-07-10T18:25:51.434165661+08:00 stdout F [2024-07-10 18:25:51.433] [info] [sender.cc:40] [Rr22PsiSender::Init] end
2024-07-10T18:25:51.434179781+08:00 stdout F [2024-07-10 18:25:51.434] [info] [sender.cc:45] [Rr22PsiSender::PreProcess] start
2024-07-10T18:25:51.434198772+08:00 stdout F [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=0 item_size=1000
2024-07-10T18:25:51.434205583+08:00 stdout F [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=1 item_size=1000
2024-07-10T18:25:51.436311717+08:00 stdout F [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
2024-07-10T18:25:51.436942769+08:00 stdout F [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
2024-07-10T18:25:51.439140404+08:00 stdout F [2024-07-10 18:25:51.439] [info] [sender.cc:79] [Rr22PsiSender::PreProcess] end
2024-07-10T18:25:51.441284522+08:00 stdout F [2024-07-10 18:25:51.441] [info] [sender.cc:84] [Rr22PsiSender::Online] start
2024-07-10T18:25:51.442326142+08:00 stdout F [2024-07-10 18:25:51.441] [info] [recovery.cc:188] RecoveryManager::MarkOnlineStart ecdh_dual_masked_cnt_frompeer = 0
2024-07-10T18:25:51.442357509+08:00 stdout F [2024-07-10 18:25:51.441] [info] [recovery.cc:192] RecoveryManager::MarkOnlineStart parsed_bucket_count_frompeer = 0
2024-07-10T18:25:51.446471601+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=0, inputs_size=1000
2024-07-10T18:25:51.446489556+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=1000
2024-07-10T18:25:51.44651467+08:00 stdout F [2024-07-10 18:25:51.446] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=1000
2024-07-10T18:25:51.448829406+08:00 stdout F [2024-07-10 18:25:51.448] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
2024-07-10T18:25:51.45112501+08:00 stdout F [2024-07-10 18:25:51.450] [info] [rr22_oprf.cc:139] recv paxos seed...
2024-07-10T18:25:51.456876126+08:00 stdout F [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:145] recv paxos seed finished
2024-07-10T18:25:51.4569054+08:00 stdout F [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:176] begin vole send

Pod description:

[root@root-kuscia-autonomy-renhang kuscia]# kubectl describe pods yqxxeraj-0 --namespace=renhang
Name:             yqxxeraj-0
Namespace:        renhang
Priority:         0
Service Account:  default
Node:             root-kuscia-autonomy-renhang/172.18.0.3
Start Time:       Wed, 10 Jul 2024 18:25:42 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/initiator=renhang
                  kuscia.secretflow/interconn-protocol-type=kuscia
                  kuscia.secretflow/task-id=yqxxeraj
                  kuscia.secretflow/task-resource=yqxxeraj-d74ad9f504a9
                  kuscia.secretflow/task-resource-group=yqxxeraj
                  task.kuscia.secretflow/pod-name=yqxxeraj-0
                  task.kuscia.secretflow/pod-role=
Annotations:      kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/image-id: sha256:ae331537eb75b273358b63a7b67d7aa80c190888cb38064360db5e60b6540b15
                  kuscia.secretflow/taskresource-reserving-timestamp: 2024-07-10T18:25:42+08:00
Status:           Failed
IP:
IPs:
Controlled By:    KusciaTask/yqxxeraj
Containers:
secretflow:
Container ID:   containerd://73e2004ceeed41086797dcb848597dddd9c1713c5d71724de521330def4593c7
Image:          secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/psi-anolis8:0.2.0.dev240123
Image ID:       sha256:ae331537eb75b273358b63a7b67d7aa80c190888cb38064360db5e60b6540b15
Port:           54509/TCP
Host Port:      0/TCP
Command:        sh
Args:           -c /root/main --kuscia /etc/kuscia/task-config.conf
State:          Terminated
  Reason:       Error
  Message:      1G --stable | LC_ALL=C uniq -d > /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked_duplicates
    [2024-07-10 18:25:51.414] [info] [csv_checker.cc:271] Executing script to get hash digest: sha256sum /tmp/f4dd1be1-c6bb-4781-9feb-eb7db92270c5.psi_checked
    [2024-07-10 18:25:51.428] [info] [interface.cc:143] [AbstractPsiParty::Init][Check csv pre-process] end
    [2024-07-10 18:25:51.433] [info] [interface.cc:183] [AbstractPsiParty::Init] end
    [2024-07-10 18:25:51.433] [info] [sender.cc:40] [Rr22PsiSender::Init] end
    [2024-07-10 18:25:51.434] [info] [sender.cc:45] [Rr22PsiSender::PreProcess] start
    [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=0 item_size=1000
    [2024-07-10 18:25:51.434] [info] [bucket_psi.cc:515] psi protocol=3, rank=1 item_size=1000
    [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
    [2024-07-10 18:25:51.436] [info] [arrow_csv_batch_provider.cc:51] Reach the end of csv file /home/kuscia/var/storage/data/learn_440_1980-01-01.csv.
    [2024-07-10 18:25:51.439] [info] [sender.cc:79] [Rr22PsiSender::PreProcess] end
    [2024-07-10 18:25:51.441] [info] [sender.cc:84] [Rr22PsiSender::Online] start
    [2024-07-10 18:25:51.441] [info] [recovery.cc:188] RecoveryManager::MarkOnlineStart ecdh_dual_masked_cnt_frompeer = 0
    [2024-07-10 18:25:51.441] [info] [recovery.cc:192] RecoveryManager::MarkOnlineStart parsed_bucket_count_frompeer = 0
    [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=0, inputs_size=1000
    [2024-07-10 18:25:51.446] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=1000
    [2024-07-10 18:25:51.446] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=1000
    [2024-07-10 18:25:51.448] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
    [2024-07-10 18:25:51.450] [info] [rr22_oprf.cc:139] recv paxos seed...
    [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:145] recv paxos seed finished
    [2024-07-10 18:25:51.456] [info] [rr22_oprf.cc:176] begin vole send

  Exit Code:    132
  Started:      Wed, 10 Jul 2024 18:25:45 +0800
  Finished:     Wed, 10 Jul 2024 18:25:51 +0800
Ready:          False
Restart Count:  0
Environment:
  TASK_ID:              yqxxeraj
  TASK_CLUSTER_DEFINE:  {"parties":[{"name":"shanghang", "role":"", "services":[{"portName":"psi", "endpoints":["yqxxeraj-0-psi.shanghang.svc"]}]}, {"name":"renhang", "role":"", "services":[{"portName":"psi", "endpoints":["yqxxeraj-0-psi.renhang.svc"]}]}], "selfPartyIdx":1, "selfEndpointIdx":0}
  ALLOCATED_PORTS:      {"ports":[{"name":"psi", "port":54509, "scope":"Cluster", "protocol":"HTTP"}]}
  TASK_INPUT_CONFIG:    {
                          "sf_psi_config_map": {
                            "shanghang": {
                              "link_config": {
                                "recv_timeout_ms": "30000",
                                "http_timeout_ms": 30000
                              },
                              "psi_config": {
                                "protocol_config": {
                                  "protocol": "PROTOCOL_RR22",
                                  "role": "ROLE_RECEIVER",
                                  "ecdh_config": {
                                    "curve": "CURVE_FOURQ"
                                  },
                                  "kkrt_config": {
                                    "bucket_size": "1048576"
                                  },
                                  "rr22_config": {
                                    "bucket_size": "1048576"
                                  }
                                },
                                "input_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/learn_440_1970-01-01.csv"
                                },
                                "output_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/result/yqxxeraj/result2.csv"
                                },
                                "keys": ["证件号码"],
                                "recovery_config": {
                                  "enabled": true,
                                  "folder": "/home/kuscia/var/storage/data/tmp/yqxxeraj/"
                                },
                                "left_side": "ROLE_RECEIVER"
                              }
                            },
                            "renhang": {
                              "link_config": {
                                "recv_timeout_ms": "30000",
                                "http_timeout_ms": 30000
                              },
                              "psi_config": {
                                "protocol_config": {
                                  "protocol": "PROTOCOL_RR22",
                                  "role": "ROLE_SENDER",
                                  "ecdh_config": {
                                    "curve": "CURVE_FOURQ"
                                  },
                                  "kkrt_config": {
                                    "bucket_size": "1048576"
                                  },
                                  "rr22_config": {
                                    "bucket_size": "1048576"
                                  }
                                },
                                "input_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/learn_440_1980-01-01.csv"
                                },
                                "output_config": {
                                  "type": "IO_TYPE_FILE_CSV",
                                  "path": "/home/kuscia/var/storage/data/result/yqxxeraj/"
                                },
                                "keys": ["证件号码"],
                                "recovery_config": {
                                  "enabled": true,
                                  "folder": "/home/kuscia/var/storage/data/tmp/yqxxeraj/"
                                },
                                "left_side": "ROLE_RECEIVER"
                              }
                            }
                          }
                        }
Mounts:
  /etc/kuscia/task-config.conf from config-template (rw,path="task-config.conf")

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-template:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      yqxxeraj-configtemplate
    Optional:  false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=renhang
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From              Message
  Warning  FailedScheduling   3m16s                  kuscia-scheduler  0/1 nodes are available: failed to get task resource renhang/ for pod. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod., can not find related task resource.
  Normal   Scheduled          3m14s                  kuscia-scheduler  Successfully assigned renhang/yqxxeraj-0 to root-kuscia-autonomy-renhang
  Normal   Pulled             3m14s                  Agent             Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/psi-anolis8:0.2.0.dev240123" already present on machine
  Normal   Created            3m13s                  Agent             Created container secretflow
  Normal   Started            3m12s                  Agent             Started container secretflow
  Warning  MissingClusterDNS  3m11s (x4 over 3m15s)  Agent             pod: "yqxxeraj-0_renhang(0d848060-9ff7-42f3-9470-ce5c66fc3454)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
[root@root-kuscia-autonomy-renhang kuscia]#
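
The pod description above reports `Exit Code: 132`. By the usual shell and container-runtime convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 132 would correspond to signal 4, which is SIGILL (illegal instruction) on Linux. That would be consistent with the binary executing a CPU instruction the host does not support, although nothing in this thread confirms the cause. A small sketch for decoding such codes (the helper name is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(132))
```

The log also stops right after `begin vole send`, with no error message from the PSI library itself, which fits a process killed by a signal and not one that exited on its own.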

coderSun20201112 commented 4 months ago

I saw avx and avx2 mentioned in earlier issues. Is there a CPU requirement?

wenkesong-li commented 4 months ago

Hello, avx and avx2 require the CPU to support the AVX instruction sets.
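
One way to check this on an x86 Linux host is to look at the `flags` line of `/proc/cpuinfo`. The sketch below parses that text and reports which of the listed instruction sets are missing. The default list `("avx", "avx2")` comes only from the names mentioned in this thread; which instruction sets the PSI image actually requires is not stated here, and the function names are hypothetical.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU feature flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_simd(cpuinfo_text: str, required=("avx", "avx2")) -> list:
    """List the required flags that the CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    # On non-x86 or non-Linux hosts this file may be absent or lack a 'flags' line.
    with open("/proc/cpuinfo", encoding="utf-8") as f:
        print("missing:", missing_simd(f.read()))
```

Because Docker containers share the host CPU, running this on the host and inside the kuscia container should give the same result. On a virtual machine, the hypervisor's CPU model can hide flags that the physical CPU supports.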

coderSun20201112 commented 4 months ago

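> Hello, avx and avx2 require the CPU to support the AVX instruction sets.

In that case, do the logs show whether my KKRT/RR22 failures happen because our server does not support avx/avx2? If avx/avx2 is not the cause, how should I fix this?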

github-actions[bot] commented 3 months ago

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.