secretflow / secretflow

A unified framework for privacy-preserving data analysis and machine learning
https://www.secretflow.org.cn/docs/secretflow/en/
Apache License 2.0

secretpad all-in-one centralized deployment: two-party PSI task fails with OOMKilled #1508

Open magic-hya opened 1 month ago

magic-hya commented 1 month ago

Issue Type

Bug

Source

binary

Secretflow Version

secretpad all in one 1.9.0b2

OS Platform and Distribution

CentOS Linux 7

Python version

3.10.13

Bazel version

No response

GCC/Compiler version

No response

What happened and what you expected to happen.

Both participants uploaded their data and granted authorization, using the joint audience-selection (联合圈人) template.
The job failed when it reached the PSI (private set intersection) step.

Reproduction code to reproduce the issue.

Detailed logs:

2024-09-24 17:59:56 INFO the jobId=nspz, taskId=nspz-eaaqicss-node-3 start ...
2024-09-24 18:00:13 INFO the jobId=nspz, taskId=nspz-eaaqicss-node-3 failed: party gawvtmko failed msg: container[secretflow] terminated state reason "OOMKilled", message: "rmat: \"csv\"
    null_strs: \"\"
  }
}
output_uris: \"nspz_eaaqicss_node_3_output_0\"
checkpoint_uri: \"cknspz-eaaqicss-node-3-output-0\"

--

2024-09-24 10:00:06,050|gawvtmko|INFO|secretflow|entry.py:comp_eval:185| 
--
*storage_config* 

type: \"local_fs\"
local_fs {
  wd: \"/tmp/sf_nspz-eaaqicss-node-3_gawvtmko\"
}

--

2024-09-24 10:00:06,050|gawvtmko|INFO|secretflow|entry.py:comp_eval:186| 
--
*cluster_config* 

desc {
  parties: \"kkfmcgjo\"
  parties: \"gawvtmko\"
  devices {
    name: \"spu\"
    type: \"spu\"
    parties: \"kkfmcgjo\"
    parties: \"gawvtmko\"
    config: \"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"SEMI2K\\\",\\\"field\\\":\\\"FM128\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"
  }
  devices {
    name: \"heu\"
    type: \"heu\"
    parties: \"kkfmcgjo\"
    parties: \"gawvtmko\"
    config: \"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"
  }
  ray_fed_config {
    cross_silo_comm_backend: \"brpc_link\"
  }
}
public_config {
  ray_fed_config {
    parties: \"kkfmcgjo\"
    parties: \"gawvtmko\"
    addresses: \"nspz-eaaqicss-node-3-0-fed.kkfmcgjo.svc:80\"
    addresses: \"0.0.0.0:28234\"
  }
  spu_configs {
    name: \"spu\"
    parties: \"kkfmcgjo\"
    parties: \"gawvtmko\"
    addresses: \"http://nspz-eaaqicss-node-3-0-spu.kkfmcgjo.svc:80\"
    addresses: \"0.0.0.0:28233\"
  }
}
private_config {
  self_party: \"gawvtmko\"
  ray_head_addr: \"nspz-eaaqicss-node-3-0-global.gawvtmko.svc:28235\"
}

--

2024-09-24 10:00:06,051|gawvtmko|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-24 10:00:06,051\tINFO worker.py:1540 -- Connecting to existing Ray cluster at address: nspz-eaaqicss-node-3-0-global.gawvtmko.svc:28235...
2024-09-24 10:00:06,062|gawvtmko|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140013885228272 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/node_ip_address.json.lock
2024-09-24 10:00:06,063|gawvtmko|DEBUG|secretflow|_api.py:acquire:334| Lock 140013885228272 acquired on /tmp/ray/session_2024-09-24_10-00-03_769745_226/node_ip_address.json.lock
2024-09-24 10:00:06,063|gawvtmko|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140013885228272 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/node_ip_address.json.lock
2024-09-24 10:00:06,064|gawvtmko|DEBUG|secretflow|_api.py:release:367| Lock 140013885228272 released on /tmp/ray/session_2024-09-24_10-00-03_769745_226/node_ip_address.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140013885236096 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:acquire:334| Lock 140013885236096 acquired on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140013885236096 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:release:367| Lock 140013885236096 released on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140013885235280 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,067|gawvtmko|DEBUG|secretflow|_api.py:acquire:334| Lock 140013885235280 acquired on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140013885235280 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:release:367| Lock 140013885235280 released on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140013885236096 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:acquire:334| Lock 140013885236096 acquired on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140013885236096 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:release:367| Lock 140013885236096 released on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140013885235280 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:acquire:334| Lock 140013885235280 acquired on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,068|gawvtmko|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140013885235280 on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,069|gawvtmko|DEBUG|secretflow|_api.py:release:367| Lock 140013885235280 released on /tmp/ray/session_2024-09-24_10-00-03_769745_226/ports_by_node.json.lock
2024-09-24 10:00:06,069\tINFO worker.py:1724 -- Connected to Ray cluster.
2024-09-24 10:00:08.501 INFO api.py:233 [gawvtmko] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'kkfmcgjo': 'http://nspz-eaaqicss-node-3-0-fed.kkfmcgjo.svc:80', 'gawvtmko': '0.0.0.0:28234'}, 'CURRENT_PARTY_NAME': 'gawvtmko', 'TLS_CONFIG': {}}
\x1b[33m(raylet)\x1b[0m [2024-09-24 10:00:08,408 I 1153 1153] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
2024-09-24 10:00:11.799 ERROR entry.py:577 [gawvtmko] -- [Anonymous_job] comp_eval exception
Traceback (most recent call last):
  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 575, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\", line 190, in comp_eval
    res = comp.eval(
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1187, in eval
    self._setup_sf_cluster(cluster_config)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 995, in _setup_sf_cluster
    init(
  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 589, in init
    fed.init(
  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\", line 244, in init
    _start_sender_receiver_proxy(
  File \"/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py\", line 461, in _start_sender_receiver_proxy
    server_state = ray.get(
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\", line 103, in wrapper
    return func(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\", line 2626, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
\tclass_name: SenderReceiverProxyActor
\tactor_id: 74f58647f1b1b66a5c1e9d9e01000000
\tpid: 2426
\tname: SenderReceiverProxyActor
\tnamespace: d7f020d5-1209-44eb-8b45-2f9d8ad1e78f
\tip: nspz-eaaqicss-node-3-0-global.gawvtmko.svc
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
"
wangzul commented 1 month ago

Please also provide the other party's logs [the entire log file], so we can locate the root cause.

magic-hya commented 1 month ago

Right now I can only see the center's logs. Where can I get the other parties' logs?

wangzul commented 1 month ago

Right now I can only see the center's logs. Where can I get the other parties' logs?

Option 1: follow the official docs: [View the reason for job failure -> View the task Pod details -> View the detailed task Pod logs].
Option 2: run docker inspect -f '{{ range .Mounts }}{{ if eq .Destination "/home/kuscia/var/stdout" }}{{ . }}{{ end }}{{ end }}' <kuscia lite node name or id>. The output looks like {bind xxxx /home/kuscia/var/stdout true rprivate}, where xxxx is the local directory the logs are mounted to.

wangzul commented 1 month ago

Both methods require first looking up the job name on the master node: kubectl get kj -n cross-domain
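As a usage sketch of option 2 (the container name is a placeholder; .Source is used here instead of the full mount struct so that only the host path is printed):

# Print the host directory that backs /home/kuscia/var/stdout in the lite container
docker inspect -f '{{ range .Mounts }}{{ if eq .Destination "/home/kuscia/var/stdout" }}{{ .Source }}{{ end }}{{ end }}' <kuscia-lite-container-name-or-id>
# Then browse the per-pod log files under the printed directory
ls <printed-host-directory>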

magic-hya commented 1 month ago

1. Entering the node and running kubectl get kj returns an error

bash-5.2# kubectl get kj
E0925 11:14:52.820799   28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.821673   28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.823168   28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.823594   28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.825292   28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

2. The job names are visible inside the center container

bash-5.2# kubectl get kj -n cross-domain
NAME   STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
kbui   47h         47h              47h                 Succeeded
rsxg   47h         47h              47h                 Succeeded
mtcu   45h         45h              45h                 Succeeded
dikc   45h         45h              45h                 Succeeded
buis   43h         43h              43h                 Succeeded
faja   43h         43h              43h                 Succeeded
csnn   17h         17h              17h                 Failed
qmdm   17h         17h              17h                 Failed
nspz   17h         17h              17h                 Failed
lzzp   13m         12m              12m                 Failed

3. Retrieved the following logs from the mounted directory: 73.log, 74.log. The error occurred on node 74.
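For more detail on why a specific job above (for example nspz) ended up Failed, a generic kubectl sketch that can be run inside the master container, using the kj shortname already shown in this thread:

# Show the full KusciaJob object, including its status conditions
kubectl get kj nspz -n cross-domain -o yaml
# Or the condensed status/event view
kubectl describe kj nspz -n cross-domain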

wangzul commented 1 month ago

Did the job use the sample data bundled with allinone, or custom data?

wangzul commented 1 month ago

Also, please share the memory configuration of the kuscia nodes so we can narrow the problem down quickly: run free -h on each host, and docker stats [with the container name or id of the master, alice, and bob kuscia nodes respectively].
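A one-shot way to capture those numbers instead of the continuously refreshing docker stats view (container names are taken from the stats output later in this thread; each command runs on its own host):

free -h
docker stats --no-stream root-kuscia-master            # on the master host
docker stats --no-stream root-kuscia-lite-kkfmcgjo     # on host 73
docker stats --no-stream root-kuscia-lite-gawvtmko     # on host 74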

magic-hya commented 1 month ago

1. The data is custom (psi_guest.csv and psi_host.csv); I did not see any bundled sample data. 2. The memory configuration is as follows:

[root@k8s-master73 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         49G         34G         21G        167G        171G
Swap:            0B          0B
[root@k8s-master74 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         34G        119G        3.4G         97G        211G
Swap:            0B          0B          0B
[root@k8s-master75 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        146G         25G        4.1G         80G         99G
Swap:            0B          0B

master
[root@k8s-master75 ~]# docker stats root-kuscia-master
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
0e093b4dee36   root-kuscia-master   3.13%     870.8MiB / 2GiB     42.52%    866MB / 1.29GB   8.19kB / 26.9GB   298
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
0e093b4dee36   root-kuscia-master   3.13%     870.8MiB / 2GiB     42.52%    866MB / 1.29GB   8.19kB / 26.9GB   298
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
0e093b4dee36   root-kuscia-master   17.00%    870.6MiB / 2GiB     42.51%    866MB / 1.29GB   8.19kB / 26.9GB   298
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
0e093b4dee36   root-kuscia-master   17.00%    870.6MiB / 2GiB     42.51%    866MB / 1.29GB   8.19kB / 26.9GB   298
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS

73
[root@k8s-master73 ~]# docker stats root-kuscia-lite-kkfmcgjo
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
4c1faca9b35b   root-kuscia-lite-kkfmcgjo   11.50%    1.062GiB / 4GiB     26.54%    134MB / 97.7MB   1.89GB / 3.72GB   327
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
4c1faca9b35b   root-kuscia-lite-kkfmcgjo   11.50%    1.062GiB / 4GiB     26.54%    134MB / 97.7MB   1.89GB / 3.72GB   327
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
4c1faca9b35b   root-kuscia-lite-kkfmcgjo   9.41%     1.062GiB / 4GiB     26.54%    134MB / 97.7MB   1.89GB / 3.72GB   327
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
4c1faca9b35b   root-kuscia-lite-kkfmcgjo   9.41%     1.062GiB / 4GiB     26.54%    134MB / 97.7MB   1.89GB / 3.72GB   327
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
4c1faca9b35b   root-kuscia-lite-kkfmcgjo   4.72%     1.062GiB / 4GiB     26.54%    134MB / 97.7MB   1.89GB / 3.72GB   327

74
[root@k8s-master74 ~]# docker stats root-kuscia-lite-gawvtmko
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
98493fda531f   root-kuscia-lite-gawvtmko   2.96%     857.5MiB / 4GiB     20.93%    129MB / 96.2MB   1.67GB / 2.58GB   336
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
98493fda531f   root-kuscia-lite-gawvtmko   2.96%     857.5MiB / 4GiB     20.93%    129MB / 96.2MB   1.67GB / 2.58GB   336
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
98493fda531f   root-kuscia-lite-gawvtmko   4.30%     857.3MiB / 4GiB     20.93%    129MB / 96.2MB   1.67GB / 2.58GB   336
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
98493fda531f   root-kuscia-lite-gawvtmko   4.30%     857.3MiB / 4GiB     20.93%    129MB / 96.2MB   1.67GB / 2.58GB   336
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
98493fda531f   root-kuscia-lite-gawvtmko   4.53%     857.5MiB / 4GiB     20.93%    129MB / 96.2MB   1.67GB / 2.58GB   336
wangzul commented 1 month ago

Please provide these logs from the kuscia lite node: the two files external.log and internal.log under /home/kuscia/var/logs/envoy, plus /home/kuscia/var/logs/kuscia.log (three log files in total).
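To copy those three files out of the lite container for sharing, a docker cp sketch (using the gawvtmko lite container name that appears in the stats output above):

docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/kuscia.log .
docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/envoy/internal.log .
docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/envoy/external.log .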

magic-hya commented 1 month ago

Retrieved the log files from the failing node 74: kuscia.log, internal.log, external.log

magic-hya commented 1 month ago

After redeploying the master with notls, the PSI step fails with the following error:

2024-09-25 19:28:53 INFO the jobId=anpo, taskId=anpo-oronmgzc-node-3 start ...
2024-09-25 19:29:45 INFO the jobId=anpo, taskId=anpo-oronmgzc-node-3 failed: party kvlohttz failed msg: container[secretflow] terminated state reason "Error", message: " log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
\x1b[36m(SenderReceiverProxyActor pid=3059)\x1b[0m 2024-09-25 11:29:43.660 INFO link.py:38 [kvlohttz] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
2024-09-25 11:29:43.721 ERROR component.py:1197 [kvlohttz] -- [Anonymous_job] eval on domain: \"data_prep\"
name: \"psi\"
version: \"0.0.7\"
attr_paths: \"input/input_table_1/key\"
attr_paths: \"input/input_table_2/key\"
attr_paths: \"protocol\"
attr_paths: \"sort_result\"
attr_paths: \"allow_duplicate_keys\"
attr_paths: \"allow_duplicate_keys/no/skip_duplicates_check\"
attr_paths: \"allow_duplicate_keys/no/receiver_parties\"
attr_paths: \"ecdh_curve\"
attrs {
  ss: \"id\"
}
attrs {
  ss: \"id\"
}
attrs {
  s: \"PROTOCOL_RR22\"
}
attrs {
  b: true
}
attrs {
  s: \"no\"
}
attrs {
  is_na: true
}
attrs {
  ss: \"kvlohttz\"
}
attrs {
  s: \"CURVE_FOURQ\"
}
inputs {
  name: \"psi_73\"
  type: \"sf.table.individual\"
  meta {
    type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"
    value: \"\
\\177\
\\002id\\022\\001y\\022\\002x0\\022\\002x1\\022\\002x2\\022\\002x3\\022\\002x4\\022\\002x5\\022\\002x6\\022\\002x7\\022\\002x8\\022\\002x9\\\"\\003int*\\003int*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"
  }
  data_refs {
    uri: \"psi_guest_761080836.csv\"
    party: \"kvlohttz\"
    format: \"csv\"
    null_strs: \"\"
  }
}
inputs {
  name: \"psi_74\"
  type: \"sf.table.individual\"
  meta {
    type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"
    value: \"\
\\357\\001\
\\002id\\022\\002x0\\022\\002x1\\022\\002x2\\022\\002x3\\022\\002x4\\022\\002x5\\022\\002x6\\022\\002x7\\022\\002x8\\022\\002x9\\022\\003x10\\022\\003x11\\022\\003x12\\022\\003x13\\022\\003x14\\022\\003x15\\022\\003x16\\022\\003x17\\022\\003x18\\022\\003x19\\\"\\003int*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"
  }
  data_refs {
    uri: \"psi_host_2104018944.csv\"
    party: \"ghqbgirn\"
    format: \"csv\"
    null_strs: \"\"
  }
}
output_uris: \"anpo_oronmgzc_node_3_output_0\"
checkpoint_uri: \"ckanpo-oronmgzc-node-3-output-0\"
 failed
Traceback (most recent call last):
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1194, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 471, in two_party_balanced_psi_eval_fn
    input_path = get_input_path(ctx, [receiver_info, sender_info])
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 355, in get_input_path
    download_files(ctx, remote_path, download_path)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\", line 618, in download_files
    wait(waits)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\", line 156, in get
    return fed.get(object_refs)
  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\", line 621, in get
    values = ray.get(ray_refs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\", line 103, in wrapper
    return func(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): \x1b[36mray::SenderReceiverProxyActor.get_data()\x1b[39m (pid=3059, ip=anpo-oronmgzc-node-3-0-global.kvlohttz.svc, actor_id=92b4cdca49b0e9a9b12b3fbc01000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fa0e422b160>)
  File \"/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py\", line 379, in get_data
    data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
  File \"/usr/local/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py\", line 109, in get_data
    msg = self._linker.recv(rank)
AttributeError: 'BrpcLinkSenderReceiverProxy' object has no attribute '_linker'
2024-09-25 11:29:43.758 INFO api.py:342 [kvlohttz] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-25 11:29:43.759 INFO api.py:356 [kvlohttz] -- [Anonymous_job] No wait for data sending.
2024-09-25 11:29:43.761 INFO message_queue.py:72 [kvlohttz] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-25 11:29:43.761 INFO message_queue.py:72 [kvlohttz] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-25 11:29:43.761 INFO api.py:384 [kvlohttz] -- [Anonymous_job] Shutdowned rayfed.
2024-09-25 11:29:44.321 ERROR entry.py:577 [kvlohttz] -- [Anonymous_job] comp_eval exception
Traceback (most recent call last):
  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 575, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\", line 190, in comp_eval
    res = comp.eval(
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1199, in eval
    raise e
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1194, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 471, in two_party_balanced_psi_eval_fn
    input_path = get_input_path(ctx, [receiver_info, sender_info])
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 355, in get_input_path
    download_files(ctx, remote_path, download_path)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\", line 618, in download_files
    wait(waits)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\", line 156, in get
    return fed.get(object_refs)
  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\", line 621, in get
    values = ray.get(ray_refs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\", line 103, in wrapper
    return func(*args, **kwargs)
  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): \x1b[36mray::SenderReceiverProxyActor.get_data()\x1b[39m (pid=3059, ip=anpo-oronmgzc-node-3-0-global.kvlohttz.svc, actor_id=92b4cdca49b0e9a9b12b3fbc01000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fa0e422b160>)
  File \"/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py\", line 379, in get_data
    data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
  File \"/usr/local/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py\", line 109, in get_data
    msg = self._linker.recv(rank)
AttributeError: 'BrpcLinkSenderReceiverProxy' object has no attribute '_linker'

The communication logs show 503 responses:

bash-5.2# tail -f internal.log
127.0.0.1 - [25/Sep/2024:11:29:41 +0000] kvlohttz kuscia-handshake.ghqbgirn.svc "GET /handshake HTTP/1.1" 58dba76502db6f22 58dba76502db6f22 200 - - 288 15 0 15 0 - -
10.88.0.3 - [25/Sep/2024:11:29:41 +0000] kvlohttz anpo-oronmgzc-node-3-0-spu.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" 0e6eab82c865c213 0e6eab82c865c213 503 - 48 579 1 0 1 0 - -
10.88.0.3 - [25/Sep/2024:11:29:42 +0000] kvlohttz anpo-oronmgzc-node-3-0-fed.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" e5433faca87e305b e5433faca87e305b 503 - 40 579 1 0 1 0 - -
10.88.0.3 - [25/Sep/2024:11:29:42 +0000] kvlohttz anpo-oronmgzc-node-3-0-fed.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" 6747dedbab9e8014 6747dedbab9e8014 503 - 149 579 1 0 1 0 - -
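A quick way to gauge how widespread the 503s are (paths as given earlier in the thread; the grep patterns are a rough sketch, not an exact envoy access-log parser):

# Count 503 responses per envoy log file
grep -c ' 503 ' /home/kuscia/var/logs/envoy/internal.log /home/kuscia/var/logs/envoy/external.log
# Inspect the most recent Push requests to the fed/spu endpoints
grep 'ReceiverService/Push' /home/kuscia/var/logs/envoy/internal.log | tail -n 20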