magic-hya opened this issue 1 month ago (Open)
Please also provide the other party's logs (the whole log file) so we can locate the cause.
Right now I can only see the center's logs. Where do I get the logs for the other parties?
Option 1: follow the official docs: [Check why the job failed -> View the task Pod details -> View the task Pod logs].
Option 2: run docker inspect -f '{{ range .Mounts }}{{ if eq .Destination "/home/kuscia/var/stdout" }}{{ . }}{{ end }}{{ end }}' <kuscia lite node name or id>. The output looks like {bind xxxx /home/kuscia/var/stdout true rprivate}, where xxxx is the local directory the logs are mounted to.
Both options require first looking up the job/task name on the master node: kubectl get kj -n cross-domain
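A consolidated sketch of both steps, assuming the container names that appear later in this thread (root-kuscia-master and root-kuscia-lite-kkfmcgjo); the exact layout under the stdout mount may differ between deployments:

# 1. On the master container, list KusciaJobs to get the job/task name.
docker exec -it root-kuscia-master kubectl get kj -n cross-domain

# 2. On the lite node's host, print the host directory that backs /home/kuscia/var/stdout
#    (this variant prints only the Source field instead of the whole mount struct).
docker inspect -f '{{ range .Mounts }}{{ if eq .Destination "/home/kuscia/var/stdout" }}{{ .Source }}{{ end }}{{ end }}' root-kuscia-lite-kkfmcgjo

# 3. Browse that directory for entries containing the task name and read the pod logs there.
ls <host-stdout-dir>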
1. After entering the node, running kubectl get kj reports an error:
bash-5.2# kubectl get kj
E0925 11:14:52.820799 28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.821673 28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.823168 28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.823594 28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0925 11:14:52.825292 28753 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
2. Inside the center container, the task names are visible:
bash-5.2# kubectl get kj -n cross-domain
NAME STARTTIME COMPLETIONTIME LASTRECONCILETIME PHASE
kbui 47h 47h 47h Succeeded
rsxg 47h 47h 47h Succeeded
mtcu 45h 45h 45h Succeeded
dikc 45h 45h 45h Succeeded
buis 43h 43h 43h Succeeded
faja 43h 43h 43h Succeeded
csnn 17h 17h 17h Failed
qmdm 17h 17h 17h Failed
nspz 17h 17h 17h Failed
lzzp 13m 12m 12m Failed
Did the jobs use the data bundled with allinone, or custom data?
Also, please share the memory configuration of the kuscia nodes so we can quickly narrow down the problem: run free -h on each host, and docker stats <name or id> for the master, alice, and bob kuscia containers respectively.
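For reference, a minimal way to capture a one-shot snapshot instead of the refreshing view (container names are the ones reported later in this thread):

free -h
docker stats --no-stream root-kuscia-master
docker stats --no-stream root-kuscia-lite-kkfmcgjo root-kuscia-lite-gawvtmko

The --no-stream flag makes docker stats print a single sample and exit, which is easier to paste into an issue.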
1. The data is custom; I did not see the bundled data psi_guest.csv / psi_host.csv.
2. The memory configuration is as follows:
[root@k8s-master73 ~]# free -h
total used free shared buff/cache available
Mem: 251G 49G 34G 21G 167G 171G
Swap: 0B 0B
[root@k8s-master74 ~]# free -h
total used free shared buff/cache available
Mem: 251G 34G 119G 3.4G 97G 211G
Swap: 0B 0B 0B
[root@k8s-master75 ~]# free -h
total used free shared buff/cache available
Mem: 251G 146G 25G 4.1G 80G 99G
Swap: 0B 0B
master
[root@k8s-master75 ~]# docker stats root-kuscia-master
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0e093b4dee36 root-kuscia-master 3.13% 870.8MiB / 2GiB 42.52% 866MB / 1.29GB 8.19kB / 26.9GB 298
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0e093b4dee36 root-kuscia-master 3.13% 870.8MiB / 2GiB 42.52% 866MB / 1.29GB 8.19kB / 26.9GB 298
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0e093b4dee36 root-kuscia-master 17.00% 870.6MiB / 2GiB 42.51% 866MB / 1.29GB 8.19kB / 26.9GB 298
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0e093b4dee36 root-kuscia-master 17.00% 870.6MiB / 2GiB 42.51% 866MB / 1.29GB 8.19kB / 26.9GB 298
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
73
[root@k8s-master73 ~]# docker stats root-kuscia-lite-kkfmcgjo
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4c1faca9b35b root-kuscia-lite-kkfmcgjo 11.50% 1.062GiB / 4GiB 26.54% 134MB / 97.7MB 1.89GB / 3.72GB 327
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4c1faca9b35b root-kuscia-lite-kkfmcgjo 11.50% 1.062GiB / 4GiB 26.54% 134MB / 97.7MB 1.89GB / 3.72GB 327
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4c1faca9b35b root-kuscia-lite-kkfmcgjo 9.41% 1.062GiB / 4GiB 26.54% 134MB / 97.7MB 1.89GB / 3.72GB 327
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4c1faca9b35b root-kuscia-lite-kkfmcgjo 9.41% 1.062GiB / 4GiB 26.54% 134MB / 97.7MB 1.89GB / 3.72GB 327
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
4c1faca9b35b root-kuscia-lite-kkfmcgjo 4.72% 1.062GiB / 4GiB 26.54% 134MB / 97.7MB 1.89GB / 3.72GB 327
74
[root@k8s-master74 ~]# docker stats root-kuscia-lite-gawvtmko
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
98493fda531f root-kuscia-lite-gawvtmko 2.96% 857.5MiB / 4GiB 20.93% 129MB / 96.2MB 1.67GB / 2.58GB 336
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
98493fda531f root-kuscia-lite-gawvtmko 2.96% 857.5MiB / 4GiB 20.93% 129MB / 96.2MB 1.67GB / 2.58GB 336
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
98493fda531f root-kuscia-lite-gawvtmko 4.30% 857.3MiB / 4GiB 20.93% 129MB / 96.2MB 1.67GB / 2.58GB 336
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
98493fda531f root-kuscia-lite-gawvtmko 4.30% 857.3MiB / 4GiB 20.93% 129MB / 96.2MB 1.67GB / 2.58GB 336
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
98493fda531f root-kuscia-lite-gawvtmko 4.53% 857.5MiB / 4GiB 20.93% 129MB / 96.2MB 1.67GB / 2.58GB 336
Could you please provide the logs?
From the kuscia lite node, grab the two files under /home/kuscia/var/logs/envoy: external.log and internal.log,
plus /home/kuscia/var/logs/kuscia.log.
Three log files in total.
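A minimal sketch for copying the three files off the failing lite node with docker cp (the container name is the one reported for node 74 in this thread):

# Copy the kuscia log and the two envoy access logs into the current directory on the host.
docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/kuscia.log ./kuscia.log
docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/envoy/internal.log ./internal.log
docker cp root-kuscia-lite-gawvtmko:/home/kuscia/var/logs/envoy/external.log ./external.log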
Got the log files kuscia.log, internal.log, and external.log from node 74, where the error occurred.
After redeploying the master with notls, the intersection (PSI) job fails:
2024-09-25 19:28:53 INFO the jobId=anpo, taskId=anpo-oronmgzc-node-3 start ...
2024-09-25 19:29:45 INFO the jobId=anpo, taskId=anpo-oronmgzc-node-3 failed: party kvlohttz failed msg: container[secretflow] terminated state reason "Error", message: " log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
\x1b[36m(SenderReceiverProxyActor pid=3059)\x1b[0m 2024-09-25 11:29:43.660 INFO link.py:38 [kvlohttz] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
2024-09-25 11:29:43.721 ERROR component.py:1197 [kvlohttz] -- [Anonymous_job] eval on domain: \"data_prep\"
name: \"psi\"
version: \"0.0.7\"
attr_paths: \"input/input_table_1/key\"
attr_paths: \"input/input_table_2/key\"
attr_paths: \"protocol\"
attr_paths: \"sort_result\"
attr_paths: \"allow_duplicate_keys\"
attr_paths: \"allow_duplicate_keys/no/skip_duplicates_check\"
attr_paths: \"allow_duplicate_keys/no/receiver_parties\"
attr_paths: \"ecdh_curve\"
attrs {
ss: \"id\"
}
attrs {
ss: \"id\"
}
attrs {
s: \"PROTOCOL_RR22\"
}
attrs {
b: true
}
attrs {
s: \"no\"
}
attrs {
is_na: true
}
attrs {
ss: \"kvlohttz\"
}
attrs {
s: \"CURVE_FOURQ\"
}
inputs {
name: \"psi_73\"
type: \"sf.table.individual\"
meta {
type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"
value: \"\
\\177\
\\002id\\022\\001y\\022\\002x0\\022\\002x1\\022\\002x2\\022\\002x3\\022\\002x4\\022\\002x5\\022\\002x6\\022\\002x7\\022\\002x8\\022\\002x9\\\"\\003int*\\003int*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"
}
data_refs {
uri: \"psi_guest_761080836.csv\"
party: \"kvlohttz\"
format: \"csv\"
null_strs: \"\"
}
}
inputs {
name: \"psi_74\"
type: \"sf.table.individual\"
meta {
type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"
value: \"\
\\357\\001\
\\002id\\022\\002x0\\022\\002x1\\022\\002x2\\022\\002x3\\022\\002x4\\022\\002x5\\022\\002x6\\022\\002x7\\022\\002x8\\022\\002x9\\022\\003x10\\022\\003x11\\022\\003x12\\022\\003x13\\022\\003x14\\022\\003x15\\022\\003x16\\022\\003x17\\022\\003x18\\022\\003x19\\\"\\003int*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"
}
data_refs {
uri: \"psi_host_2104018944.csv\"
party: \"ghqbgirn\"
format: \"csv\"
null_strs: \"\"
}
}
output_uris: \"anpo_oronmgzc_node_3_output_0\"
checkpoint_uri: \"ckanpo-oronmgzc-node-3-output-0\"
failed
Traceback (most recent call last):
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1194, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 471, in two_party_balanced_psi_eval_fn
input_path = get_input_path(ctx, [receiver_info, sender_info])
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 355, in get_input_path
download_files(ctx, remote_path, download_path)
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\", line 618, in download_files
wait(waits)
File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 213, in wait
reveal([o.device(lambda o: None)(o) for o in objs])
File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 162, in reveal
all_object = sfd.get(all_object_refs)
File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\", line 156, in get
return fed.get(object_refs)
File \"/usr/local/lib/python3.10/site-packages/fed/api.py\", line 621, in get
values = ray.get(ray_refs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\", line 103, in wrapper
return func(*args, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): \x1b[36mray::SenderReceiverProxyActor.get_data()\x1b[39m (pid=3059, ip=anpo-oronmgzc-node-3-0-global.kvlohttz.svc, actor_id=92b4cdca49b0e9a9b12b3fbc01000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fa0e422b160>)
File \"/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py\", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File \"/usr/local/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py\", line 109, in get_data
msg = self._linker.recv(rank)
AttributeError: 'BrpcLinkSenderReceiverProxy' object has no attribute '_linker'
2024-09-25 11:29:43.758 INFO api.py:342 [kvlohttz] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-25 11:29:43.759 INFO api.py:356 [kvlohttz] -- [Anonymous_job] No wait for data sending.
2024-09-25 11:29:43.761 INFO message_queue.py:72 [kvlohttz] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-25 11:29:43.761 INFO message_queue.py:72 [kvlohttz] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-25 11:29:43.761 INFO api.py:384 [kvlohttz] -- [Anonymous_job] Shutdowned rayfed.
2024-09-25 11:29:44.321 ERROR entry.py:577 [kvlohttz] -- [Anonymous_job] comp_eval exception
Traceback (most recent call last):
File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 575, in main
res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\", line 190, in comp_eval
res = comp.eval(
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1199, in eval
raise e
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\", line 1194, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 471, in two_party_balanced_psi_eval_fn
input_path = get_input_path(ctx, [receiver_info, sender_info])
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\", line 355, in get_input_path
download_files(ctx, remote_path, download_path)
File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\", line 618, in download_files
wait(waits)
File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 213, in wait
reveal([o.device(lambda o: None)(o) for o in objs])
File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line 162, in reveal
all_object = sfd.get(all_object_refs)
File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\", line 156, in get
return fed.get(object_refs)
File \"/usr/local/lib/python3.10/site-packages/fed/api.py\", line 621, in get
values = ray.get(ray_refs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\", line 103, in wrapper
return func(*args, **kwargs)
File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): \x1b[36mray::SenderReceiverProxyActor.get_data()\x1b[39m (pid=3059, ip=anpo-oronmgzc-node-3-0-global.kvlohttz.svc, actor_id=92b4cdca49b0e9a9b12b3fbc01000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fa0e422b160>)
File \"/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py\", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File \"/usr/local/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py\", line 109, in get_data
msg = self._linker.recv(rank)
AttributeError: 'BrpcLinkSenderReceiverProxy' object has no attribute '_linker'
The communication log shows 503s:
bash-5.2# tail -f internal.log
127.0.0.1 - [25/Sep/2024:11:29:41 +0000] kvlohttz kuscia-handshake.ghqbgirn.svc "GET /handshake HTTP/1.1" 58dba76502db6f22 58dba76502db6f22 200 - - 288 15 0 15 0 - -
10.88.0.3 - [25/Sep/2024:11:29:41 +0000] kvlohttz anpo-oronmgzc-node-3-0-spu.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" 0e6eab82c865c213 0e6eab82c865c213 503 - 48 579 1 0 1 0 - -
10.88.0.3 - [25/Sep/2024:11:29:42 +0000] kvlohttz anpo-oronmgzc-node-3-0-fed.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" e5433faca87e305b e5433faca87e305b 503 - 40 579 1 0 1 0 - -
10.88.0.3 - [25/Sep/2024:11:29:42 +0000] kvlohttz anpo-oronmgzc-node-3-0-fed.ghqbgirn.svc "POST /org.interconnection.link.ReceiverService/Push HTTP/1.1" 6747dedbab9e8014 6747dedbab9e8014 503 - 149 579 1 0 1 0 - -
Issue Type
Bug
Source
binary
Secretflow Version
secretpad all in one 1.9.0b2
OS Platform and Distribution
CentOS Linux 7
Python version
3.10.13
Bazel version
No response
GCC/Compiler version
No response
What happened and what you expected to happen.
Reproduction code to reproduce the issue.