secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Error when running multiple training tasks at the same time with Kuscia #382

Closed JiaIcecream closed 4 months ago

JiaIcecream commented 4 months ago

Issue Type

Api Usage

Search for existing issues similar to yours

Yes

Kuscia Version

kuscia v0.8.0b0

Link to Relevant Documentation

No response

Question Details

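I used Kuscia to start three vertical logistic regression training tasks at the same time, and the training fails. The error log of the PSI component is as follows: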
2024-07-17T15:47:17.179673002+08:00 stderr F Traceback (most recent call last):
2024-07-17T15:47:17.179689697+08:00 stderr F   File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-07-17T15:47:17.183318681+08:00 stderr F     return _run_code(code, main_globals, None,
2024-07-17T15:47:17.183328344+08:00 stderr F   File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
2024-07-17T15:47:17.183406804+08:00 stderr F     exec(code, run_globals)
2024-07-17T15:47:17.183412325+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
2024-07-17T15:47:17.183995132+08:00 stderr F     main()
2024-07-17T15:47:17.184000237+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-07-17T15:47:17.185233847+08:00 stderr F     return self.main(*args, **kwargs)
2024-07-17T15:47:17.185248528+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
2024-07-17T15:47:17.185412815+08:00 stderr F     rv = self.invoke(ctx)
2024-07-17T15:47:17.185428709+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-07-17T15:47:17.185616281+08:00 stderr F     return ctx.invoke(self.callback, **ctx.params)
2024-07-17T15:47:17.185621678+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-07-17T15:47:17.185710346+08:00 stderr F     return __callback(*args, **kwargs)
2024-07-17T15:47:17.18572442+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 529, in main
2024-07-17T15:47:17.185817394+08:00 stderr F     postprocess_sf_node_eval_result(
2024-07-17T15:47:17.185820945+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 470, in postprocess_sf_node_eval_result
2024-07-17T15:47:17.185897918+08:00 stderr F     create_domain_data_in_dm(domaindata_stub, domain_data)
2024-07-17T15:47:17.185901479+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/datamesh.py", line 87, in create_domain_data_in_dm
2024-07-17T15:47:17.186282431+08:00 stderr F     ret = stub.CreateDomainData(
2024-07-17T15:47:17.18628769+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
2024-07-17T15:47:17.206709482+08:00 stderr F     return _end_unary_response_blocking(state, call, False, None)
2024-07-17T15:47:17.2067265+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
2024-07-17T15:47:17.206832067+08:00 stderr F     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
2024-07-17T15:47:17.206932965+08:00 stderr F grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2024-07-17T15:47:17.206943831+08:00 stderr F    status = StatusCode.UNAVAILABLE
2024-07-17T15:47:17.20694663+08:00 stderr F     details = "Socket closed"
2024-07-17T15:47:17.20694958+08:00 stderr F     debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-07-17T07:47:17.160348456+00:00", grpc_status:14, grpc_message:"Socket closed"}"
2024-07-17T15:47:17.206951873+08:00 stderr F >
2024-07-17T15:47:17.263906907+08:00 stdout F ESC[36m(SPURuntime(device_id=None, party=com2023011620060497797) pid=1202)ESC[0m [2024-07-17 07:46:56.318] [info] [key.cc:91] Executing sort scripts: tail -n +2 /home/kuscia/var/storage/data/tmp-sort-in-96c368b7-e299-4f87-8fce-3906adbe4e63 | LC_ALL=C sort  --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/home/kuscia/var/storage/data/tmp-sort-out-96c368b7-e299-4f87-8fce-3906adbe4e63
2024-07-17T15:47:17.263918978+08:00 stdout F ESC[36m(SPURuntime(device_id=None, party=com2023011620060497797) pid=1202)ESC[0m [2024-07-17 07:46:56.379] [info] [key.cc:93] Finished sort scripts: tail -n +2 /home/kuscia/var/storage/data/tmp-sort-in-96c368b7-e299-4f87-8fce-3906adbe4e63 | LC_ALL=C sort  --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/home/kuscia/var/storage/data/tmp-sort-out-96c368b7-e299-4f87-8fce-3906adbe4e63, ret=0
2024-07-17T15:47:17.263921872+08:00 stdout F ESC[36m(SPURuntime(device_id=None, party=com2023011620060497797) pid=1202)ESC[0m [2024-07-17 07:46:56.382] [info] [interface.cc:218] [AbstractPsiParty::Finalize][Generate result] end
2024-07-17T15:47:17.263925361+08:00 stdout F ESC[36m(SPURuntime(device_id=None, party=com2023011620060497797) pid=1202)ESC[0m [2024-07-17 07:46:56.384] [info] [interface.cc:250] [AbstractPsiParty::Finalize] end
2024-07-17T15:47:17.263928503+08:00 stdout F ESC[36m(SPURuntime(device_id=None, party=com2023011620060497797) pid=1202)ESC[0m [2024-07-17 07:46:56.386] [info] [launch.cc:95] Trace has been written to /tmp/psi_75036749-10cc-4e85-9efc-6dca138ba225.trace.
2024-07-17T15:47:17.263932249+08:00 stdout F ESC[33m(raylet)ESC[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff63b15f17d28c423010d9e56001000000 Worker ID: 52a100b534e1e0c55fd93fdb57b4a964ad77f49750c93f146f543ff1 Node ID: b7bc7a079071805f2fabdf13107538313e19e4f0979dd92ac5591ba9 Worker IP address: reefr2024071715451667968-qagbmwrn-node-35-partner-0-global.com2023011620060497797.svc Worker port: 10020 Worker PID: 1202 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

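How can I pin down the specific cause of this error? And how can I fix it so that multiple vertical logistic regression training tasks can run at the same time?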
lanyy9527 commented 4 months ago

Hello, could you please provide your system configuration, memory, disk, and data size?

JiaIcecream commented 4 months ago

Hello, we are using a virtual machine with the following configuration:

  Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
  Address sizes: 46 bits physical, 48 bits virtual
  CPU(s): 16
  On-line CPU(s) list: 0-15
  Thread(s) per core: 1
  Core(s) per socket: 4
  Socket(s): 4
  NUMA node(s): 1
  Vendor ID: GenuineIntel
  CPU family: 6
  Model: 85
  Model name: Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz
  Stepping: 11
  CPU MHz: 2494.140
  BogoMIPS: 4988.28
  Hypervisor vendor: Microsoft
  Virtualization type: full
  L1d cache: 256 KiB
  L1i cache: 256 KiB
  L2 cache: 8 MiB
  L3 cache: 24.8 MiB
  NUMA node0 CPU(s): 0-15

Memory usage while the tasks were running:

            total    used    free    shared    buff/cache    available
  Mem:      64298   47132    1162      3381         16004        13078
  Swap:      2047    2047       0

The data are alice.csv and bob.csv from Kuscia's default datasource.
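
As a rough illustration of how the memory figures reported above can be read, the following Python sketch computes the share of memory still available and checks whether swap is exhausted. The numbers are copied from the report; treating them as MiB is an assumption, and the function names are hypothetical helpers, not part of Kuscia.

```python
# Figures copied from the memory report above.
# Assumption: the values are in MiB; the report does not state the unit.
MEM = {"total": 64298, "used": 47132, "free": 1162,
       "shared": 3381, "buff_cache": 16004, "available": 13078}
SWAP = {"total": 2047, "used": 2047, "free": 0}


def usage_ratio(used: int, total: int) -> float:
    """Return used/total as a fraction, or 0.0 when total is zero."""
    return used / total if total else 0.0


def summarize(mem: dict, swap: dict) -> dict:
    """Derive a few simple memory-pressure indicators from the raw figures."""
    return {
        # Share of physical memory the kernel reports as still available.
        "mem_available_ratio": usage_ratio(mem["available"], mem["total"]),
        # Share of swap space currently in use.
        "swap_used_ratio": usage_ratio(swap["used"], swap["total"]),
        # True when swap exists and none of it is free.
        "swap_exhausted": swap["total"] > 0 and swap["free"] == 0,
    }


if __name__ == "__main__":
    for key, value in summarize(MEM, SWAP).items():
        print(f"{key}: {value}")
```

With these figures, roughly a fifth of physical memory is reported as available and swap is fully used, which suggests the host was under memory pressure while the tasks ran. It does not by itself show how much memory each Kuscia container was allowed to use.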

lanyy9527 commented 4 months ago

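Hello, based on the information you provided, you can troubleshoot the OOM problem as follows:

  1. Your system currently has only about 1 GB of free memory. Confirm that enough available memory can be allocated to the Kuscia container; a Kuscia container requires at least 6 GB.
  2. Use docker stats to check whether the memory of each Kuscia node reaches 6 GB. If it does not, use docker update --memory to adjust the memory resource.
  3. If the system has enough available memory but the task still reports an OOM error, use top to look for other processes with high memory usage; if there are any, kill them and rerun the task.
  4. Try restarting Docker to see whether that helps.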
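
The check in step 2 can be sketched in Python as follows. This is a minimal sketch, not an official Kuscia tool. It assumes that `docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}"` prints the memory field in the form `used / limit` with binary units such as `MiB` and `GiB`, and it reads the 6 GB minimum as 6 GiB. All function names are hypothetical.

```python
import re
import subprocess

# Assumption: the "6 GB" minimum mentioned above is interpreted as 6 GiB.
MIN_LIMIT_BYTES = 6 * 1024**3

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_SIZE_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([KMGT]iB|B)\s*$")


def parse_size(text: str) -> float:
    """Convert a size string such as '1.5GiB' into a number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def parse_mem_usage(field: str) -> tuple[float, float]:
    """Split a 'used / limit' field into (used_bytes, limit_bytes)."""
    used, limit = field.split("/")
    return parse_size(used), parse_size(limit)


def limit_is_sufficient(field: str, minimum: float = MIN_LIMIT_BYTES) -> bool:
    """Return True when the container memory limit is at least `minimum`."""
    _, limit = parse_mem_usage(field)
    return limit >= minimum


def check_containers() -> dict[str, bool]:
    """Run docker stats once and report, per container, whether the limit is enough."""
    output = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.MemUsage}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, usage = line.split("\t", 1)
        result[name] = limit_is_sufficient(usage)
    return result
```

Calling `check_containers()` on the host returns a mapping from container name to `True` or `False`. For a container reported as `False`, the limit can then be raised with docker update --memory as described in step 2.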
JiaIcecream commented 4 months ago

I checked, and the memory allocated to the Kuscia container was indeed insufficient. Thank you very much for your help.