secretflow / kuscia

Kuscia(Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Kuscia returns an error when querying job status #391

Open z00174311 opened 2 months ago

z00174311 commented 2 months ago

Issue Type

Api Usage

Search for existing issues similar to yours

Yes

Kuscia Version

0.7.0b0

Link to Relevant Documentation

No response

Question Details

After creating a job on the alice side, I called job query to check the job status. It returned failed, with the following error message:
"err_msg": "The remaining no-failed party task counts 1 are less than the threshold 2 that meets the conditions for task success. pending party[], running party[alice-partner], successful party[], failed party[bob-partner]",
How can I locate the cause of the failure on the bob side?
lanyy9527 commented 2 months ago

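Hello, the log you provided shows that the job failed because the task on the bob side failed. You can troubleshoot as follows:

  1. In the kuscia container on the bob side, run kubectl get kt job-name -n cross-domain -o yaml to view the error information for the task;
  2. Check whether the bob side has sufficient resources (memory, CPU, disk space);
  3. Use docker stats to check that the memory available to the bob-side kuscia container is greater than 6G; if it is not, use docker update --memory to raise the limit;
  4. Check whether the kuscia container holds a large number of pods in the Error state; such pods can consume resources and should be cleaned up promptly;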
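
The checks above can be sketched as a few shell helpers run on the bob host. This is a minimal sketch, not an official Kuscia tool: the container name `kuscia-bob` and the function names are assumptions for illustration, `job-name` is the placeholder from step 1, and the script assumes `kubectl` is reachable via `docker exec` in the kuscia container. It reads the configured memory limit with `docker inspect` (0 means no limit is set) and lists pods in the `Failed` phase, which is how errored pods usually show up.

```shell
#!/bin/sh
# Sketch only: CONTAINER and TASK are placeholders, substitute your own values.
CONTAINER="${CONTAINER:-kuscia-bob}"   # hypothetical container name
TASK="${TASK:-job-name}"               # task name placeholder from step 1
MIN_BYTES=$((6 * 1024 * 1024 * 1024))  # 6 GiB threshold from step 3

# Step 1: dump the task object, including its status and error messages.
show_task() {
  docker exec "$CONTAINER" kubectl get kt "$TASK" -n cross-domain -o yaml
}

# Succeeds if a memory limit given in bytes is above 6 GiB.
# Docker reports 0 when no limit is configured, which also passes.
limit_ok() {
  [ "$1" -eq 0 ] || [ "$1" -gt "$MIN_BYTES" ]
}

# Step 3: show current usage, then compare the configured limit to 6 GiB.
check_memory() {
  docker stats --no-stream "$CONTAINER"
  limit=$(docker inspect --format '{{.HostConfig.Memory}}' "$CONTAINER")
  limit_ok "$limit" ||
    echo "limit of $limit bytes is not above 6 GiB; consider docker update --memory"
}

# Step 4: list pods in the Failed phase across all namespaces.
list_failed_pods() {
  docker exec "$CONTAINER" kubectl get pods -A --field-selector=status.phase=Failed
}
```

After sourcing the file, call `show_task`, `check_memory`, and `list_failed_pods` in turn. Step 2 (host memory, CPU, disk) is left to the usual host tools such as `free -h` and `df -h`.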