secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Kuscia reports insufficient resources when launching a task, although the machine has enough resources #422

Closed john8628 closed 2 months ago

john8628 commented 2 months ago

Issue Type

Api Usage

Search for existing issues similar to yours

Yes

Kuscia Version

0.9.0b0

Link to Relevant Documentation

No response

Question Details


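Symptom: after running some performance tests in pad, clicking around the page became very sluggish. Clicking Private Set Intersection (PSI) then froze the front-end page, which never responded.

Troubleshooting: after running kubectl get kt {jobId} -n namespace, I found the following: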

      message: '0/1 nodes are available: waiting for task resource. preemption: 0/1
        nodes are available: 1 Preemption is not helpful for scheduling., can not
        find related task resource.'
      namespace: palmpay
      podName: yeha-ahwutzwj-node-35-0
      podPhase: Failed
      reason: Unschedulable
  reason: KusciaJobStopped
  serviceStatuses:
    palmpay/yeha-ahwutzwj-node-35-0-fed:
      createTime: "2024-09-02T05:43:35Z"
      namespace: palmpay
      portName: fed
      portNumber: 31508
      scope: Cluster
      serviceName: yeha-ahwutzwj-node-35-0-fed
    palmpay/yeha-ahwutzwj-node-35-0-global:
      createTime: "2024-09-02T05:43:37Z"
      namespace: palmpay
      portName: global
      portNumber: 31509
      scope: Domain
      serviceName: yeha-ahwutzwj-node-35-0-global
    palmpay/yeha-ahwutzwj-node-35-0-spu:
      createTime: "2024-09-02T05:43:35Z"
      namespace: palmpay
      portName: spu
      portNumber: 31507
      scope: Cluster
      serviceName: yeha-ahwutzwj-node-35-0-spu
  startTime: "2024-09-02T05:43:35Z"
···
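
A check like the one above can also be scripted. The following is a minimal Python sketch, not part of Kuscia. It assumes the task was dumped as JSON (for example by adding -o json to the kubectl command) and that the pod entries sit under status.podStatuses, keyed by namespace/podName. That key name and key format are assumptions made by analogy with the serviceStatuses block, since the parent key is cut off in the output above. The function name find_unschedulable_pods and the SAMPLE data are hypothetical; the sample only reuses values shown in the output.

```python
from __future__ import annotations


def find_unschedulable_pods(task: dict) -> list[dict]:
    """Return the pod entries whose reason is 'Unschedulable'."""
    status = task.get("status", {})
    # Assumed key name: the parent of these fields is elided in the issue output.
    pod_statuses = status.get("podStatuses", {})
    hits = []
    for key, pod in pod_statuses.items():
        if pod.get("reason") == "Unschedulable":
            hits.append({
                "pod": key,
                "phase": pod.get("podPhase"),
                # Collapse the line-wrapped YAML message into a single line.
                "message": " ".join(pod.get("message", "").split()),
            })
    return hits


# Hypothetical sample that mirrors the fields shown in the issue.
SAMPLE = {
    "status": {
        "reason": "KusciaJobStopped",
        "podStatuses": {
            "palmpay/yeha-ahwutzwj-node-35-0": {
                "message": "0/1 nodes are available: waiting for task resource. "
                           "preemption: 0/1 nodes are available: 1 Preemption is "
                           "not helpful for scheduling., can not find related "
                           "task resource.",
                "namespace": "palmpay",
                "podName": "yeha-ahwutzwj-node-35-0",
                "podPhase": "Failed",
                "reason": "Unschedulable",
            }
        },
    }
}

for hit in find_unschedulable_pods(SAMPLE):
    print(hit["pod"], hit["phase"], hit["message"], sep=" | ")
```

Note that the message here says the scheduler is waiting for a task resource, which is not necessarily the same as the machine running out of CPU or memory.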
1139763082 commented 2 months ago

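Please go through the following checks first:
1. Check the PSI task logs: run kubectl get pod xxx -o yaml, or look through the task logs under var/stdout/pod/ for other errors.
2. Check whether the nodes can communicate normally, using the command kubectl get cdr.
3. Check the Internal and external logs for other errors.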
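
The two kubectl checks above could be collected into a small helper, sketched here in Python under stated assumptions. The function name build_check_commands is hypothetical, and the optional -n namespace argument is an addition that does not appear in the original commands. The log checks are left out because the thread does not give exact file paths for them.

```python
from __future__ import annotations

import shlex


def build_check_commands(pod_name: str, namespace: str | None = None) -> list[list[str]]:
    """Build the kubectl commands for checks 1 and 2 as argument lists."""
    # Assumption: the pod may live in a domain namespace, so -n is optional.
    ns_args = ["-n", namespace] if namespace else []
    return [
        # Check 1: dump the task pod, including its status and events.
        ["kubectl", "get", "pod", pod_name, "-o", "yaml", *ns_args],
        # Check 2: list the routes used for node-to-node communication.
        ["kubectl", "get", "cdr"],
    ]


# Print the commands instead of running them, so they can be reviewed first.
for cmd in build_check_commands("xxx"):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run on a machine where kubectl is configured. Printing first keeps the sketch safe to run anywhere.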

john8628 commented 2 months ago

@1139763082 @zimu-yuxi Thanks for the support. So far it looks like this was caused by Kuscia's metadata migration.