tencentmusic / cube-studio

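cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. It supports SSO login, multi-tenancy, big-data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, inference serving with VGPU, edge computing, serverless, a labeling platform, automated labeling, dataset management, large-model fine-tuning, vllm large-model inference, llmops, private knowledge bases, and an AI model app store. It also supports one-click model development / inference / fine-tuning, Chinese-made cpu/gpu/npu chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.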

Services report errors after adding a GPU to the cluster #153

Open zzzzzzyzz opened 1 year ago

zzzzzzyzz commented 1 year ago

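I have a question. After I added a GPU to the cluster, there were no errors at first, but a few hours later the cattle-cluster-agent, canal, coredns, and kubeflow-prometheus-adapter services started restarting and updating over and over. From what I can see, the messages fall into roughly the following three kinds. How can this be resolved?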
Readiness probe failed: Get http://10.42.0.14:9090/-/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Liveness probe failed: Get http://10.42.0.2:8080/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 210.28.18.30 210.28.16.26 210.28.18.26
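
The third message is emitted by the kubelet when the resolv.conf it reads lists more nameservers than it will apply (three, matching the glibc resolver limit). As a minimal sketch for checking that on a node, the snippet below counts the `nameserver` entries in resolv.conf-style text and splits them into applied and omitted. The function names and the fourth address in the sample are hypothetical and are not taken from the report; on a real node you would pass in the contents of `/etc/resolv.conf` instead.

```python
# Sketch: find which nameserver entries exceed the three-entry limit.
# Assumption: the fourth address in SAMPLE is made up for illustration.

MAX_NAMESERVERS = 3  # limit applied by the kubelet (and by glibc's resolver)


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in file order, skipping comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        parts = line.split()
        if parts[0] == "nameserver" and len(parts) >= 2:
            servers.append(parts[1])
    return servers


def split_applied_and_omitted(servers, limit=MAX_NAMESERVERS):
    """Split the list into the entries that fit the limit and the rest."""
    return servers[:limit], servers[limit:]


SAMPLE = """\
# sample resolv.conf content
nameserver 210.28.18.30
nameserver 210.28.16.26
nameserver 210.28.18.26
nameserver 192.0.2.53
search example.internal
"""

if __name__ == "__main__":
    applied, omitted = split_applied_and_omitted(parse_nameservers(SAMPLE))
    print("applied:", " ".join(applied))
    print("omitted:", " ".join(omitted) if omitted else "(none)")
```

If the omitted list is non-empty, trimming the node's resolv.conf to three entries should make this particular message go away; it says nothing by itself about the probe timeouts.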

zzzzzzyzz commented 1 year ago

To add more detail: once the single-node cluster is fully configured, after a while many services in rancher show "Deployment does not have minimum availability." The logs then show Readiness probe failed and Liveness probe failed errors, for example:
Readiness probe failed: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Is this caused by insufficient resources, or by something else?
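
To see whether the failures cluster on particular endpoints or are all timeouts, the event messages quoted in this thread can be tallied with a small script. This is only a triage sketch under the assumption that the messages follow the format pasted above; the function name and the regular expression are illustrative, and the script does not determine the root cause.

```python
# Sketch: classify probe-failure messages by probe type, target, and timeout.
# Assumption: messages look like the ones quoted in this thread.
import re
from collections import Counter
from urllib.parse import urlsplit

PROBE_RE = re.compile(r"^(Readiness|Liveness) probe failed: Get (\S+?): (.*)$")


def parse_probe_failure(message):
    """Return (probe_type, host:port, is_timeout), or None if not a probe failure."""
    match = PROBE_RE.match(message.strip())
    if match is None:
        return None
    probe_type, url, reason = match.groups()
    is_timeout = "Client.Timeout exceeded" in reason
    return probe_type, urlsplit(url).netloc, is_timeout


MESSAGES = [
    "Readiness probe failed: Get http://10.42.0.14:9090/-/ready: net/http: "
    "request canceled while waiting for connection "
    "(Client.Timeout exceeded while awaiting headers)",
    "Liveness probe failed: Get http://10.42.0.2:8080/health: net/http: "
    "request canceled while waiting for connection "
    "(Client.Timeout exceeded while awaiting headers)",
    "Readiness probe failed: Get http://localhost:9099/readiness: net/http: "
    "request canceled while waiting for connection "
    "(Client.Timeout exceeded while awaiting headers)",
    "Nameserver limits were exceeded, some nameservers have been omitted, "
    "the applied nameserver line is: 210.28.18.30 210.28.16.26 210.28.18.26",
]

if __name__ == "__main__":
    parsed = [p for p in map(parse_probe_failure, MESSAGES) if p is not None]
    for (probe_type, target, is_timeout), count in Counter(parsed).items():
        kind = "timeout" if is_timeout else "other"
        print(f"{probe_type:9s} {target:20s} {kind:8s} x{count}")
```

All three probe failures quoted here are client-side timeouts, meaning the endpoint did not answer within the probe's timeout rather than returning an error status.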