Services report errors after GPUs are added to the cluster

tencentmusic / cube-studio

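cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. It supports SSO login, multi-tenancy, big data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, inference services with VGPU, edge computing, serverless, an annotation platform, automated annotation, dataset management, large-model fine-tuning, vllm large-model inference, llmops, private knowledge bases, and an AI model app store. It also supports one-click model development/inference/fine-tuning, domestic (Chinese) cpu/gpu/npu chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.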


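I have a question. After I added GPUs to the cluster, there were no errors at first, but after a few hours the cattle-cluster-agent, canal, coredns, and kubeflow-prometheus-adapter services started restarting over and over. From what I can see, the messages fall roughly into the following three types. How should this kind of problem be resolved?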
Readiness probe failed: Get http://10.42.0.14:9090/-/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Liveness probe failed: Get http://10.42.0.2:8080/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 210.28.18.30 210.28.16.26 210.28.18.26
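
For context on the third message: the kubelet applies at most three nameservers from the node's resolv.conf and omits the rest, which is why the applied line above lists exactly three addresses. The following is only a minimal sketch for checking which entries on a node would be kept and which dropped, assuming the first three entries are the ones kept. The function name `split_nameservers` and the fourth address in the sample are hypothetical, and the sketch says nothing about the probe timeouts in the first two messages.

```python
MAX_NAMESERVERS = 3  # limit the kubelet applies to nameserver entries


def split_nameservers(resolv_conf_text, limit=MAX_NAMESERVERS):
    """Return (applied, omitted) nameserver lists parsed from resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only lines of the form "nameserver <address>" are relevant here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:limit], servers[limit:]


# Hypothetical resolv.conf content: the first three addresses come from the
# log line above; the fourth (192.0.2.1) is a made-up placeholder.
sample = """\
# example resolv.conf
nameserver 210.28.18.30
nameserver 210.28.16.26
nameserver 210.28.18.26
nameserver 192.0.2.1
"""

applied, omitted = split_nameservers(sample)
print("applied:", applied)
print("omitted:", omitted)
```

On a real node, the same function could be given the contents of `/etc/resolv.conf` (or whichever file the kubelet is configured to read) instead of the sample string.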

Services report errors after GPUs are added to the cluster #153