kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

"arena top job" couldn't detect metrics. #108

Open soolaugust opened 5 years ago

soolaugust commented 5 years ago

I followed the guide: arena/docs/userguide/9-top-job-gpu-metric.md.

Everything works as expected until the last step. When I submit the tfjob and use "arena top job" to check it, the result looks like this:

ERRO[0000] gpu metric is not exist in prometheus for query  {__name__=~"nvidia_gpu_duty_cycle|nvidia_gpu_memory_used_bytes|nvidia_gpu_memory_total_bytes", pod_name=~""}
INSTANCE NAME  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)  STATUS  NODE
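
One way to check whether Prometheus holds these series at all is to run the same selector against its HTTP API. The following is a minimal sketch, not what arena itself does: the base URL `http://localhost:9090` and the helper names are assumptions for illustration. Note that the `pod_name=~""` matcher in the error is empty, which suggests no pod names were passed into the query.

```python
import json
import urllib.parse
import urllib.request

# Metric names copied from the error message above.
GPU_METRICS = (
    "nvidia_gpu_duty_cycle",
    "nvidia_gpu_memory_used_bytes",
    "nvidia_gpu_memory_total_bytes",
)


def build_selector(pod_names):
    """Rebuild the selector from the error message for the given pod names."""
    names = "|".join(GPU_METRICS)
    pods = "|".join(pod_names)
    return '{__name__=~"%s", pod_name=~"%s"}' % (names, pods)


def build_query_url(base_url, pod_names):
    """Build an instant-query URL for Prometheus' /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": build_selector(pod_names)})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def fetch_series(base_url, pod_names, timeout=5.0):
    """Return the matching series; an empty list means no GPU metrics."""
    url = build_query_url(base_url, pod_names)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    return body.get("data", {}).get("result", [])
```

For example, `fetch_series("http://localhost:9090", ["my-tfjob-worker-0"])` (hypothetical pod name) returning an empty list would mean the exporter is not publishing GPU metrics for that pod.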

soolaugust commented 5 years ago

I finally found what causes this:

  1. Kubeadm has disabled the kubelet read-only port 10255 by default since 1.11 (refer to "kubeadm: Improve the kubelet default configuration security-wise"), so cadvisor couldn't collect metrics through port 10255.

To fix this, edit the kubelet config file /var/lib/kubelet/config.yaml, add readOnlyPort: 10255, and restart the kubelet: systemctl daemon-reload && systemctl restart kubelet.

  2. After fixing the first problem, I found that the list of pods came back empty. This is fixed by "Fix for top job metric and list (#106)".

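
The config change for the first problem can be scripted. This is a minimal sketch that only handles a top-level `readOnlyPort` key in the YAML text; the function name `ensure_read_only_port` is hypothetical, and it edits the text with a regular expression rather than parsing YAML.

```python
import re


def ensure_read_only_port(config_text, port=10255):
    """Return config_text with a top-level readOnlyPort set to port.

    An existing top-level readOnlyPort line is replaced; otherwise the
    setting is appended at the end of the file.
    """
    line = "readOnlyPort: %d" % port
    pattern = re.compile(r"^readOnlyPort:.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(line, config_text, count=1)
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"
```

The result would be written back to /var/lib/kubelet/config.yaml, followed by the kubelet restart described above. Keep in mind that kubeadm turned this port off for security reasons, so opening it again is a trade-off.
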
soolaugust commented 5 years ago

@cheyang

I think the guide should include a note about the first problem, because the current "Monitor GPUs of the training job" section does not work with kubeadm 1.11 and later.

soolaugust commented 5 years ago

/assign cheyang

cheyang commented 5 years ago

@xiaozhouX please take a look.

xiaozhouX commented 5 years ago

Thanks for your feedback! In the GPU exporter, we call port 10255 on the node to get the node's GPU allocation for pods. I'm thinking about replacing that by reading the device plugin's checkpoint file (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint). Another way is to read the cgroup devices.list file (which is what cadvisor did before). These two approaches behave differently when there are hostIPC Pods. What's your suggestion? @cheyang @soolaugust
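
The checkpoint-file option could look roughly like the sketch below. It is not the exporter's code. It assumes the checkpoint is JSON with a `Data.PodDeviceEntries` list whose entries carry `PodUID`, `ResourceName` and `DeviceIDs`; this is an internal kubelet format that may differ between versions, so check a real file first. The function name `gpu_devices_by_pod` and the default resource name `nvidia.com/gpu` are also assumptions.

```python
import json
from collections import defaultdict


def gpu_devices_by_pod(checkpoint_text, resource_name="nvidia.com/gpu"):
    """Map pod UID -> sorted device IDs allocated for resource_name.

    Assumes the JSON layout described above; entries for other
    resources are ignored.
    """
    data = json.loads(checkpoint_text).get("Data", {})
    allocations = defaultdict(set)
    for entry in data.get("PodDeviceEntries") or []:
        if entry.get("ResourceName") != resource_name:
            continue
        device_ids = entry.get("DeviceIDs") or []
        if isinstance(device_ids, dict):
            # Assumed alternative layout: NUMA node -> list of device IDs.
            device_ids = [d for ids in device_ids.values() for d in ids]
        allocations[entry["PodUID"]].update(device_ids)
    return {uid: sorted(ids) for uid, ids in allocations.items()}
```

Reading the file at /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint and passing its contents to this function would give the per-pod GPU allocation without touching port 10255. The pod UIDs would still need to be mapped to pod names by some other means.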

soolaugust commented 5 years ago

I am not familiar with these two approaches. Is there any reference? I'd like to dig into them.

yeya24 commented 5 years ago

So any progress here? Same issue here. @xiaozhouX

xiaozhouX commented 5 years ago

> So any progress here? Same issue here. @xiaozhouX

For now, the only workaround is to open the kubelet's port 10255. We will solve this as soon as possible.