volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.11k stars 949 forks

the headless svc of vcjob created later than pods #3753

Open tedli opened 2 days ago

tedli commented 2 days ago

Description

  1. Create a vcjob.
  2. Set the container entrypoint of the replicas to `nslookup <hostname>.<jobname> && sleep infinity`.
  3. The nslookup may fail because, at the time the pod starts, the headless svc has not yet been created.

Steps to reproduce the issue

See steps 1 to 3 in the Description above.

Describe the results you received and expected

Expected: the vcjob controller creates the headless svc of the vcjob, watches and waits until it is ready, and only then creates the replica pods of the job.

What version of Volcano are you using?

v1.10.0

Any other relevant information

No response

Monokaix commented 1 day ago

Hi, do you mean the headless svc should be created after the pods are ready?

tedli commented 1 day ago

Hi @Monokaix, thanks for the reply. No, exactly the opposite. My problem is that when the pods start, the headless svc may occasionally not have been created yet, so something like dig or nslookup run inside a pod to find the IPs of the replicas will fail. My kubectl output looks like this:


NAME    AGE
vcjob   12s    <-- svc

NAME                   AGE
vcjob-worker-0         46s    <-- pod

When the output looks like this, the headless svc was created after the pod, so the nslookup of the name `vcjob-worker-0.vcjob` fails.

Monokaix commented 1 day ago

OK, but if the svc is created first and the pod is not ready, won't resolving the svc name also fail because the pod IP is not ready yet?

tedli commented 1 day ago

Hi @Monokaix, thanks for the reply. Of course, looking up the svc won't return the IP of an unready pod. But deciding when a pod is ready is up to the user: the user can adjust the health check logic with the pod's probes to make sure it is ready. However, the user has no control over when Volcano creates the headless svc. Most of the time, looking up the headless svc works as expected; occasionally the lookup fails because the svc had not been created at that time. It would also be convenient to add an extra field to the vcjob or the plugin that lets the vcjob controller create the headless svc with a `service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"` annotation. After all, the headless svc is used to discover the other replica instances, not for service consumption.

Monokaix commented 1 day ago

So if we just create the svc first, before the pods are created, would that solve your problem?

tedli commented 1 day ago

Hi @Monokaix, thanks for the reply. Yes.