uber / fiber

Distributed Computing for AI Made Simple
https://uber.github.io/fiber/
Apache License 2.0

Number of pending pool workers #54

Open prsasatt opened 3 years ago

prsasatt commented 3 years ago

I am trying to run the example pi estimation job on Azure AKS. Everything works except that when I launch the job with 20 processes on a 4-node AKS cluster, only 4 pool workers are Running at any given time while the rest are Pending. Why is Fiber running only one pool worker per node? How can I get all the poolworker containers to run?

something@something:~/something$ kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
fiber-pi-estimation-55jvn   1/1     Running   0          10m
poolworker-1-b9bebeb5       1/1     Running   0          10m
poolworker-10-28cbe01d      0/1     Pending   0          10m
poolworker-11-f8719d87      0/1     Pending   0          10m
poolworker-12-3e3f1b36      0/1     Pending   0          10m
poolworker-13-1aa26fa0      0/1     Pending   0          10m
poolworker-14-6beaac80      0/1     Pending   0          10m
poolworker-15-09cde185      1/1     Running   0          10m
poolworker-16-3cdce425      0/1     Pending   0          10m
poolworker-17-298e0b8e      0/1     Pending   0          10m
poolworker-18-183c14b9      0/1     Pending   0          10m
poolworker-19-c5f2bac8      0/1     Pending   0          10m
poolworker-2-0d8b15ce       1/1     Running   0          10m
poolworker-20-49439e6d      0/1     Pending   0          10m
poolworker-3-2cada42b       0/1     Pending   0          10m
poolworker-4-55c718cc       0/1     Pending   0          10m
poolworker-5-7f76632b       0/1     Pending   0          10m
poolworker-6-35c68bd8       0/1     Pending   0          10m
poolworker-7-84171445       1/1     Running   0          10m
poolworker-8-c6d6d9ac       0/1     Pending   0          10m
poolworker-9-81ac471d       0/1     Pending   0          10m

calio commented 3 years ago

Hi @prsasatt, scheduling is done by Kubernetes. Fiber starts 20 jobs and Kubernetes decides where to run them. Most likely the cluster did not have enough resources for all of them, so the remaining jobs stay Pending. If limited resources are not the cause, try running a long-running computation and wait for all the poolworkers to come up.
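To illustrate the resource explanation, here is a minimal sketch of how CPU requests alone can cap the number of Running pods. It is a simplified model, not the Kubernetes scheduler: the function names are made up for this example, and the node sizes and per-pod CPU requests are hypothetical numbers, not values taken from the cluster in this issue. It also ignores the master pod, memory, taints, and affinity rules, all of which the real scheduler takes into account.

```python
def workers_that_fit(allocatable_millicpu, request_millicpu):
    """Number of pods with the given CPU request that fit on one node.

    Only CPU requests are considered; this is a deliberate simplification.
    """
    if request_millicpu <= 0:
        raise ValueError("request_millicpu must be positive")
    return max(allocatable_millicpu, 0) // request_millicpu


def schedule(num_workers, nodes_allocatable, request_millicpu):
    """Return (running, pending) counts under the fit-by-CPU-request model."""
    capacity = sum(
        workers_that_fit(node, request_millicpu) for node in nodes_allocatable
    )
    running = min(num_workers, capacity)
    return running, num_workers - running


# Hypothetical cluster: 4 nodes, each with 1900 millicores allocatable.
nodes = [1900, 1900, 1900, 1900]

# If each pool worker requests a full core, only one fits per node.
print(schedule(20, nodes, request_millicpu=1000))  # (4, 16)

# With a smaller request, all 20 workers fit.
print(schedule(20, nodes, request_millicpu=250))   # (20, 0)
```

On a real cluster, `kubectl describe pod <pod-name>` lists the scheduling events for a Pending pod, which usually state which resource was insufficient.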