uwsampl / nexus

frontend and scheduler error #19

Closed Told closed 4 years ago

Told commented 4 years ago

Hi, abcdabcd987! I followed the steps in https://github.com/uwsampl/nexus/blob/master/examples/README.md and got errors from both the frontend and the scheduler.

Frontend error log: Load model error: NOT_ENOUGH_BACKENDS (screenshot: frontend-error)

Scheduler error log: failed to connect to all addresses (screenshot: scheduler-error)

The backend runs normally (screenshot: backend-log).

What did I do wrong?

abcdabcd987 commented 4 years ago

Thanks for giving Nexus a try!

From the screenshot, I found a few possible reasons why this happens:

  1. Did you start the frontend before the backends? Frontends need to be started after backends are ready, otherwise it'll complain about not enough backends.
  2. The scheduler log shows that it can't find the profile. Have you profiled the model on the GPU? If you did profile, try ls $MODEL_DIR/profiles/ and see if the profile is actually inside the directory or subdirectories. Also a reminder that the profile is per GPU card, not per GPU kind. I see you have two GPUs, perhaps you just profiled one?
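
The two checks above can be sketched as a small shell helper. This is only an illustration, not part of Nexus: it assumes profiles are stored as regular files somewhere under $MODEL_DIR/profiles/ (the exact layout is not stated in this thread), and the function names and the container names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: sanity checks to run before starting a Nexus frontend.
# Assumption: profiles are regular files somewhere under <model_dir>/profiles/.

# Print the number of regular files under <model_dir>/profiles,
# including subdirectories. Prints 0 if the directory does not exist.
count_profiles() {
    profiles_dir="$1/profiles"
    if [ ! -d "$profiles_dir" ]; then
        echo 0
        return 0
    fi
    find "$profiles_dir" -type f | wc -l | tr -d ' '
}

# Succeed only if every named Docker container is currently running.
# The container names are whatever was passed to `docker run --name=...`.
containers_running() {
    for name in "$@"; do
        state=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null) || state=""
        [ "$state" = "true" ] || return 1
    done
    return 0
}
```

For example, `count_profiles "$MODEL_DIR"` should print at least one file per profiled GPU, and `containers_running my-backend-0 my-backend-1` (hypothetical names) can gate the frontend's `docker run`. A running container only shows that the backend process was started; whether it has registered with the scheduler still has to be confirmed in the scheduler log.
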
Told commented 4 years ago

Hi @abcdabcd987, following your advice, I deployed Nexus again. First, I started two profiler processes, one on GPU 0 and one on GPU 1. The output of ls $MODEL_DIR/profiles/ looks like this: (screenshot: profiler)

Then I deployed two backend processes on GPU 0 and GPU 1, and both registered with the scheduler: (screenshots: backend-0, backend-1, scheduler-reg)

But when I finally started the frontend process, I still got the same NOT_ENOUGH_BACKENDS error: (screenshot: frontend-error-1)

abcdabcd987 commented 4 years ago

I just realized that the instructions here are slightly outdated. You also need to specify the image width and height when starting the frontend, like this:

docker run -it --rm --gpus all --network=nexus-net --name=nexus-simple-frontend -p=9001 -p=9002 abcdabcd987/nexus \
    /nexus/build/simple -framework=tensorflow -model=resnet_0 -latency=50 -width=224 -height=224 -alsologtostderr -colorlogtostderr \
                        -sch_addr=nexus-scheduler:10001
Told commented 4 years ago

IT WORKS! Thanks. Please update the example README.