globus-labs / FLoX-prototype

Python library for serverless Federated Learning experiments.
Apache License 2.0

when executor_type="funcx" failing with ValueError: The tasks queue is empty, no tasks were submitted for training! #39

Open vinaBira opened 1 year ago

vinaBira commented 1 year ago

Getting an error on the quickstart_pytorch.py tutorial with executor_type="funcx":

    Traceback (most recent call last):
      File "flox/examples/quickstart_pytorch/quickstart_pytorch.py", line 130, in <module>
        main()
      File "flox/examples/quickstart_pytorch/quickstart_pytorch.py", line 126, in main
        flox_controller.run_federated_learning()
      File "/home/edg4/FLoX/flox/controllers/MainController.py", line 563, in run_federated_learning
        tasks = self.on_model_broadcast()
      File "/home/edg4/FLoX/flox/controllers/MainController.py", line 371, in on_model_broadcast
        raise ValueError(
    ValueError: The tasks queue is empty, no tasks were submitted for training!

nikita-kotsehub commented 1 year ago

@vinaBira can you check if your endpoints are active? Your log console should print out the status of each endpoint. If all of them are offline, then no tasks were submitted for training, and therefore the loop terminated with the ValueError.

vinaBira commented 1 year ago

@nikita-kotsehub I am using edge devices and yes, they are active. Is there a particular state the client machines need to be in?

vinaBira commented 1 year ago

@nikita-kotsehub Please refer to the attached screenshots and correct me if the configuration is wrong at any point.

[Screenshot: Screen Shot 2023-07-18 at 12 58 18 PM]
[Screenshot: Screen Shot 2023-07-18 at 12 58 04 PM]

nikita-kotsehub commented 1 year ago

@vinaBira try running simple funcX tasks on those endpoints before trying out flox. You can find tutorials for simple funcX tasks here: https://funcx.org/.
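
A minimal smoke test along those lines might look like the sketch below. It assumes a funcx SDK version (1.0 or later) that provides FuncXExecutor, and that you have already authenticated. The helper names platform_info and smoke_test are made up for illustration, and the endpoint ID is whatever UUID your own endpoint reports.

```python
def platform_info():
    """Trivial task to run remotely.

    The import lives inside the function because funcX ships the function
    to the endpoint, so it should not rely on module-level imports.
    """
    import platform

    return {"node": platform.node(), "python": platform.python_version()}


def smoke_test(endpoint_id, timeout=60):
    """Submit platform_info to one endpoint and wait for the result.

    Assumes the funcx SDK is installed; it is imported here so that the
    rest of the file still loads on machines without it.
    """
    from funcx import FuncXExecutor

    with FuncXExecutor(endpoint_id=endpoint_id) as fxe:
        future = fxe.submit(platform_info)
        # Raises if the task fails or no result arrives within `timeout` seconds.
        return future.result(timeout=timeout)
```

Calling smoke_test("<your-endpoint-uuid>") once per edge device should return that device's hostname and Python version. If it hangs or raises, the problem is most likely on the funcX side rather than in flox.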

If that works, then the problem is that tasks are not being submitted to the endpoints. I'd suggest looking through lines 311-373 of flox/controllers/MainController.py and inserting logger or print statements to identify where the failure occurs.
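
As a rough illustration of the kind of logging meant here, the sketch below logs every decision in a submission loop. It is not the actual code in MainController.py: the function name submit_to_online_endpoints, the endpoint_statuses mapping, the submit_fn callable, and the "online" status string are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("flox.debug")


def submit_to_online_endpoints(endpoint_statuses, submit_fn):
    """Submit one task per online endpoint, logging every decision.

    endpoint_statuses: hypothetical dict mapping endpoint ID -> status string.
    submit_fn: hypothetical callable that takes an endpoint ID and returns a task.
    """
    tasks = []
    for endpoint_id, status in endpoint_statuses.items():
        logger.info("endpoint %s reported status %r", endpoint_id, status)
        if status != "online":  # assumed status value, check what yours report
            logger.warning("skipping endpoint %s: not online", endpoint_id)
            continue
        try:
            tasks.append(submit_fn(endpoint_id))
        except Exception:
            # Keep going so one bad endpoint does not hide the others.
            logger.exception("submission to endpoint %s failed", endpoint_id)
    logger.info("submitted %d of %d tasks", len(tasks), len(endpoint_statuses))
    if not tasks:
        raise ValueError(
            "The tasks queue is empty, no tasks were submitted for training!"
        )
    return tasks
```

With logging.basicConfig(level=logging.INFO) set at the top of the script, the output shows whether endpoints were skipped because of their reported status or because the submission itself raised.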

Also, did you try running the examples under flox/examples? If not, reading instructions for setup might help.