Open vinaBira opened 1 year ago
@vinaBira can you check if your endpoints are active? Your log console should print out the status of each endpoint. If all of them are offline, then no tasks were submitted for training, and therefore the loop terminated with the ValueError.
@nikita-kotsehub I am using edge devices, and yes, they are active. Is there a particular state of the client machines that we are looking for?
@nikita-kotsehub Please refer to the attached screenshots and correct me if the configuration is wrong at any point.
@vinaBira try to run simple funcX tasks on those endpoints before trying out flox. You can find tutorials for simple funcX tasks here: https://funcx.org/.
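A minimal smoke test for one endpoint could look like the sketch below. It assumes the funcx SDK is installed and you are already authenticated. The names `hello_world` and `smoke_test` are hypothetical helpers, not part of funcX or flox. The client calls used (`get_endpoint_status`, `register_function`, `run`, `get_result`) are from the funcX SDK as I know it, so check them against the version you have installed.

```python
import time


def hello_world():
    """Trivial function to execute on the remote endpoint."""
    return "Hello World!"


def smoke_test(endpoint_id, timeout_s=60.0, poll_s=2.0):
    """Run hello_world on one endpoint and return its result.

    Hypothetical helper: assumes the funcx SDK is installed and that
    authentication has already been completed on this machine.
    """
    # Imported lazily so this module still loads where funcx is missing.
    from funcx import FuncXClient

    fxc = FuncXClient()
    # Status as reported by the funcX service for this endpoint.
    print("endpoint status:", fxc.get_endpoint_status(endpoint_id))

    func_id = fxc.register_function(hello_world)
    task_id = fxc.run(endpoint_id=endpoint_id, function_id=func_id)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return fxc.get_result(task_id)
        except Exception as exc:
            # get_result raises while the task is still pending. A failed
            # task also raises, so read the printed message carefully.
            print("waiting:", exc)
            time.sleep(poll_s)
    raise TimeoutError(f"no result from endpoint {endpoint_id} within {timeout_s}s")
```

Call `smoke_test("<your-endpoint-uuid>")` once per edge device. If it returns "Hello World!" for each of them, the endpoints themselves are reachable and the problem is more likely on the flox side.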
If you succeed in that, then the issue is that tasks are not being submitted to the endpoints. I'd suggest you look through lines 311-373 of flox/controllers/MainController.py and insert logger or print statements to identify the point at which the failure occurs.
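As a rough illustration of the kind of logging meant here, the sketch below mimics the general shape of the logic: filter endpoints by status, submit one task per online endpoint, and raise if nothing was submitted. This is not the actual MainController code. The function names, the `submit` callable, and the "online" status string are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("flox.debug")


def select_online_endpoints(statuses):
    """Return the endpoint ids whose status is 'online', logging each decision.

    `statuses` is a hypothetical dict mapping endpoint id to a status string.
    """
    online = []
    for endpoint_id, status in statuses.items():
        logger.info("endpoint %s reported status %r", endpoint_id, status)
        if status == "online":
            online.append(endpoint_id)
        else:
            logger.warning("skipping endpoint %s", endpoint_id)
    return online


def submit_tasks(statuses, submit):
    """Submit one task per online endpoint via the hypothetical `submit` callable."""
    tasks = []
    for endpoint_id in select_online_endpoints(statuses):
        task = submit(endpoint_id)
        logger.info("submitted task %r to endpoint %s", task, endpoint_id)
        tasks.append(task)
    if not tasks:
        # Same message as the error reported in this issue.
        raise ValueError(
            "The tasks queue is empty, no tasks were submitted for training!"
        )
    return tasks
```

With `logging.basicConfig(level=logging.INFO)` set at the top of your script, the log shows whether every endpoint was skipped before submission or whether the submission step itself produced nothing.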
Also, did you try running the examples under flox/examples? If not, reading the setup instructions might help.
Getting error on quickstart_pytorch.py tutorial with executor_type="funcx":
Traceback (most recent call last):
  File "flox/examples/quickstart_pytorch/quickstart_pytorch.py", line 130, in <module>
    main()
  File "flox/examples/quickstart_pytorch/quickstart_pytorch.py", line 126, in main
    flox_controller.run_federated_learning()
  File "/home/edg4/FLoX/flox/controllers/MainController.py", line 563, in run_federated_learning
    tasks = self.on_model_broadcast()
  File "/home/edg4/FLoX/flox/controllers/MainController.py", line 371, in on_model_broadcast
    raise ValueError(
ValueError: The tasks queue is empty, no tasks were submitted for training!