mc2-project / federated-xgboost

Federated gradient boosted decision tree learning

Getting stuck after confirm joining federated training session #31

Open luckystarufo opened 1 year ago

luckystarufo commented 1 year ago

Hello there,

I am trying to run the code on three different machines on the same network. The three machines can communicate with each other, but the training never starts.
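One way to confirm this kind of connectivity at the port level is sketched below. It is a minimal sketch, not the check used for this report: the helper names `port_is_reachable` and `check_hosts` are hypothetical, and the hosts and port to test are assumptions that must come from your own setup. It uses only the Python standard library.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port, timeout: float = 3.0):
    """Map each host to whether the given TCP port accepted a connection."""
    return {host: port_is_reachable(host, port, timeout) for host in hosts}
```

Calling `check_hosts` from each machine with the other machines' addresses and the port the training session listens on shows whether that specific port accepts connections, which a successful ping alone does not guarantee.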

After digging a little, I found that the code gets stuck in the run() function in federated-xgboost/dmlc-core/tracker/dmlc_tracker/rpc.py:

[Screenshot: Screen Shot 2022-10-28 at 5 00 47 PM]

Here's what I saw on the server side:

[Screenshot: Screen Shot 2022-10-28 at 5 02 13 PM]

And here is what I saw on the two clients:

[Screenshot: Screen Shot 2022-10-28 at 5 02 52 PM]

[Screenshot: Screen Shot 2022-10-28 at 5 03 02 PM]

Any suggestions? Thanks!