mc2-project / federated-xgboost

Federated gradient boosted decision tree learning

failed in ReconnectLink #11

Closed: zhangjinyiyi closed this issue 4 years ago

zhangjinyiyi commented 4 years ago

I am working on federated xgboost, but I failed to run the basic demo in a virtual cluster built with VMware and Ubuntu.

When the aggregator initiates the training, the participant parties generate errors like "failed in ReconnectLink". When I tried with only one participant party, everything worked fine.

The full log is shown below:

Starting RPC server on port  50051
Request from aggregator [ipv4:192.168.140.12:55970] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
failed in ReconnectLink [15:15:16] /home/yi/federated-xgboost/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: NET - Connection was reset by peer
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(+0x2e3b47) [0x7fddfe82bb47]
  [bt] (1) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(+0x2e4147) [0x7fddfe82c147]
  [bt] (2) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0x15e1) [0x7fddfe829501]
  [bt] (3) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::Init(int, char**)+0x317) [0x7fddfe82a0b7]
  [bt] (4) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0xe) [0x7fddfe8304de]
  [bt] (5) /usr/local/lib/python3.6/dist-packages/federatedxgboost-0.90-py3.6.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::Init(int, char**)+0x438) [0x7fddfe82c768]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fde0c9d1e40]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fde0c9d18ab]
  [bt] (8) /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2cf) [0x7fde0cc2598f]

Socket RecvAll Error:Connection reset by peer, shutting down process

podcastinator commented 4 years ago

It seems the sample certificates in the demo/basic/certs folder had expired (these certificates are used for authentication during TLS setup). I just updated the certificates on this branch: https://github.com/mc2-project/federated-xgboost/tree/cert-patch Could you please check out this branch and try the demo again?
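
For anyone who hits the same "Connection was reset by peer" error, a quick way to test the expired-certificate theory is to read each certificate's end date. The following is a minimal sketch, not part of federated-xgboost: it assumes the `openssl` command-line tool is installed and that the certificates are PEM files whose paths you pass on the command line. The helper names (`parse_not_after`, `is_expired`, `cert_end_line`) are hypothetical.

```python
import ssl
import subprocess
import sys
import time


def parse_not_after(openssl_line):
    """Convert a line such as 'notAfter=Jan  5 09:34:43 2018 GMT'
    (the output format of `openssl x509 -enddate`) to epoch seconds."""
    _, _, date_text = openssl_line.strip().partition("=")
    return ssl.cert_time_to_seconds(date_text)


def is_expired(not_after_seconds, now=None):
    """Return True if the certificate end date is at or before `now`."""
    if now is None:
        now = time.time()
    return now >= not_after_seconds


def cert_end_line(pem_path):
    """Ask the openssl CLI (assumed to be on PATH) for the end date."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", pem_path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python check_certs.py <cert1.pem> <cert2.pem> ...
    for path in sys.argv[1:]:
        end = parse_not_after(cert_end_line(path))
        status = "EXPIRED" if is_expired(end) else "valid"
        print(path, status)
```

Running this against the certificate files in demo/basic/certs on every party should show whether any of them is past its end date. Note that an expired certificate is only one possible cause of a reset during TLS setup.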

chester-leung commented 4 years ago

I've merged in the corresponding fix (#12), so @zhangjinyiyi feel free to pull from master and try again!

zhangjinyiyi commented 4 years ago

I updated the certificates, and now the demo works fine. Thank you very much for your help, @podcastinator and @chester-leung.