mc2-project / federated-xgboost

Federated gradient boosted decision tree learning

"failed in ReconnectLink" error even after regenerating certificates #32

Closed: MahmoudHamdy02 closed this 1 year ago

MahmoudHamdy02 commented 1 year ago

I want to train on the demo dataset locally with 2 parties, but I get a "failed in ReconnectLink" error. Other issues mention that regenerating the certificates fixes this, but the error persists even after I ran both bash scripts multiple times.

I have two copies of the federated-xgboost folder: one starts the first party, and the other starts the second party and the aggregator.

The hosts.config file contains:

localhost:50051
localhost:50050
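
Before looking at certificates, it can help to rule out a plain addressing problem. The following is a minimal sketch, assuming hosts.config holds one host:port entry per line as shown above. The helper names parse_hosts and is_listening are hypothetical and not part of federated-xgboost.

```python
import socket
from typing import List, Tuple


def parse_hosts(text: str) -> List[Tuple[str, int]]:
    """Parse 'host:port' lines, skipping blank lines.

    Raises ValueError on a malformed or duplicate entry.
    """
    entries: List[Tuple[str, int]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        host, sep, port = line.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"line {lineno}: expected host:port, got {line!r}")
        entry = (host, int(port))
        if entry in entries:
            raise ValueError(f"line {lineno}: duplicate entry {line!r}")
        entries.append(entry)
    return entries


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Contents copied from the hosts.config shown in this issue.
    for host, port in parse_hosts("localhost:50051\nlocalhost:50050\n"):
        state = "listening" if is_listening(host, port) else "not reachable"
        print(f"{host}:{port} {state}")
```

If both entries report as listening while both serve.py processes are running, the ports are fine and the failure is more likely on the TLS side.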

I run python3 serve.py from inside the demo/basic folder of each copy (I changed the port to 50050 in the second folder).

I run this aggregator command from the root federated-xgboost2 folder: dmlc-core/tracker/dmlc-submit --cluster rpc --num-workers 2 --host-file demo/basic/hosts.config --worker-memory 4g demo.py

Error from federated-xgboost2:

mahmoud@mahmoud-PC:~/work/federated-xgboost2/demo/basic$ python3 serve.py 
Starting RPC server on port  50050
Request from aggregator [ipv4:127.0.0.1:36282] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
failed in ReconnectLink [17:04:31] /home/mahmoud/work/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: SSL - A fatal alert message was received from our peer
Stack trace:
  [bt] (0) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(+0x2d3518) [0x7fbcd04d3518]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::utils::SSLTcpSocket::SSLHandshake()+0x37) [0x7fbcd04d3a27]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0xf9b) [0x7fbcd04cd3cb]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0x13) [0x7fbcd04d70d3]
  [bt] (4) /lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7fbd2e5fae2e]
  [bt] (5) /lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7fbd2e5f7493]
  [bt] (6) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0xa451) [0x7fbd2d45f451]
  [bt] (7) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x9a68) [0x7fbd2d45ea68]
  [bt] (8) python3(_PyObject_MakeTpCall+0x25b) [0x555fd8fc14ab]

Error from federated-xgboost:


mahmoud@mahmoud-PC:~/work/federated-xgboost/demo/basic$ python3 serve.py 
Starting RPC server on port  50051
Request from aggregator [ipv4:127.0.0.1:46978] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
failed in ReconnectLink [17:04:31] /home/mahmoud/work/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: X509 - Certificate verification failed, e.g. CRL, CA or signature check failed
Stack trace:
  [bt] (0) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(+0x2d3518) [0x7f946acd3518]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::utils::SSLTcpSocket::SSLHandshake()+0x37) [0x7f946acd3a27]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0xdac) [0x7f946accd1dc]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0x13) [0x7f946acd70d3]
  [bt] (4) /lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7f94c8e96e2e]
  [bt] (5) /lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7f94c8e93493]
  [bt] (6) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0xa451) [0x7f94c8e46451]
  [bt] (7) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x9a68) [0x7f94c8e45a68]
  [bt] (8) python3(_PyObject_MakeTpCall+0x25b) [0x55ae6fa164ab]

MahmoudHamdy02 commented 1 year ago

The steps in this comment fixed it for me, but I had to remove all certificate files from both machines and generate new ones.
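
Since leftover certificate files were what kept the error alive, a quick way to spot them is to compare the certificate files in the two copies. This is only a sketch under assumptions: the file extensions in CERT_SUFFIXES are guesses, the folder paths in the usage lines are taken from the terminal output above, and the helpers cert_digests and differing are hypothetical names, not part of federated-xgboost. It only reports same-named files whose contents differ. Whether a given file is supposed to match (for example a shared CA certificate) depends on how the certificates were generated.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Assumed extensions for certificate-related files; adjust to the real layout.
CERT_SUFFIXES = {".pem", ".crt", ".key", ".csr"}


def cert_digests(root: str) -> Dict[str, str]:
    """Map each certificate-like file under root (by relative path) to its SHA-256 digest."""
    base = Path(root)
    return {
        str(path.relative_to(base)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file() and path.suffix in CERT_SUFFIXES
    }


def differing(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Return relative paths present in both maps whose digests differ."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])


if __name__ == "__main__":
    # Folder names as they appear in the prompts above; both are assumptions
    # about where the two copies live on disk.
    first = cert_digests(str(Path.home() / "work" / "federated-xgboost"))
    second = cert_digests(str(Path.home() / "work" / "federated-xgboost2"))
    for name in differing(first, second):
        print("differs between copies:", name)
```

Files that show up here and are expected to be identical in both copies are candidates for deletion before regenerating the certificates.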