I want to train on the demo dataset locally with two parties, but I get a "failed in ReconnectLink" error. Other issues mention that regenerating the certificates fixes this, but the error persists after I ran both bash scripts multiple times.
I have two copies of the federated-xgboost folder (federated-xgboost and federated-xgboost2): one starts the first party, and the other starts the second party and the aggregator.
The hosts.config file contains:
localhost:50051
localhost:50050
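For reference, here is a minimal sketch of how the two entries above can be sanity-checked before launching the aggregator. It only confirms that something is accepting TCP connections on each port; it does not test the SSL handshake or the certificates. The helper names parse_hosts and is_listening are hypothetical and not part of federated-xgboost, and the sketch assumes every non-empty line has the host:port form shown above.

```python
import socket


def parse_hosts(text):
    """Parse 'host:port' lines into (host, port) tuples, skipping blank lines.

    Assumes each non-empty line has the host:port form used in hosts.config.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, port = line.rpartition(":")
        hosts.append((host, int(port)))
    return hosts


def is_listening(host, port, timeout=1.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # The same two entries as in my hosts.config.
    config = "localhost:50051\nlocalhost:50050\n"
    for host, port in parse_hosts(config):
        status = "listening" if is_listening(host, port) else "not reachable"
        print(f"{host}:{port} {status}")
```

In my case both serve.py processes do accept the aggregator's request (see the logs below), so the failure happens after the TCP connection is made, during the SSL handshake.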
I run python3 serve.py from inside the demo/basic folder of each copy. (I changed the port to 50050 in the second folder.)
I run this aggregator command from the root federated-xgboost2 folder:
dmlc-core/tracker/dmlc-submit --cluster rpc --num-workers 2 --host-file demo/basic/hosts.config --worker-memory 4g demo.py
Error from federated-xgboost2:
mahmoud@mahmoud-PC:~/work/federated-xgboost2/demo/basic$ python3 serve.py
Starting RPC server on port 50050
Request from aggregator [ipv4:127.0.0.1:36282] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
failed in ReconnectLink [17:04:31] /home/mahmoud/work/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: SSL - A fatal alert message was received from our peer
Stack trace:
[bt] (0) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(+0x2d3518) [0x7fbcd04d3518]
[bt] (1) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::utils::SSLTcpSocket::SSLHandshake()+0x37) [0x7fbcd04d3a27]
[bt] (2) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0xf9b) [0x7fbcd04cd3cb]
[bt] (3) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0x13) [0x7fbcd04d70d3]
[bt] (4) /lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7fbd2e5fae2e]
[bt] (5) /lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7fbd2e5f7493]
[bt] (6) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0xa451) [0x7fbd2d45f451]
[bt] (7) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x9a68) [0x7fbd2d45ea68]
[bt] (8) python3(_PyObject_MakeTpCall+0x25b) [0x555fd8fc14ab]
Error from federated-xgboost:
mahmoud@mahmoud-PC:~/work/federated-xgboost/demo/basic$ python3 serve.py
Starting RPC server on port 50051
Request from aggregator [ipv4:127.0.0.1:46978] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
failed in ReconnectLink [17:04:31] /home/mahmoud/work/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: X509 - Certificate verification failed, e.g. CRL, CA or signature check failed
Stack trace:
[bt] (0) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(+0x2d3518) [0x7f946acd3518]
[bt] (1) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::utils::SSLTcpSocket::SSLHandshake()+0x37) [0x7f946acd3a27]
[bt] (2) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0xdac) [0x7f946accd1dc]
[bt] (3) /usr/local/lib/python3.10/dist-packages/federatedxgboost-0.90-py3.10.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0x13) [0x7f946acd70d3]
[bt] (4) /lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7f94c8e96e2e]
[bt] (5) /lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7f94c8e93493]
[bt] (6) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0xa451) [0x7f94c8e46451]
[bt] (7) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x9a68) [0x7f94c8e45a68]
[bt] (8) python3(_PyObject_MakeTpCall+0x25b) [0x55ae6fa164ab]