mc2-project / federated-xgboost

Federated gradient boosted decision tree learning

X509 - Certificate verification failed, e.g. CRL, CA or signature check failed #26

Closed by litlep-nibbyt 1 year ago

litlep-nibbyt commented 2 years ago

I've been trying to run the demo on one machine. I'm running into this error related to certificate verification. I've tried re-running certs/run-agg.sh to regenerate a new certificate, but I'm getting the same result.

If I'm running this on 3 processes (2 workers and 1 aggregator) -- do I need 3 copies of federated-xgboost? I currently have two copies of federated-xgboost. One of them is used by a worker and an aggregator and the other one is used by only a worker.

Starting RPC server on port  50051
Request from aggregator [ipv4:172.17.0.2:35996] to start federated training session:
Please enter 'Y' to confirm or 'N' to reject.
Join session? [Y/N]: Y
Starting federated training session
E0406 19:10:07.737366700    2700 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
failed in ReconnectLink [19:10:11] /build/federated-xgboost/rabit/include/rabit/internal/ssl_socket.h:26: X509 - Certificate verification failed, e.g. CRL, CA or signature check failed
Stack trace:
  [bt] (0) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(+0x2a2674) [0x7f9faeab6674]
  [bt] (1) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(+0x2a2c57) [0x7f9faeab6c57]
  [bt] (2) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::ReConnectLinks(char const*)+0xe45) [0x7f9faeaafd85]
  [bt] (3) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceBase::Init(int, char**)+0x33d) [0x7f9faeab128d]
  [bt] (4) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::AllreduceRobust::Init(int, char**)+0xe) [0x7f9faeabb11e]
  [bt] (5) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(rabit::engine::Init(int, char**)+0x46c) [0x7f9faeab7b7c]
  [bt] (6) /opt/conda/envs/federated-xgboost/lib/python3.9/lib-dynload/../../libffi.so.8(+0x6a4a) [0x7f9fb3337a4a]
  [bt] (7) /opt/conda/envs/federated-xgboost/lib/python3.9/lib-dynload/../../libffi.so.8(+0x5fea) [0x7f9fb3336fea]
  [bt] (8) /opt/conda/envs/federated-xgboost/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x132ad) [0x7f9fb33502ad]

/opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/core.py:613: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
  warnings.warn("Use subset (sliced data) of np.ndarray is not recommended " +
Number of parties in federation:  2
Training
Traceback (most recent call last):
  File "/build/federated-xgboost/demo/basic/demo.py", line 31, in <module>
    bst = fxgb.train(params, dtrain, num_rounds, evals=[(dtrain, "dtrain"), (dval, "dval")])
  File "/opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/training.py", line 212, in train
    return _train_internal(params, dtrain,
  File "/opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/training.py", line 74, in _train_internal
    bst.update(dtrain, i, obj)
  File "/opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/core.py", line 1108, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, ctypes.c_int(iteration),
  File "/opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/core.py", line 176, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
federatedxgboost.core.XGBoostError: [19:10:18] /build/federated-xgboost/include/xgboost/tree_model.h:295: Check failed: fi->Read(&param, sizeof(TreeParam)) == sizeof(TreeParam) (0 vs. 148) :
Stack trace:
  [bt] (0) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f9fae8aadb3]
  [bt] (1) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::RegTree::Load(dmlc::Stream*)+0xd9) [0x7f9fae92a429]
  [bt] (2) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::tree::TreeSyncher::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x352) [0x7f9faea40a02]
  [bt] (3) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::tree::QuantileHistMaker::Builder::Update(xgboost::common::GHistIndexMatrix const&, xgboost::common::GHistIndexBlockMatrix const&, xgboost::common::ColumnMatrix const&, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, xgboost::RegTree*)+0x473) [0x7f9faea27853]
  [bt] (4) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::tree::QuantileHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x17e) [0x7f9faea1ec8e]
  [bt] (5) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x509) [0x7f9fae9333b9]
  [bt] (6) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0xa41) [0x7f9fae934411]
  [bt] (7) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x215) [0x7f9fae943bb5]
  [bt] (8) /opt/conda/envs/federated-xgboost/lib/python3.9/site-packages/federatedxgboost-0.90-py3.9.egg/federatedxgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x48) [0x7f9fae8a3148]

luckystarufo commented 2 years ago

I solved it by running both 'run-agg.sh' and 'gen-root.sh'.

mansrim commented 1 year ago

Hey @luckystarufo, I am facing the same error and your fix does not seem to work for me, so I thought I might ask you. When you say you ran both 'run-agg.sh' and 'gen-root.sh', do you mean you ran both files on each machine (workers and aggregator)? Or do you mean you ran 'run-agg.sh' on the aggregator and 'gen-root.sh' on the workers? I tried it both ways and I still get the "Certificate verification failed, e.g. CRL, CA or signature check failed" error. PS: I am running the simulation on three Linux VMs (one aggregator and two workers), each with its own copy of the federated-xgboost folder. Thank you in advance.

luckystarufo commented 1 year ago

Hi @mansrim, as far as I remember, I ran both scripts on all machines: 'gen-root.sh' and then 'run-agg.sh'.

roxanadangerm commented 1 year ago

Hi @mansrim and @luckystarufo, I am facing the same issue. Did you solve it? I am running in a Docker container. Thank you very much in advance.

mansrim commented 1 year ago

Hello @roxanadangerm, no, I'm afraid I couldn't solve it.

kaoutherab commented 1 year ago

@roxanadangerm @mansrim here is how I got it to work. Assuming 1 aggregator and 2 workers, with certs as the working directory:
1. On the aggregator: run gen-root.sh, then run gen-agg.sh.
2. Copy the root.pem and root_cert.crt generated by the aggregator to the workers.
3. On the workers: run only gen-agg.sh.
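
One plausible reading of the steps above (an assumption, not something confirmed in this thread) is that verification fails when the parties hold different root files, for example because gen-root.sh was run separately in each copy. Under that assumption, a quick sanity check is to confirm that every certs directory holds byte-identical copies of root.pem and root_cert.crt. The sketch below does that with the Python standard library only; the function names are hypothetical and are not part of federated-xgboost.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def roots_match(cert_dirs, names=("root.pem", "root_cert.crt")):
    """Return True if every directory holds identical copies of each root file.

    cert_dirs: paths to the certs directories to compare (one per party).
    names: root file names, taken from the steps in the comment above.
    """
    for name in names:
        # Collect one digest per directory; identical files collapse to one entry.
        digests = {fingerprint(Path(d) / name) for d in cert_dirs}
        if len(digests) != 1:
            return False
    return True
```

On a single machine with two copies of the repository, this could be called as roots_match([first_copy_certs, second_copy_certs]), where both arguments are the paths to each copy's certs directory. On separate VMs, printing fingerprint(...) on each machine and comparing the digests by eye gives the same information. A match only shows that the root files are shared; it does not prove that the certificates produced by gen-agg.sh were signed with them.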