mc2-project / federated-xgboost

Federated gradient boosted decision tree learning

Is privacy protection considered in this project? #9

Closed hacql2004 closed 4 years ago

hacql2004 commented 4 years ago

Hi, I'm a fan of your mc2 project and glad to see federated-xgboost being updated recently. It seems that some convenience functionality (like the listen port and aggregator invitation) has been added in the new version. I still have some questions about your project.

  1. Compared with vanilla XGBoost, did your project redesign the basic map-reduce model framework (the core mechanism for horizontal federated learning), or does it just reuse this part of vanilla XGBoost?
  2. Apart from the aggregator-invitation mechanism, are there any other privacy protection measures (like encrypting the shared model parameters) considered in your project? Thanks.

podcastinator commented 4 years ago

Hi @hacql2004, thanks for your interest!

  1. This project reuses the distributed training algorithms of vanilla XGBoost. In those algorithms, each node exchanges summaries of their data with the others, so perhaps this is sufficient from the perspective of federated learning. We make minor modifications to change the communication pattern, however, so that each node sends the summary to a centralized aggregator, and not to the other nodes.

  2. We are in the process of adding TLS protection for communication between each node and the aggregator. For now, we do not consider stronger guarantees than that (e.g. encrypting the summaries, and aggregating encrypted summaries), but that would be a very interesting feature to add in the future. Happy to collaborate on this, in case you would like to contribute :-) For stronger privacy guarantees, please check out our Secure XGBoost project, which uses hardware enclaves to keep all the data private at all times.
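To make point 1 concrete, here is a toy sketch (not the project's actual code; all names are illustrative) of the aggregator-mediated pattern described above: each party computes a local summary of its gradient statistics, only the summary leaves the party, and the aggregator combines the summaries into the global statistics used for split finding.

```python
# Toy illustration of star-topology aggregation: parties compute local
# gradient/hessian summaries and only the aggregator ever sees them all.

def local_summary(gradients, hessians):
    """Each party's per-round summary: sums of its gradients and hessians."""
    return (sum(gradients), sum(hessians))

def aggregate(summaries):
    """Aggregator-side reduction: element-wise sum across parties."""
    g_total = sum(g for g, _ in summaries)
    h_total = sum(h for _, h in summaries)
    return (g_total, h_total)

# Three parties, each with private data; raw rows never leave a party.
party_summaries = [
    local_summary([1.0, 2.0], [1.0, 1.0]),
    local_summary([4.0], [1.0]),
    local_summary([-1.0, 0.5], [1.0, 1.0]),
]
global_summary = aggregate(party_summaries)
print(global_summary)  # prints (6.5, 5.0)
```

In vanilla distributed XGBoost the same reduction happens via an allreduce among all nodes; the modification described above routes it through a single aggregator instead.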

hacql2004 commented 4 years ago

Thanks for your reply, but I still feel a bit confused here. Your reply mentioned that 'We make minor modifications to change the communication pattern, however, so that each node sends the summary to a centralized aggregator, and not to the other nodes.' Did you mean that all client nodes communicate only with the server node in the original vanilla XGBoost? Or does the original XGBoost not support this, and it is only possible after your modifications?

Because I notice that the checkpoint/lazycheckpoint functions back up a client node's local data and recover the data from another client node once it shuts down unexpectedly. This means there is direct data transfer between client nodes in the original XGBoost. Could you explain this further? Thanks.
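The replication behavior being asked about can be sketched as a toy simulation (this is not rabit's API; the class and function names here are made up for illustration): each node pushes a copy of its checkpoint to a peer, so a restarted node can pull its state back from that peer, which is exactly the node-to-node transfer the question points at.

```python
# Toy simulation of checkpoint replication in a ring of nodes: a node's
# checkpoint is mirrored on its neighbor, so recovery after a crash reads
# from the peer rather than from local disk.

class Node:
    def __init__(self, rank):
        self.rank = rank
        self.state = None   # local model state
        self.replica = {}   # checkpoints replicated from peers

def checkpoint(nodes, rank, state):
    """Save `state` locally and replicate it to the next node in the ring."""
    nodes[rank].state = state
    neighbor = nodes[(rank + 1) % len(nodes)]
    neighbor.replica[rank] = state   # direct node-to-node transfer

def recover(nodes, rank):
    """After a crash, pull the latest checkpoint back from the neighbor."""
    neighbor = nodes[(rank + 1) % len(nodes)]
    return neighbor.replica[rank]

nodes = [Node(r) for r in range(3)]
checkpoint(nodes, 0, {"round": 7})
nodes[0].state = None        # node 0 crashes and loses its in-memory state
print(recover(nodes, 0))     # prints {'round': 7}
```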

Here is the related part, taken from the official rabit tutorial:

chester-leung commented 4 years ago

@hacql2004 we designate one node to be the centralized aggregator, and all other nodes to be clients. The aggregator establishes a connection with each client, and each client talks only to the aggregator. In other words, if there are n clients, the aggregator establishes n connections, and each client establishes 1 connection (to the aggregator).

You're correct that in vanilla XGBoost, clients may talk to one another. In Federated XGBoost, they cannot. This is exactly the distinction in communication pattern between vanilla and Federated XGBoost.
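The difference in topology can be stated in numbers (a back-of-the-envelope sketch, not project code): a star topology with n clients needs n connections in total, while a full mesh of n nodes allows up to n(n-1)/2 pairwise links.

```python
# Connection counts for the two topologies discussed above.

def star_connections(n_clients):
    # One connection per client, all terminating at the aggregator.
    return n_clients

def mesh_connections(n_nodes):
    # Every pair of nodes may communicate directly.
    return n_nodes * (n_nodes - 1) // 2

print(star_connections(4))   # prints 4
print(mesh_connections(5))   # prints 10
```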

Here's a diagram of the communication pattern and general (simplified) workflow in Federated XGBoost. Hope this reduces the confusion!

[Diagram: Federated XGBoost communication pattern and simplified workflow]

hacql2004 commented 4 years ago

It's clear now, thanks.