Closed: jacopocecch closed this issue 1 year ago

Hi, I was just wondering, can I use the LightGBM data parallel feature to simulate federated learning? Suppose I have two machines and each machine has its local dataset, does it work? Or does the data in each machine have to be the same for it to work? Thanks in advance!

> Or does the data in each machine have to be the same for it to work?
With data parallel training in LightGBM, each machine can hold a different non-overlapping subset of the full training data. This is described in some detail in these resources:
> can I use the LightGBM data parallel feature to simulate federated learning?
It depends on how you define "federated learning".
I've usually seen that phrase used to imply that details about the dataset aren't shared between individual training processes, and that the only things shared across the network are things like the learned parameters (or splits, in the case of tree-based stuff like LightGBM) and gradients.
I've also usually seen that phrase used to refer to doing distributed training in environments where there are restrictions on data movement, e.g. you have training data in different geographic locations and do not want to physically move it, but want to train a model over all of it.
So I want to be sure you understand... LightGBM distributed training (in all of its modes: data-parallel, feature-parallel, and voting-parallel) does involve a global sync-up where all training processes hold information like the global distribution of each feature (i.e., the bin boundaries of the histograms). And that "distribution" includes specific values of categorical variables (which is a common representation for potentially sensitive information like country, zip code, age, etc.).
@guolinke @shiyu1994 please correct me if I've said anything that's incorrect.
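To make the data-parallel setup described above concrete, here is a rough sketch with the Python package. This is only an illustration, not a verified configuration: the hostnames, port, file path, and label column name are placeholders, and each machine would run the same script against its own local partition.

```python
# Illustrative sketch only: run the same script on both machines, each one
# pointing at that machine's local, non-overlapping slice of the training data.
import lightgbm as lgb
import pandas as pd

# Each machine loads only its own local data (placeholder path and column name).
local_df = pd.read_csv("local_partition.csv")
X = local_df.drop(columns=["label"])
y = local_df["label"]

params = {
    "objective": "binary",
    "tree_learner": "data",  # data-parallel distributed training
    "num_machines": 2,
    # all participating workers as a comma-separated ip:port list (placeholder addresses)
    "machines": "10.0.0.1:12400,10.0.0.2:12400",
    "local_listen_port": 12400,
}

# Every machine calls train(); during training LightGBM syncs histogram
# information (including the global feature distributions mentioned above)
# across the network.
bst = lgb.train(params, lgb.Dataset(X, label=y))
```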
@jameslamb Thank you for the answer, it's all clear now.
@jacopocecch Thanks for using LightGBM.
> Suppose I have two machines and each machine has its local dataset, does it work? Or does the data in each machine have to be the same for it to work?
Yes, each machine can have its own part of the data.
The parameter pre_partition (https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition) tells LightGBM whether the data for distributed training has already been partitioned across machines or whether each machine has a full copy of the data.
In your case, you need to set pre_partition=true.
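For example, continuing the illustrative params dict from the earlier sketch (still an assumption about your setup, not a tested configuration), that is one extra entry:

```python
# Hypothetical addition to the sketch above: each machine already holds its own
# non-overlapping partition, so tell LightGBM not to expect a full copy of the data.
params["pre_partition"] = True
```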
Ah thanks, I forgot about also setting pre_partition=True.
@shiyu1994 thank you
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.