microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Can I use data parallel with different data on each machine? #6148

Closed · jacopocecch closed 1 year ago

jacopocecch commented 1 year ago

Hi, I was just wondering, can I use the LightGBM data parallel feature to simulate federated learning? Suppose I have two machines and each machine has its local dataset, does it work? Or does the data in each machine have to be the same for it to work? Thanks in advance!

jameslamb commented 1 year ago

Or does the data in each machine have to be the same for it to work?

With data parallel training in LightGBM, each machine can hold a different, non-overlapping subset of the full training data. This is described in some detail in the Distributed Learning Guide: https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html
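
For illustration, here is a minimal sketch (not from this thread; the data, shard paths, and machine count are placeholders) of carving one training set into disjoint per-machine shards before running data-parallel training:

```python
import numpy as np

# Placeholder training data for illustration.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

# Split the rows into disjoint, non-overlapping shards, one per machine.
num_machines = 2
shards = np.array_split(np.random.permutation(len(X)), num_machines)
for rank, idx in enumerate(shards):
    # Copy shard_<rank>.npz to machine <rank> before training starts.
    np.savez(f"shard_{rank}.npz", X=X[idx], y=y[idx])
```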

can I use the LightGBM data parallel feature to simulate federated learning?

It depends on how you define "federated learning".

I've usually seen that phrase used to imply that details about the dataset aren't shared between individual training processes, and that the only thing that's shared across the network are things like the learned parameters (or splits, for the case of tree-based stuff like LightGBM) and gradients.

I've also usually seen that phrase used to refer to doing distributed training in environments where there are restrictions on data movement, e.g. you have training data in different geographic locations and do not want to physically move it, but want to train a model over all of it.

So I want to be sure you understand... LightGBM distributed training (all of its modes... data-parallel, feature-parallel, and voting-parallel) does involve a global sync-up, after which every training process holds information like the global distribution of each feature (i.e., the bin boundaries of the histograms). And that "distribution" includes the specific values of categorical features, which are a common representation for potentially sensitive information like country, zip code, age, etc.
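
To make that concrete, here is a conceptual sketch (this is not LightGBM's actual binning code, and the data is synthetic) of why agreeing on global bin boundaries leaks information about each machine's local feature values:

```python
import numpy as np

# Feature values held privately by each machine (synthetic placeholders).
shard_a = np.random.exponential(size=5_000)  # machine A's feature column
shard_b = np.random.exponential(size=5_000)  # machine B's feature column

# To build comparable histograms, both machines must agree on one set of
# bin boundaries derived from the *combined* distribution, so information
# about every machine's feature values leaves that machine.
combined = np.concatenate([shard_a, shard_b])
global_bins = np.quantile(combined, np.linspace(0.0, 1.0, num=256))
```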

jameslamb commented 1 year ago

@guolinke @shiyu1994 please correct me if I've said anything that's incorrect.

jacopocecch commented 1 year ago

@jameslamb Thank you for the answer, it's all clear now.

shiyu1994 commented 1 year ago

@jacopocecch Thanks for using LightGBM.

Suppose I have two machines and each machine has its local dataset, does it work? Or does the data in each machine have to be the same for it to work?

Yes, each machine can have its own part of the data.

The pre_partition parameter (https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition) tells LightGBM whether the data in distributed training has already been partitioned across machines or whether each machine has a full copy of the data. In your case, you need to set pre_partition=true.
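
For example, a minimal sketch of the per-machine training script using the Python package (the shard path and machine addresses are placeholders; the same script runs on every machine):

```python
import lightgbm as lgb

# Each machine loads only its own local shard (placeholder path).
train_data = lgb.Dataset("local_shard.bin")

params = {
    "objective": "binary",
    "tree_learner": "data",     # data-parallel training
    "pre_partition": True,      # data is already partitioned across machines
    "num_machines": 2,
    # Placeholder ip:port list of all participating machines.
    "machines": "10.0.0.1:12400,10.0.0.2:12400",
    "local_listen_port": 12400,
}

# LightGBM syncs histograms and splits across machines during training.
booster = lgb.train(params, train_data, num_boost_round=100)
```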

jameslamb commented 1 year ago

Ah thanks, I forgot about also setting pre_partition=True.

jacopocecch commented 1 year ago

@shiyu1994 thank you

github-actions[bot] commented 1 month ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.