Create a non-IID CIFAR10 dataset for cross-silo experiments

JYWa commented 3 years ago

Similar to the federated version of CIFAR100, we follow the methods in this paper. Currently, we assume that there are 10 clients in total, each of which has 5000 images.

google-cla[bot] commented 3 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

AdvaitGadhikar commented 3 years ago

@googlebot I signed it

google-cla[bot] commented 3 years ago

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

AdvaitGadhikar commented 3 years ago

@googlebot I consent.

JYWa commented 3 years ago

Can we split this into two: one for the dataset, and a second one for the training loop?

Also, we should not make changes to the optimization/ folder. Let us try to merge the Python binary into fedopt_guide/.

Thanks!

Hi Zheng,

Thanks for the suggestions! This PR now contains only the dataset changes in utils.

AdvaitGadhikar commented 3 years ago

Hi Everyone,

Thank you for your constructive comments about the sampling method! I have tried to incorporate the changes mentioned.

I now sample a multinomial label distribution for each client from a Dirichlet distribution and use it to assign samples to that client.
I noticed that if the number of samples at a client is not divisible by the batch size, the training loop raises a shape error on the last batch, so I have kept the division without remainder.
I have also taken the other suggested edits into account. Please review it and let us know what you think!
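
For readers following along, the two points above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names, the concentration value `ALPHA`, and the batch size of 32 are all assumptions made for the example, and only the Python standard library is used.

```python
import random

NUM_CLASSES = 10           # CIFAR-10 has 10 labels
SAMPLES_PER_CLIENT = 5000  # figure quoted earlier in this thread
ALPHA = 0.5                # hypothetical Dirichlet concentration, not from the PR


def sample_dirichlet(alpha, size, rng):
    """Draw one point from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_label_counts(proportions, num_samples, rng):
    """Draw multinomial counts: how many samples of each label a client gets."""
    labels = rng.choices(range(len(proportions)), weights=proportions, k=num_samples)
    counts = [0] * len(proportions)
    for label in labels:
        counts[label] += 1
    return counts


def usable_samples(num_samples, batch_size):
    """Largest multiple of batch_size that does not exceed num_samples."""
    return (num_samples // batch_size) * batch_size


rng = random.Random(0)
proportions = sample_dirichlet(ALPHA, NUM_CLASSES, rng)
counts = sample_label_counts(proportions, SAMPLES_PER_CLIENT, rng)
print("label proportions:", [round(p, 3) for p in proportions])
print("label counts:", counts)
print("samples kept for batch size 32:", usable_samples(SAMPLES_PER_CLIENT, 32))
```

A smaller `ALPHA` concentrates each client on fewer labels, while a larger one moves the clients toward an IID split. The floor division in `usable_samples` is one way to read "division without remainder": it drops the leftover samples that would otherwise form a short final batch.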

AdvaitGadhikar commented 3 years ago

Hi All,

I have made the suggested edits. The sampling process at each client is now sequential, which ensures that each client has 5000 samples with a label distribution that follows a multinomial sampled from a Dirichlet distribution. I have also removed the dependence on the batch size for now.
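
A rough sketch of what sequential sampling with a fixed per-client size could look like is given below. It is an illustration under stated assumptions, not the merged implementation: `partition_sequentially` is a hypothetical helper, the concentration value is made up, and the uniform fallback for exhausted classes is a choice made for this example. The demo uses a small synthetic label list rather than the real CIFAR-10 labels.

```python
import random


def partition_sequentially(labels, num_clients, samples_per_client, alpha, seed=0):
    """Assign example indices to clients one draw at a time (hypothetical helper)."""
    if num_clients * samples_per_client > len(labels):
        raise ValueError("not enough examples for the requested partition")
    rng = random.Random(seed)
    num_classes = max(labels) + 1

    # Shuffled pool of not-yet-assigned example indices for each class.
    pools = [[] for _ in range(num_classes)]
    for idx, label in enumerate(labels):
        pools[label].append(idx)
    for pool in pools:
        rng.shuffle(pool)

    clients = []
    for _ in range(num_clients):
        # Unnormalized Dirichlet draw; rng.choices normalizes the weights itself.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
        chosen = []
        while len(chosen) < samples_per_client:
            # Exhausted classes get zero weight so they cannot be drawn again.
            weights = [d if pools[c] else 0.0 for c, d in enumerate(draws)]
            if sum(weights) == 0.0:
                # Assumed fallback: uniform over the classes that still have data.
                weights = [1.0 if pools[c] else 0.0 for c in range(num_classes)]
            label = rng.choices(range(num_classes), weights=weights, k=1)[0]
            chosen.append(pools[label].pop())
        clients.append(chosen)
    return clients


# Synthetic stand-in for the labels: 10 classes with 50 examples each.
labels = [c for c in range(10) for _ in range(50)]
clients = partition_sequentially(labels, num_clients=10, samples_per_client=50, alpha=0.5)
print([len(c) for c in clients])
```

Because every draw removes one index from a class pool, no example is assigned twice, and each client stops at exactly `samples_per_client` examples regardless of batch size.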

nightldj commented 3 years ago

Merged in https://github.com/google-research/federated/commit/cb2518e0da738b8bf8d9c145459aa717a71133aa. Thank you for the contribution!
