tamiratGit / FedELM


2. Test ELM performance with different number of data points, and make a plot of "test performance vs. amount of training data" #2

Open akusok opened 10 months ago

akusok commented 10 months ago

The goal of federated learning is to build a better model by using more data (from other organizations). If the model does not improve with more data, there is no point in building a federated learning system.

Here we will test how the model performance improves with more data. We test 2 scenarios:

  1. Randomly distributed data. Not very realistic but a typical setup in experiments.
  2. Non-randomly distributed data. Very realistic: for example, hospitals in different countries have different types of patients and different equipment, so their patient data is not "randomly distributed". We will simulate a non-random distribution by splitting geographical data according to its coordinates.
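The two scenarios could be set up roughly as follows. This is only a sketch: the function names, the number of clients, and the synthetic `coords` array are assumptions made for illustration, and cutting along a single coordinate is just one possible way to get a geographic split.

```python
import numpy as np

def split_random(n_samples, n_clients, rng):
    """Scenario 1: shuffle the sample indices and deal them out evenly."""
    return np.array_split(rng.permutation(n_samples), n_clients)

def split_by_coordinate(coords, n_clients, axis=0):
    """Scenario 2: sort samples along one coordinate and cut them into
    contiguous bands, so each client only sees one geographic region."""
    order = np.argsort(coords[:, axis], kind="stable")
    return np.array_split(order, n_clients)

# Synthetic stand-in for (longitude, latitude) columns of a real dataset.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(1000, 2))

random_parts = split_random(len(coords), 4, rng)
geo_parts = split_by_coordinate(coords, 4, axis=0)
```

Each element of `random_parts` and `geo_parts` is an array of row indices for one client, so the same training code can be run on either split.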

The idea is to start from a small number of training data points, then add more training data and check how it affects the model performance. We can start with 50 samples and add 50 more at a time. This gives two plots: one for randomly distributed data and one for non-randomly (geographically) distributed data.
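A minimal sketch of this experiment, assuming a basic ELM (random tanh hidden layer, ridge-regression output weights) written in plain NumPy and a synthetic regression task. The helper names, the default `n_neurons` and `l2` values, and the use of test MSE as the performance measure are all assumptions, not taken from the project code.

```python
import numpy as np

def train_elm(X, y, n_neurons, l2, rng):
    """Fit a basic ELM: random hidden layer, output weights by ridge regression."""
    W = rng.normal(size=(X.shape[1], n_neurons))
    b = rng.normal(size=n_neurons)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + l2 * np.eye(n_neurons), H.T @ y)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

def learning_curve(X_train, y_train, X_test, y_test,
                   step=50, n_neurons=20, l2=1e-3, seed=0):
    """Test MSE after training on the first 50, 100, 150, ... samples."""
    rng = np.random.default_rng(seed)
    sizes = list(range(step, len(X_train) + 1, step))
    errors = []
    for n in sizes:
        model = train_elm(X_train[:n], y_train[:n], n_neurons, l2, rng)
        residual = predict_elm(model, X_test) - y_test
        errors.append(float(np.mean(residual ** 2)))
    return sizes, errors

# Synthetic data standing in for the real dataset.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(700, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=700)
sizes, errors = learning_curve(X[:500], y[:500], X[500:], y[500:])
```

Plotting `errors` against `sizes` gives the "test performance vs. amount of training data" curve. For the non-random scenario, the training rows would be ordered by coordinate before calling `learning_curve`, so that each added batch of 50 comes from a new region.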

Steps:

akusok commented 10 months ago

Idea: find out how good or bad a model each client can create if they don't share data. This will be the baseline before we look into the Federated ELM. It should help us explain why federated ELM is so good for clients with little data, and actually measure how good it is in terms of accuracy improvement.

Separately for each “client”

akusok commented 10 months ago

test:

For every number of training data points (10, 20, 30, 40, …), build 15 ELMs for every combination of (L2, neurons), average their accuracy per combination, then take the parameters with the best average accuracy.

    accuracy = {}
    for l in (10, 20, 30, 45, 70, 100):
        for L2 in (1e-2, 1e-3, 1e-4):
            accuracy[l …
            for run in range(15):
                accuracy …

(rest shown in code notebook)
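Since the notebook code is not shown here, the following is only a sketch of the search described in this comment, not the actual implementation. The training sizes and L2 values are taken from the snippet; the neuron counts, the helper names, the synthetic two-class data, and the choice to draw a fresh random training subset for each of the 15 runs are assumptions.

```python
import numpy as np

def elm_accuracy(X_tr, y_tr, X_val, y_val, n_neurons, l2, n_classes, rng):
    """Train one ELM classifier and return its accuracy on held-out data."""
    T = np.eye(n_classes)[y_tr]                      # one-hot targets
    W = rng.normal(size=(X_tr.shape[1], n_neurons))  # random hidden layer
    b = rng.normal(size=n_neurons)
    H = np.tanh(X_tr @ W + b)
    beta = np.linalg.solve(H.T @ H + l2 * np.eye(n_neurons), H.T @ T)
    pred = np.argmax(np.tanh(X_val @ W + b) @ beta, axis=1)
    return float(np.mean(pred == y_val))

def best_params_per_size(X, y, X_val, y_val, sizes, l2_values,
                         neuron_values, n_runs=15, seed=0):
    """For each training size, average accuracy over n_runs ELMs per
    (L2, neurons) pair and keep the pair with the best mean accuracy."""
    rng = np.random.default_rng(seed)
    n_classes = int(max(y.max(), y_val.max())) + 1
    best = {}
    for n in sizes:
        mean_acc = {}
        for l2 in l2_values:
            for k in neuron_values:
                runs = []
                for _ in range(n_runs):
                    idx = rng.choice(len(X), size=n, replace=False)
                    runs.append(elm_accuracy(X[idx], y[idx], X_val, y_val,
                                             k, l2, n_classes, rng))
                mean_acc[(l2, k)] = float(np.mean(runs))
        params = max(mean_acc, key=mean_acc.get)
        best[n] = (params, mean_acc[params])
    return best

# Synthetic two-class data standing in for the real dataset.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(600, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
sizes = (10, 20, 30, 45, 70, 100)
l2_values = (1e-2, 1e-3, 1e-4)
neuron_values = (5, 10, 20)   # assumed grid; the original values are not shown
best = best_params_per_size(X[:400], y[:400], X[400:], y[400:],
                            sizes, l2_values, neuron_values)
```

`best[n]` holds the winning `(L2, neurons)` pair and its mean accuracy for training size `n`. If the same held-out set is later used to report final performance, the reported numbers will be somewhat optimistic, so a separate validation split may be preferable.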

akusok commented 10 months ago

Find the best ELM parameters for EACH number of training points: 10, 20, 30, 40, … Then use these parameters to produce the plots of performance vs. amount of training data.