KITS results mismatch with paper

luciainnocenti commented 1 year ago

Running the KITS benchmark in the Pooled version gave different results than those in the paper.

I kept the parameters as they are in the repository, namely:

NUM_CLIENTS = 6
BATCH_SIZE = 2
NUM_EPOCHS_POOLED = 500  # 8000 gives better performance but is too long
LR = 3e-4
Optimizer = torch.optim.Adam

but my pooled test accuracy is:

Test	optimizer_class	beta2	server_learning_rate	mu	Metric	seed	learning_rate	tau	Method
client_test_0	<class 'torch.optim.adam.Adam'>				0.073226817	42	0.0003		Pooled Training
client_test_1	<class 'torch.optim.adam.Adam'>				0.0148094064	42	0.0003		Pooled Training
client_test_2	<class 'torch.optim.adam.Adam'>				0.0357094109	42	0.0003		Pooled Training
client_test_3	<class 'torch.optim.adam.Adam'>				0.0399805941	42	0.0003		Pooled Training
client_test_4	<class 'torch.optim.adam.Adam'>				0.0458834544	42	0.0003		Pooled Training
client_test_5	<class 'torch.optim.adam.Adam'>				0.0208523367	42	0.0003		Pooled Training
Pooled Test	<class 'torch.optim.adam.Adam'>				0.0323568322	42	0.0003		Pooled Training

jeandut commented 1 year ago

Hi @luciainnocenti, thank you for your interest. Could you give the exact command that you ran? The results should be obtained by running:

cd flamby/benchmarks
python fed_benchmark.py --seed 42 -cfp ../config_kits19.json

@ErumMushtaq can you take a look ?

luciainnocenti commented 1 year ago

Hi @jeandut, I confirm that's the exact command I ran. I have already run all the other benchmarks (except LIDC-IDRI) and everything is fine. This is the only problematic one.

jeandut commented 1 year ago

We will be looking into it. Thanks for reporting.

ErumMushtaq commented 1 year ago

@luciainnocenti Thanks for reporting.

@jeandut Sure. Let me take a look and get back to you.

ErumMushtaq commented 1 year ago

@luciainnocenti I am able to reproduce the results on my end for the pooled training method (seed 42) by running the command Jean mentioned above. Could you confirm whether other methods, such as FedAvg and FedProx, are working on your end for the KiTS19 dataset?

luciainnocenti commented 1 year ago

I did not try that yet because the training is quite long and expensive, so I checked only the pooled version to make sure it worked. I will try it now.

jeandut commented 1 year ago

I am re-downloading and re-processing the dataset on my end as well, to confirm that I can reproduce what @ErumMushtaq did with the current version of the repo in a fresh env. I'll try to add a hash of the dataset files so we can double-check your data, @luciainnocenti. The pooled version should be alright as well; I don't see any reason why it wouldn't be. If it's not the data or the preprocessing, I frankly don't know what it can be.
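A hash comparison like the one proposed above could be sketched as follows. This is only an illustration, not a script from the FLamby repository: the helper names (`hash_file`, `hash_tree`, `diff_hashes`) are hypothetical, and it assumes both sides hash the same preprocessed dataset folder and exchange the resulting dictionaries.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_hashes(mine: Dict[str, str], theirs: Dict[str, str]) -> Dict[str, List[str]]:
    """List files missing on either side and files whose contents differ."""
    shared = set(mine) & set(theirs)
    return {
        "only_mine": sorted(set(mine) - set(theirs)),
        "only_theirs": sorted(set(theirs) - set(mine)),
        "different": sorted(name for name in shared if mine[name] != theirs[name]),
    }
```

Each participant would run `hash_tree` on their own copy of the data and share the result (for example as JSON); an empty `diff_hashes` report would suggest the data and preprocessing output are identical.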

jeandut commented 1 year ago

Let's check the data first and then we'll make sure every single version of packages match including torch.

jeandut commented 1 year ago

@luciainnocenti have you been able to make it work? I don't have much bandwidth currently, but this is on my TODO list for the next 3 months.

jeandut commented 1 year ago

Hello @luciainnocenti I just reproduced the results with the exact command I sent you for pooled training:

python fed_benchmark.py --seed 42 -cfp ../config_kits19.json

[Screenshot 2023-07-11 at 15:29:31]

As a reference, the old results in the repository are quite close (although there is some variability). We changed some things in the code to fix reproducibility in some cases, and the versions of the deep learning packages were updated to the new ones (notably torch==2.0.1, which did not exist back then), which might have slightly affected the results.

[Screenshot 2023-07-11 at 15:33:27]

Are you sure you followed all the exact steps indicated here, including the preprocessing? I modified the instructions so that some steps are easier to understand. Here is the result of my pip freeze if it helps:

absl-py==1.4.0
alabaster==0.7.13
albumentations==1.3.1
astor==0.8.1
autograd==1.6.2
autograd-gamma==0.5.0
Babel==2.12.1
batchgenerators==0.25
cachetools==5.3.1
certifi==2023.5.7
cfgv==3.3.1
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
cmake==3.26.4
contourpy==1.1.0
cycler==0.11.0
dask==2023.6.1
dicom-numpy==0.6.5
dicom2nifti==2.4.8
distlib==0.3.6
docutils==0.17.1
efficientnet-pytorch==0.7.1
exceptiongroup==1.1.2
filelock==3.12.2
-e git+https://github.com/owkin/FLamby.git@f169811ee2832329198a78726f4faa6d6f00d4c3#egg=flamby
fonttools==4.40.0
formulaic==0.6.3
fsspec==2023.6.0
future==0.18.3
google-api-core==2.11.1
google-api-python-client==2.91.0
google-auth==2.21.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==1.0.0
googleapis-common-protos==1.59.1
grpcio==1.56.0
histolab==0.6.0
httplib2==0.22.0
identify==2.5.24
idna==3.4
imageio==2.31.1
imagesize==1.4.1
importlib-metadata==6.7.0
importlib-resources==5.12.0
iniconfig==2.0.0
interface-meta==1.3.0
Jinja2==3.1.2
joblib==1.3.1
kiwisolver==1.4.4
large-image==1.23.0
large-image-source-openslide==1.23.0
lifelines==0.27.7
linecache2==1.0.0
lit==16.0.6
llvmlite==0.40.1
locket==1.0.0
Markdown==3.4.3
MarkupSafe==2.1.3
matplotlib==3.7.1
MedPy==0.4.0
monai==1.2.0
mpmath==1.3.0
networkx==3.1
nibabel==3.2.2
nnunet==1.7.0
nodeenv==1.8.0
numba==0.57.1
numpy==1.23.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opacus==1.4.0
opencv-python-headless==4.8.0.74
openslide-python==1.2.0
opt-einsum==3.3.0
packaging==23.1
palettable==3.3.3
pandas==2.0.3
partd==1.4.0
Pillow==9.5.0
platformdirs==3.8.0
pluggy==1.2.0
pre-commit==3.3.3
protobuf==4.23.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydicom==2.4.1
Pygments==2.15.1
pynndescent==0.5.10
pyparsing==3.1.0
pytest==7.4.0
python-dateutil==2.8.2
python-gdcm==3.0.22
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
qudida==0.0.4
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
scikit-image==0.19.3
scikit-learn==1.3.0
scipy==1.8.1
seaborn==0.12.2
SimpleITK==2.2.1
six==1.16.0
sklearn==0.0.post5
snowballstemmer==2.2.0
Sphinx==4.5.0
sphinx-rtd-theme==1.0.0
sphinxcontrib-applehelp==1.0.4
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
sympy==1.12
tensorboard==2.13.0
tensorboard-data-server==0.7.1
threadpoolctl==3.1.0
tifffile==2023.4.12
tifftools==1.3.12
tomli==2.0.1
toolz==0.12.0
torch==2.0.1
torchvision==0.15.2
tqdm==4.65.0
traceback2==1.4.0
triton==2.0.0
typing_extensions==4.7.1
tzdata==2023.3
umap-learn==0.5.3
unittest2==1.1.0
uritemplate==4.1.1
urllib3==1.26.16
virtualenv==20.23.1
Werkzeug==2.3.6
wget==3.2
wrapt==1.15.0
zipp==3.15.0

Can you check the length of the FedKits19 dataset when it is instantiated? Can you also add a TensorBoard to monitor the loss? It seems there is something very wrong somewhere that must explain this discrepancy.
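The two checks requested above could look roughly like this. It is a hedged sketch, not code from the repository: `split_sizes` and `log_losses` are hypothetical helpers, the `FedKits19(train=..., pooled=...)` arguments in the usage comment are an assumption about the dataset constructor, and the log directory is made up.

```python
from typing import Dict, Iterable, Sized


def split_sizes(splits: Dict[str, Sized]) -> Dict[str, int]:
    """Return the number of samples in each named dataset split."""
    return {name: len(dataset) for name, dataset in splits.items()}


def log_losses(writer, losses: Iterable[float], tag: str = "loss/train") -> int:
    """Send one scalar per step to writer.add_scalar and return the step count.

    `writer` can be a torch.utils.tensorboard.SummaryWriter or any object
    exposing add_scalar(tag, value, step).
    """
    steps = 0
    for step, loss in enumerate(losses):
        writer.add_scalar(tag, float(loss), step)
        steps += 1
    return steps


# Usage sketch (assumes FLamby and TensorBoard are installed, and that
# FedKits19 accepts the train/pooled arguments):
#
#   from flamby.datasets.fed_kits19 import FedKits19
#   from torch.utils.tensorboard import SummaryWriter
#
#   print(split_sizes({
#       "pooled_train": FedKits19(train=True, pooled=True),
#       "pooled_test": FedKits19(train=False, pooled=True),
#   }))
#   writer = SummaryWriter(log_dir="runs/kits19_pooled")  # hypothetical path
#   log_losses(writer, epoch_losses)  # epoch_losses collected during training
#   writer.close()
```

If the split sizes differ between two machines, the preprocessing is the likely culprit; if they match, a loss curve that never decreases would point at the training setup instead.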

jeandut commented 1 year ago

As a side note, @luciainnocenti, we will probably very soon update the benchmarking guidelines with new scripts, which we hope will be easier to reproduce and will come with a nicer interface for results (and training curves!). In the meantime, make sure your environment matches and that you are using the GPU (the benchmark wasn't tested on CPU-only machines).
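A quick way to confirm the GPU is actually visible to PyTorch is sketched below. The function name `describe_device` is hypothetical; it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it degrades gracefully when torch is not installed.

```python
def describe_device() -> str:
    """Report which device PyTorch would use, without failing if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible CUDA device.
        return f"cuda ({torch.cuda.get_device_name(0)}), torch {torch.__version__}"
    return f"cpu only, torch {torch.__version__}"


print(describe_device())
```

Running this in the same environment as `fed_benchmark.py` shows whether training would silently fall back to the CPU.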

luciainnocenti commented 1 year ago

Hi @jeandut, I tried to follow your tip on the preprocessing. I uninstalled and re-installed everything, downloaded the data again from scratch, and now the results are comparable with the paper. So I am not sure whether the pre-processing was incomplete or what the problem was. I will check the library versions just to make sure everything matches.
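Checking library versions against the `pip freeze` listing posted earlier could be done along these lines. This is a minimal sketch under stated assumptions: `parse_freeze` and `diff_freeze` are hypothetical helpers, and only plain `name==version` lines are compared (editable installs such as the `-e git+...` line are skipped).

```python
from typing import Dict, Tuple


def parse_freeze(text: str) -> Dict[str, str]:
    """Parse `pip freeze` output into {package: version}, skipping other lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Editable installs, comments and blank lines carry no "name==version".
        if not line or line.startswith(("#", "-e ")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Normalize case and underscores so typing_extensions == typing-extensions.
        pins[name.lower().replace("_", "-")] = version
    return pins


def diff_freeze(reference: str, local: str) -> Dict[str, Tuple[str, str]]:
    """Return {package: (reference_version, local_version)} for every mismatch."""
    ref, loc = parse_freeze(reference), parse_freeze(local)
    return {
        name: (ref.get(name, "missing"), loc.get(name, "missing"))
        for name in sorted(set(ref) | set(loc))
        if ref.get(name) != loc.get(name)
    }
```

Saving the reference listing and the output of a local `pip freeze` to two text files and passing their contents to `diff_freeze` gives a short list of packages to align, with torch being the first one worth looking at.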

jeandut commented 1 year ago

Super cool ! Let me close this issue then ! Do not hesitate if you run into other weird problems !

owkin / FLamby

KITS results mismatch with paper #285