kampmichael / FedDC

Apache License 2.0

the dataset #1

Open TW1L1 opened 8 months ago

TW1L1 commented 8 months ago

How do I get the dataset used in this project?

kampmichael commented 8 months ago

Thanks for asking. I just realized I didn't provide any guide on that. I added a description of how to obtain the datasets to the readme. I will put it here as well. Hope this helps.

Datasets

CIFAR10: We use the torchvision version of CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html), which downloads the dataset on demand.

MNIST: We use the torchvision version of MNIST https://www.cs.toronto.edu/~kriz/cifar.html which downloads the dataset on demand.

MRI: The dataset can be downloaded from kaggle: kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection. Extract the folder brain_tumor_dataset. You can specify the path to the dataset using the --dataset-path argument.

Pneumonia: The dataset can be downloaded from kaggle: kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection. Please use the PrepareData.ipynb to copy images into a folder structure that separates train and validation data, as well as healthy and pneumonia images. You can specify the path to the dataset using the --dataset-path argument.

SUSY: The SUSY dataset can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/279/susy. Please extract the SUSY.csv file into the folder data/.
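For the two torchvision datasets, "downloads on demand" could look roughly like the sketch below. The helper name `load_torchvision_dataset`, the `data` root folder, and the `ToTensor` transform are assumptions made for illustration; the repository's own scripts may load and preprocess the data differently.

```python
# Minimal sketch, not the repository's loader: fetch CIFAR10 or MNIST via torchvision.
# The helper name, the default root folder and the transform are illustrative assumptions.

# Map lower-case dataset names to the torchvision dataset class names.
TORCHVISION_DATASETS = {"cifar10": "CIFAR10", "mnist": "MNIST"}


def load_torchvision_dataset(name, root="data", train=True):
    """Return a torchvision dataset, downloading it into `root` if it is missing."""
    key = name.lower()
    if key not in TORCHVISION_DATASETS:
        raise ValueError(f"{name!r} is not handled by torchvision in this sketch")

    # Imported lazily so the name check above works even without torchvision installed.
    from torchvision import datasets, transforms

    dataset_cls = getattr(datasets, TORCHVISION_DATASETS[key])
    # download=True fetches the files only if they are not already present in `root`.
    return dataset_cls(
        root=root,
        train=train,
        download=True,
        transform=transforms.ToTensor(),
    )
```

Calling `load_torchvision_dataset("mnist")` would then download MNIST on first use and reuse the local copy afterwards.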

gdbc081128 commented 3 months ago

Why are the datasets linked so similarly?

gdbc081128 commented 3 months ago

Could you please provide more details about the environment and experimental process if possible?

kampmichael commented 3 months ago

The main reason why those links are so similar is that I am stupid. ^^ The proper link for MNIST is, of course, Yann LeCun's website: http://yann.lecun.com/exdb/mnist/ We use the version from torchvision.

The correct link for the pneumonia dataset is https://www.kaggle.com/datasets/praveengovi/coronahack-chest-xraydataset

kampmichael commented 3 months ago

Of course I can provide more details on the environment and experimental process. What do you want to know? Hardware-wise, we ran everything on a cluster node with 24 cores, 1 TB RAM, and 6 NVIDIA RTX A6000 cards, although we ran a few experiments in parallel on this machine.

Regarding the experimental process: we performed parameter optimization for each method using cross-validation on the training set, ran every experiment 3 times with different random data splits, and reported the mean and maximum deviation over the three runs.

In practice, this means running the "runExp.sh" script.
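As a rough illustration of the reporting step, the sketch below repeats an experiment over three seeds and summarizes the scores. It assumes that "maximum deviation" means the largest absolute distance of any run from the mean; the names `summarize_runs`, `repeat_experiment`, and `run_once` are hypothetical and do not come from the repository.

```python
# Minimal sketch of "run 3 times, report mean and maximum deviation".
# Assumption: maximum deviation = largest absolute difference from the mean.
from statistics import mean


def summarize_runs(scores):
    """Return (mean, maximum absolute deviation from the mean) of the scores."""
    center = mean(scores)
    max_deviation = max(abs(score - center) for score in scores)
    return center, max_deviation


def repeat_experiment(run_once, seeds=(0, 1, 2)):
    """Call `run_once(seed)` once per seed and summarize the resulting scores.

    `run_once` is a placeholder for whatever trains a model on a random data
    split derived from `seed` and returns a single score such as test accuracy.
    """
    scores = [run_once(seed) for seed in seeds]
    return summarize_runs(scores)
```

For example, `summarize_runs([0.5, 0.75, 1.0])` returns `(0.75, 0.25)`.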

gdbc081128 commented 3 months ago

Thank you for your answer. I would like to know the default path and format for the data. Which packages are used in the environment, such as Torch, CV2, and NumPy, and which specific versions?

kampmichael commented 3 months ago

Data:

CIFAR10 and MNIST: You can use any path when you download them via torchvision. All preprocessing for those datasets is in the Python scripts provided in this repository.

SUSY: The default path is data/. There is no preprocessing.

MRI: You can put it in any path you like; just provide the path via the --dataset-path argument. Again, all preprocessing is in the code.

Pneumonia: You have to run the PrepareData.ipynb notebook, which prepares the data in the right format. Again, specify the path to wherever you have put the data via the --dataset-path argument.
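The path rules above can be summarized in a small sketch. Only the `--dataset-path` argument and the `data/` default for SUSY come from this thread; the `--dataset` flag, the function names, and the fallback folder for CIFAR10 and MNIST are assumptions for illustration and may not match the repository's argument parser.

```python
# Minimal sketch of the path rules described above; not the repository's parser.
import argparse
from pathlib import Path

# Datasets that have to be downloaded manually and located via --dataset-path.
NEEDS_EXPLICIT_PATH = {"mri", "pneumonia"}


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical dataset path handling")
    # --dataset is a hypothetical flag; --dataset-path is the one named in this thread.
    parser.add_argument("--dataset", required=True,
                        choices=["cifar10", "mnist", "susy", "mri", "pneumonia"])
    parser.add_argument("--dataset-path", default=None)
    return parser


def resolve_dataset_path(dataset, dataset_path=None):
    """Return the location where the given dataset is expected in this sketch."""
    name = dataset.lower()
    if name == "susy":
        # SUSY.csv is extracted into data/ by default.
        return Path(dataset_path or "data") / "SUSY.csv"
    if name in NEEDS_EXPLICIT_PATH:
        if dataset_path is None:
            raise ValueError(f"{dataset} requires --dataset-path")
        return Path(dataset_path)
    # CIFAR10 and MNIST: any folder works, torchvision downloads into it.
    return Path(dataset_path or "data")
```

With this sketch, `build_parser().parse_args(["--dataset", "mri", "--dataset-path", "brain_tumor_dataset"])` yields an `args.dataset_path` that can be passed straight to `resolve_dataset_path`.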

Packages:

torch >= 2.1.0

torchvision >= 0.16.0

numpy >= 1.23 (although probably any version will do)

scikit-learn (sklearn) >= 1.1.3 (although probably any version will do)

matplotlib >= 3.4 (although probably any version will do)

CUDA 11.8
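To compare a local environment against these minimum versions, something like the following sketch might help. The helper names are hypothetical, the version parsing is deliberately simplistic (it only looks at the leading numeric components), and `scikit-learn` is used because that is the distribution name under which sklearn is installed. CUDA is not a Python package and is not checked here.

```python
# Minimal sketch: compare installed package versions with the minimums listed above.
import re
from importlib import metadata

MINIMUM_VERSIONS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "numpy": "1.23",
    "scikit-learn": "1.1.3",  # distribution name of the sklearn package
    "matplotlib": "3.4",
}


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); local and pre-release suffixes are ignored."""
    numbers = []
    for part in text.split("+")[0].split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return a dict mapping each package to 'ok', 'missing' or 'too old (<version>)'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```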

gdbc081128 commented 3 months ago

Thank you very much for your patient explanation.