pps-lab / fl-analysis

Layer xx is NaN! #4

Open Jiahu235 opened 6 days ago

Jiahu235 commented 6 days ago

Hello! I'm encountering an error when running the code, and it occurs consistently on both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the train_configs directory), the run reports "Layer xx is NaN!" for every layer. I also receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor."

Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12       test_accuracy= 0.1046875        adv_success= 0  test_loss= nan  duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400
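
To find where training first goes wrong, it may help to scan the full log for the earliest round whose test_loss is NaN and for the set of layers being flagged. The following is only a sketch that assumes the log keeps the exact line shapes shown above ("Layer N is NaN!" and "round= ... test_loss= ..."); the function names are hypothetical and not part of fl-analysis.

```python
import math
import re

# Matches "key= value" pairs as printed in the per-round summary line.
PAIR = re.compile(r"(\w+)=\s*(\S+)")
# Matches the per-layer message printed when a layer contains NaN.
LAYER = re.compile(r"Layer (\d+) is NaN!")


def parse_round_line(line):
    """Turn 'round= 12  test_loss= nan ...' into a dict of floats."""
    return {key: float(value) for key, value in PAIR.findall(line)}


def first_nan_round(lines):
    """Return the first round number whose test_loss is NaN, else None."""
    for line in lines:
        line = line.strip()
        if not line.startswith("round="):
            continue
        stats = parse_round_line(line)
        if math.isnan(stats.get("test_loss", 0.0)):
            return int(stats["round"])
    return None


def nan_layers(lines):
    """Return the sorted, de-duplicated layer indices reported as NaN."""
    found = set()
    for line in lines:
        match = LAYER.match(line.strip())
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "Layer 0 is NaN!",
        "Layer 9 is NaN!",
        "round= 12       test_accuracy= 0.1046875        adv_success= 0  "
        "test_loss= nan  duration= 2.9140660762786865",
    ]
    print(first_nan_round(sample), nan_layers(sample))
```

If the earliest NaN round turns out to be the very first one, that points more toward the input data or the initial model than toward gradual divergence during training.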

Here is my mnist_setup.yml file for the MNIST dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_mnist
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: mnist
environment:
    experiment_name: lenet5_mnist
    # load_model: ../models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...

And this is my mnist_setup.yml file for the CIFAR-10 dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_cifar
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: cifar10
environment:
    experiment_name: lenet5_cifar
    # load_model: /home/hujia/fl-analysis/models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...

I suspect that the issue might stem from an incorrect version of a package in my environment configuration, but what confuses me is that the code runs correctly with the Shakespeare dataset.

Jiahu235 commented 6 days ago

Here are the packages in my environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.1.0                    pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
ca-certificates           2024.3.11            h06a4308_0  
cachetools                4.2.4                    pypi_0    pypi
certifi                   2022.12.7        py37h06a4308_0  
charset-normalizer        3.3.2                    pypi_0    pypi
configargparse            1.7                      pypi_0    pypi
cudatoolkit               10.1.243             h6bb024c_0  
cudnn                     7.6.5                cuda10.1_0  
gast                      0.3.3                    pypi_0    pypi
google-auth               1.35.0                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
grpcio                    1.62.2                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
importlib-metadata        6.7.0                    pypi_0    pypi
joblib                    1.3.2                    pypi_0    pypi
keras                     2.3.1                    pypi_0    pypi
keras-applications        1.0.8                    pypi_0    pypi
keras-preprocessing       1.1.2                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
markdown                  3.4.4                    pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
mashumaro                 3.9.1                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
numpy                     1.16.4                   pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
openssl                   1.1.1w               h7f8727e_0  
opt-einsum                3.3.0                    pypi_0    pypi
orderedset                2.0.3                    pypi_0    pypi
packaging                 24.0                     pypi_0    pypi
pandas                    0.24.2                   pypi_0    pypi
pip                       22.3.1           py37h06a4308_0  
protobuf                  3.20.0                   pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
pyasn1                    0.5.1                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
python                    3.7.16               h7a1cb2a_0  
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         2.0.0                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.4.1                    pypi_0    pypi
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0  
tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi
tensorflow-gpu            2.2.0                    pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.14               h39e8969_0  
typing-extensions         4.7.1                    pypi_0    pypi
urllib3                   2.0.7                    pypi_0    pypi
werkzeug                  2.2.3                    pypi_0    pypi
wheel                     0.38.4           py37h06a4308_0  
wrapt                     1.16.0                   pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
zipp                      3.15.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1