pps-lab / fl-analysis

Layer xx is NaN! #4

Open Jiahu235 opened 4 months ago

Jiahu235 commented 4 months ago

Hello! I'm encountering an error when running the code, consistently across both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the train_configs directory), it reports "Layer xx is NaN!" for every layer. Additionally, I receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor."

Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12       test_accuracy= 0.1046875        adv_success= 0  test_loss= nan  duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400

Here is my mnist_setup.yml file for the MNIST dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_mnist
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: mnist
environment:
    experiment_name: lenet5_mnist
    # load_model: ../models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...

And this is my mnist_setup.yml file for the CIFAR-10 dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_cifar
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: cifar10
environment:
    experiment_name: lenet5_cifar
    # load_model: /home/hujia/fl-analysis/models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...

I suspect that the issue stems from an incorrect package version in my environment, but what confuses me is that the code runs correctly with the Shakespeare dataset.

Jiahu235 commented 4 months ago

Here are the packages in my environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.1.0                    pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
ca-certificates           2024.3.11            h06a4308_0  
cachetools                4.2.4                    pypi_0    pypi
certifi                   2022.12.7        py37h06a4308_0  
charset-normalizer        3.3.2                    pypi_0    pypi
configargparse            1.7                      pypi_0    pypi
cudatoolkit               10.1.243             h6bb024c_0  
cudnn                     7.6.5                cuda10.1_0  
gast                      0.3.3                    pypi_0    pypi
google-auth               1.35.0                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
grpcio                    1.62.2                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
importlib-metadata        6.7.0                    pypi_0    pypi
joblib                    1.3.2                    pypi_0    pypi
keras                     2.3.1                    pypi_0    pypi
keras-applications        1.0.8                    pypi_0    pypi
keras-preprocessing       1.1.2                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
markdown                  3.4.4                    pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
mashumaro                 3.9.1                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
numpy                     1.16.4                   pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
openssl                   1.1.1w               h7f8727e_0  
opt-einsum                3.3.0                    pypi_0    pypi
orderedset                2.0.3                    pypi_0    pypi
packaging                 24.0                     pypi_0    pypi
pandas                    0.24.2                   pypi_0    pypi
pip                       22.3.1           py37h06a4308_0  
protobuf                  3.20.0                   pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
pyasn1                    0.5.1                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
python                    3.7.16               h7a1cb2a_0  
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         2.0.0                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.4.1                    pypi_0    pypi
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0  
tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi
tensorflow-gpu            2.2.0                    pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.14               h39e8969_0  
typing-extensions         4.7.1                    pypi_0    pypi
urllib3                   2.0.7                    pypi_0    pypi
werkzeug                  2.2.3                    pypi_0    pypi
wheel                     0.38.4           py37h06a4308_0  
wrapt                     1.16.0                   pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
zipp                      3.15.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1

hiddely commented 4 months ago

Hi, thanks for the information. This error indicates the model weights are too large. Does this error appear immediately or only after some rounds?
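To help answer that question, here is a minimal sketch of a per-layer check that reports the first round in which any weight array stops being finite. It is not code from this repository: the helper names `non_finite_layers` and `first_bad_round` are hypothetical, and it only assumes that the weights of each round are available as a list of NumPy arrays (which is what Keras `model.get_weights()` returns).

```python
import numpy as np


def non_finite_layers(weights):
    """Return the indices of weight arrays that contain NaN or Inf."""
    return [i for i, w in enumerate(weights) if not np.all(np.isfinite(w))]


def first_bad_round(weights_per_round):
    """Return (round_number, bad_layer_indices) for the first round with
    non-finite weights, or None if every round is healthy.

    Rounds are numbered from 1. `weights_per_round` is a hypothetical
    iterable holding one list of NumPy arrays per round.
    """
    for round_number, weights in enumerate(weights_per_round, start=1):
        bad = non_finite_layers(weights)
        if bad:
            return round_number, bad
    return None


# Small illustration with made-up weights (not real training output).
healthy = [np.ones((2, 2), dtype=np.float32), np.zeros(3, dtype=np.float32)]
broken = [np.ones((2, 2), dtype=np.float32),
          np.array([1.0, np.nan], dtype=np.float32)]
print(first_bad_round([healthy, healthy, broken]))
```

If the first bad round is round 1, the problem is probably present from the start (for example in the input data or the initial weights). If it only shows up after several rounds, divergence during training is the more likely explanation.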

One straightforward way to mitigate this issue might be to reduce the learning rate.
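As a toy illustration of why the learning rate matters, the sketch below runs plain gradient descent on a one-dimensional quadratic in float32. This is not the training loop of this repository, and the function name `gradient_descent` and the `curvature` parameter are made up for the example. With too large a step the weight grows in magnitude every step, overflows to Inf, and then turns into NaN; with a smaller step it stays finite.

```python
import numpy as np


def gradient_descent(learning_rate, steps=200, curvature=10.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent.

    Uses a float32 array so that overflow behaves like it would for
    float32 model weights. Returns the final weight as a Python float.
    """
    w = np.array([1.0], dtype=np.float32)
    # Silence NumPy's overflow/invalid warnings; the non-finite values
    # are exactly what this demonstration is meant to show.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            grad = curvature * w          # derivative of the quadratic
            w = w - learning_rate * grad  # standard gradient step
    return float(w[0])


too_large = gradient_descent(learning_rate=0.5)   # each step multiplies w by -4
smaller = gradient_descent(learning_rate=0.05)    # each step multiplies w by 0.5

print("learning_rate=0.5  ->", too_large)
print("learning_rate=0.05 ->", smaller)
```

The stable range depends on the model and the data, so the specific values here say nothing about which learning rate is right for lenet5_mnist or lenet5_cifar. They only show the mechanism by which weights can end up as NaN.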