mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[SSD] : Loss is nan, stopping training #541

Closed chandrasekhard2 closed 2 years ago

chandrasekhard2 commented 2 years ago

I tried to run SSD in Docker with the params below:

export BATCHSIZE=32
export NUMEPOCHS=${NUMEPOCHS:-8}
export DATASET_DIR="/datasets/open-images-v6-mlperf"
export EXTRA_PARAMS='--lr 0.0001 --output-dir=/results'

Command: torchrun train.py --datapath=/datasets/open-images-v6-mlperf

I get the "Loss is nan, stopping training" error shown below.

[screenshot of the training log showing the error]
johntran-nv commented 2 years ago

@ahmadki is going to look at this.

chandrasekhard2 commented 2 years ago

I'm running this on single A100 GPU.

ahmadki commented 2 years ago

The train.py file doesn't read environment variables. So:

export BATCHSIZE=32
export NUMEPOCHS=${NUMEPOCHS:-8}
export DATASET_DIR="/datasets/open-images-v6-mlperf"
export EXTRA_PARAMS='--lr 0.0001 --output-dir=/results'

has no effect when you call train.py directly. Instead, the code uses the default lr, which is 0.02. You can see this in your logs: the lr is 0.0006 even during the warmup iterations, which is already larger than the 0.0001 you requested, so your setting was clearly not applied.

If you want to train the model by calling train.py directly, you need to provide these values as CLI arguments (see python train.py --help for more information). Alternatively, you can train the model using the run_and_time.sh script, which reads the environment variables, converts them to CLI arguments, and calls train.py for you.
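As a rough sketch of both approaches (the --batch-size and --epochs flag names below are assumptions; --datapath, --lr and --output-dir appear in this thread, so check python train.py --help for the exact names in your checkout):

# Option 1: pass everything as CLI arguments to train.py directly
# (flag names other than --datapath, --lr, --output-dir are assumed)
torchrun train.py \
    --datapath=/datasets/open-images-v6-mlperf \
    --batch-size=32 \
    --epochs=8 \
    --lr=0.0001 \
    --output-dir=/results

# Option 2: keep the environment variables and let run_and_time.sh
# translate them into CLI arguments and invoke train.py for you
export BATCHSIZE=32
export NUMEPOCHS=8
export DATASET_DIR="/datasets/open-images-v6-mlperf"
export EXTRA_PARAMS='--lr 0.0001 --output-dir=/results'
./run_and_time.sh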

ahmadki commented 2 years ago

@chandrasekhard2 does my last comment solve your issue? We would like to close the bug if everything is working fine.

Thank you

johntran-nv commented 2 years ago

Closing, assuming this is resolved. Feel free to reopen if this is still an issue.