Closed chandrasekhard2 closed 2 years ago
@ahmadki is going to look at this.
I'm running this on a single A100 GPU.
The train.py file doesn't read environment variables. So:
export BATCHSIZE=32
export NUMEPOCHS=${NUMEPOCHS:-8}
export DATASET_DIR="/datasets/open-images-v6-mlperf"
export EXTRA_PARAMS='--lr 0.0001 --output-dir=/results'
has no effect when you are calling train.py directly. Instead, the code is using the default lr, which is 0.02. You can see this in your logs: the lr is 0.0006 even during warmup iterations.
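To illustrate why the exported values are ignored, here is a minimal sketch of the general behaviour, assuming train.py parses its options with argparse. The `build_parser` function is a hypothetical stand-in, not the real parser; only the `--lr` option and its default of 0.02 are taken from this thread.

```python
import argparse
import os


def build_parser():
    # Hypothetical stand-in for the parser in train.py. Only --lr and its
    # default of 0.02 come from the discussion; everything else is assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.02)
    return parser


# Exporting a variable only changes the process environment.
os.environ["EXTRA_PARAMS"] = "--lr 0.0001 --output-dir=/results"

# With no CLI arguments, argparse does not consult os.environ,
# so the default learning rate is used.
args = build_parser().parse_args([])
print(args.lr)

# Passing the same value on the command line does take effect.
args_cli = build_parser().parse_args(["--lr", "0.0001"])
print(args_cli.lr)
```

An environment variable only influences the run if something explicitly reads it and forwards it as an argument.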
If you want to train the model by calling train.py directly, you need to provide these values as CLI arguments (see python train.py --help for more information).
Alternatively, you can train the model using the run_and_time.sh script, which reads the env variables, converts them to CLI arguments, and calls train.py for you.
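The translation from environment variables to CLI arguments can be sketched roughly as follows. This is not the actual contents of run_and_time.sh. The function name `build_train_argv` is hypothetical, `--batch-size` and `--epochs` are assumed flag names, and `--datapath` is copied from the command quoted in this thread without being verified. Check python train.py --help for the real names.

```python
import shlex


def build_train_argv(env):
    """Turn environment-style settings into an argv list for train.py."""
    argv = ["train.py"]
    # Assumed flag names; confirm them with `python train.py --help`.
    flag_for = {
        "BATCHSIZE": "--batch-size",
        "NUMEPOCHS": "--epochs",
        "DATASET_DIR": "--datapath",
    }
    for var, flag in flag_for.items():
        if var in env:
            argv += [flag, str(env[var])]
    # EXTRA_PARAMS already holds CLI arguments, so split it shell-style.
    argv += shlex.split(env.get("EXTRA_PARAMS", ""))
    return argv


env = {
    "BATCHSIZE": "32",
    "NUMEPOCHS": "8",
    "DATASET_DIR": "/datasets/open-images-v6-mlperf",
    "EXTRA_PARAMS": "--lr 0.0001 --output-dir=/results",
}
argv = build_train_argv(env)
print(" ".join(argv))
```

In a real launcher, `env` would be `os.environ` and the resulting list would be handed to the training process, for example via `subprocess.run`.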
@chandrasekhard2 does my last comment solve your issue? We would like to close the bug if everything is working fine.
Thank you
Closing, assuming this is resolved. Feel free to reopen if this is still an issue.
I tried to run SSD in Docker with the params below:
export BATCHSIZE=32
export NUMEPOCHS=${NUMEPOCHS:-8}
export DATASET_DIR="/datasets/open-images-v6-mlperf"
export EXTRA_PARAMS='--lr 0.0001 --output-dir=/results'
command: torchrun train.py --datapath=/datasets/open-images-v6-mlperf
I get the "loss is nan, stopping training" error below.