Hi @rakshithvasudev
Can you provide more information about the system you are using?
Can you also check whether the error happens with NGC containers (for example, nvcr.io/nvidia/pytorch:22.03-py3)?
Thank you
Hello @ahmadki
Thanks for your response. Here are answers to your questions:
Namespace(amp=True, backbone='resnext50_32x4d', batch_size=16, data_augmentation='hflip', data_layout='channels_last', data_path='/datasets/open-images-v6-mlperf', dataset='openimages-mlperf', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=8, eval_batch_size=16, eval_print_freq=20, gpu=0, image_size=[800, 800], lr=0.0001, output_dir='/results', pretrained=False, print_freq=20, rank=0, resume='', seed=931998635, start_epoch=0, sync_bn=False, target_map=0.34, test_only=False, trainable_backbone_layers=3, warmup_epochs=1, warmup_factor=0.001, workers=4, world_size=8)
Driver Version: 470.42.01, CUDA Version: 11.4 (on host)
I understand that, given my host driver, nvcr.io/nvidia/pytorch:21.10-py3 is a compatible NGC container that doesn't need backward compatibility. That's why I tried the image given in the Dockerfile, the image you mentioned above, and the image I deemed compatible. They are as follows:
1. pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel
2. nvcr.io/nvidia/pytorch:22.03-py3
3. nvcr.io/nvidia/pytorch:21.10-py3
I'm able to reproduce the same error on all three Docker base images listed here. Any help would be appreciated.
Furthermore, I ran nccl-tests and didn't see any errors.
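For reference, by nccl-tests I mean the standard all_reduce_perf sweep from NVIDIA's nccl-tests repo; the invocation was roughly the following (the message-size range is arbitrary):
# sweep all-reduce from 8 bytes to 256 MB across all 8 GPUs on this node
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8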
Thank you for providing the information. Using NGC containers helps us narrow down the issue, so let's keep using them.
First, please make sure you are using the --ipc=host flag in your docker run command:
docker run --rm -it \
--gpus=all \
--ipc=host \
-v <HOST_DATA_DIR>:/datasets/open-images-v6-mlperf \
mlperf/single_stage_detector bash
This ensures the processes inside your container have enough shared memory for communication. You can run:
df -h | grep shm
from inside the container to make sure there is enough shared memory.
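As a side note (my assumption, not a requirement of the reference implementation): if you can't share the host IPC namespace, you can instead give the container a large /dev/shm explicitly with Docker's --shm-size flag, for example:
# hypothetical alternative to --ipc=host; 32g is an arbitrary size, pick one that fits your host
docker run --rm -it \
  --gpus=all \
  --shm-size=32g \
  -v <HOST_DATA_DIR>:/datasets/open-images-v6-mlperf \
  mlperf/single_stage_detector bash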
If the above flag is already in use, please set the following debug environment variables and attach the logs:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
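For example, from inside the container (a minimal sketch; the log path under /results is arbitrary, and run_and_time.sh is the launch script you are already using):
# enable verbose NCCL / torch.distributed logging and capture the full output to a file
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
./run_and_time.sh 2>&1 | tee /results/nccl_debug.log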
Thanks @ahmadki!
I was able to run to completion. I actually had --ipc=host set initially too, so I'm not sure what the issue was. I exited my container and restarted everything; the run failed the first time I launched it, but on the second attempt it ran to completion.
Since my objective was just to verify that SSD training works with the OpenImages dataset, I'm closing this issue.
For those curious, the container I was running was nvcr.io/nvidia/pytorch:21.10-py3.
If you hit a similar issue, try launching again and see if that helps.
Epoch: [4] [9040/9143] eta: 0:01:00 lr: 0.000100 loss: 0.5669 (0.5804) bbox_regression: 0.2549 (0.2566) classification: 0.3171 (0.3238) time: 0.5812 data: 0.0002 max mem: 27072
Epoch: [4] [9060/9143] eta: 0:00:48 lr: 0.000100 loss: 0.5765 (0.5804) bbox_regression: 0.2603 (0.2566) classification: 0.3166 (0.3238) time: 0.5871 data: 0.0002 max mem: 27072
Epoch: [4] [9080/9143] eta: 0:00:36 lr: 0.000100 loss: 0.5845 (0.5804) bbox_regression: 0.2584 (0.2566) classification: 0.3273 (0.3238) time: 0.5851 data: 0.0002 max mem: 27072
Epoch: [4] [9100/9143] eta: 0:00:25 lr: 0.000100 loss: 0.5873 (0.5804) bbox_regression: 0.2580 (0.2566) classification: 0.3248 (0.3238) time: 0.5863 data: 0.0002 max mem: 27072
Epoch: [4] [9120/9143] eta: 0:00:13 lr: 0.000100 loss: 0.5802 (0.5805) bbox_regression: 0.2578 (0.2566) classification: 0.3307 (0.3239) time: 0.5838 data: 0.0002 max mem: 27072
Epoch: [4] [9140/9143] eta: 0:00:01 lr: 0.000100 loss: 0.5604 (0.5804) bbox_regression: 0.2499 (0.2566) classification: 0.3144 (0.3238) time: 0.5817 data: 0.0009 max mem: 27072
Epoch: [4] [9142/9143] eta: 0:00:00 lr: 0.000100 loss: 0.5603 (0.5804) bbox_regression: 0.2461 (0.2566) classification: 0.3144 (0.3238) time: 0.5813 data: 0.0009 max mem: 27072
Epoch: [4] Total time: 1:29:09 (0.5851 s / it)
:::MLLOG {"namespace": "", "time_ms": 1651026630084, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": 4, "metadata": {"file": "engine.py", "lineno": 60, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651026631099, "event_type": "INTERVAL_START", "key": "eval_start", "value": 4, "metadata": {"file": "engine.py", "lineno": 66, "epoch_num": 4}}
Test: [ 0/194] eta: 0:13:48 model_time: 0.5314 (0.5314) evaluator_time: 0.2267 (0.2267) time: 4.2701 data: 3.4947 max mem: 27072
Test: [ 20/194] eta: 0:02:04 model_time: 0.3035 (0.3190) evaluator_time: 0.2120 (0.2155) time: 0.5385 data: 0.0002 max mem: 27072
Test: [ 40/194] eta: 0:01:36 model_time: 0.2878 (0.3081) evaluator_time: 0.1955 (0.2173) time: 0.5298 data: 0.0002 max mem: 27072
Test: [ 60/194] eta: 0:01:18 model_time: 0.2921 (0.3029) evaluator_time: 0.1959 (0.2125) time: 0.5082 data: 0.0002 max mem: 27072
Test: [ 80/194] eta: 0:01:04 model_time: 0.2910 (0.3013) evaluator_time: 0.1930 (0.2081) time: 0.5056 data: 0.0002 max mem: 27072
Test: [100/194] eta: 0:00:52 model_time: 0.2950 (0.2997) evaluator_time: 0.2075 (0.2067) time: 0.5075 data: 0.0002 max mem: 27072
Test: [120/194] eta: 0:00:40 model_time: 0.2902 (0.2992) evaluator_time: 0.1863 (0.2044) time: 0.5053 data: 0.0002 max mem: 27072
Test: [140/194] eta: 0:00:29 model_time: 0.2890 (0.2978) evaluator_time: 0.1955 (0.2039) time: 0.5036 data: 0.0002 max mem: 27072
Test: [160/194] eta: 0:00:18 model_time: 0.2915 (0.2972) evaluator_time: 0.1988 (0.2037) time: 0.5082 data: 0.0002 max mem: 27072
Test: [180/194] eta: 0:00:07 model_time: 0.2921 (0.2970) evaluator_time: 0.2024 (0.2049) time: 0.5243 data: 0.0002 max mem: 27072
Test: [193/194] eta: 0:00:00 model_time: 0.2885 (0.2958) evaluator_time: 0.1834 (0.2030) time: 0.4867 data: 0.0002 max mem: 27072
Test: Total time: 0:01:43 (0.5343 s / it)
Averaged stats: model_time: 0.2885 (0.2957) evaluator_time: 0.1834 (0.2044)
Accumulating evaluation results...
DONE (t=92.93s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.34282
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.48390
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.36762
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.02334
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.11563
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.38066
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.40912
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.58227
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.60925
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.07781
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.29512
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.66112
:::MLLOG {"namespace": "", "time_ms": 1651027002780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34282287555785546, "metadata": {"file": "engine.py", "lineno": 108, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651027010611, "event_type": "INTERVAL_END", "key": "eval_stop", "value": 4, "metadata": {"file": "engine.py", "lineno": 109, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "train.py", "lineno": 257, "status": "success"}}
Training time 8:00:35
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020658, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
ENDING TIMING RUN AT 2022-04-27 02:37:32 AM
RESULT,SINGLE_STAGE_DETECTOR,,28890,nvidia,2022-04-26 06:36:02 PM
Hello all,
Thanks for providing the SSD implementation with ResNeXt. I'm running with a batch size of 16 using the run_and_time.sh script, without Slurm, on a single node. It seems to hit NCCL issues; it looks like an all-reduce timeout. Any ideas on how to resolve this? Thanks in advance!
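Side note (an assumption on my part, based on the torch.distributed documentation): setting one of the following before launch should make a timed-out collective raise an error with a stack trace instead of hanging, which may help narrow this down:
# surface NCCL timeouts as exceptions rather than silent hangs (env var names per PyTorch 1.10-era docs)
export NCCL_ASYNC_ERROR_HANDLING=1
# alternatively: export NCCL_BLOCKING_WAIT=1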