aurotripathy opened this issue 4 years ago
That's a good point; I had tested this configuration on a machine with 8x32 GB, so it is indeed possible the batch size is too large for 16 GB. You can modify the batch size in profiles/baselines.json, or copy it into a different configuration (profiles/16go.json does not have the scaling test).
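For example, a 16 GB profile could carry a copy of the scaling entry with a smaller batch size. This is only a sketch of what that copy might look like; whether a batch size of 64 actually fits in 16 GB (or is smaller than necessary) would still need to be tested:

{
    "name": "scaling",
    "cmd": "./image_classification/scaling/pytorch/run.sh",
    "args": {
        "--repeat": 10,
        "--number": 10,
        "--network": "resnet101",
        "--devices": 0,
        "--batch-size": 64,
        "--report": "$OUTPUT_DIRECTORY"
    },
    "cgroup": "all"
}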
Thank you for pointing me to the configuration without the scaling test. On a different note, is it MILA's intention to leave the scaling benchmark as-is (if indeed it needs 32 GB of GPU memory)?
Just throwing this out there, but it might be interesting to keep the batch sizes consistent, since convnet already uses 64 (fp32) and 128 (fp16). That way you can compare 1 x 8 GPUs (DataParallel) against 8 x 1 GPU (8 separate experiments).
Should the scaling benchmark be fp16 only? There are only 2 fp16 benchmarks, which might be something to keep in mind.
For reference, here are the relevant entries from profiles/baselines.json:
{
    "name": "convnet",
    "cmd": "./image_classification/convnets/pytorch/run.sh",
    "args": {
        "$DATA_DIRECTORY/ImageNet/train": "",
        "--repeat": 15,
        "--number": 5,
        "--batch-size": 64,
        "--arch": "resnet101",
        "--report": "$OUTPUT_DIRECTORY",
        "--workers": 8
    },
    "cgroup": "student"
},
{
    "name": "convnet_fp16",
    "cmd": "./image_classification/convnets/pytorch/run.sh",
    "args": {
        "$DATA_DIRECTORY/ImageNet/train": "",
        "--repeat": 15,
        "--number": 5,
        "--batch-size": 128,
        "--arch": "resnet101",
        "--report": "$OUTPUT_DIRECTORY",
        "--workers": 8,
        "--cuda": "",
        "--half": ""
    },
    "cgroup": "student"
},
{
    "name": "scaling",
    "cmd": "./image_classification/scaling/pytorch/run.sh",
    "args": {
        "--repeat": 10,
        "--number": 10,
        "--network": "resnet101",
        "--devices": 0,
        "--batch-size": 128,
        "--report": "$OUTPUT_DIRECTORY"
    },
    "cgroup": "all"
}
If I understood your question, "Should the scaling be just fp16?", then I say yes.
I'm extrapolating from the convnet scenario (above), where the batch size is 64 for the fp32 case and 128 for the fp16 case (which fits in 16 GB).
I also want (more like, wish) to see the "baseline" benchmarks work on a 16 GB GPU. There could be "baseline_plus" sets for bigger models/batch sizes.
Per the log, the scaling benchmark uses a ResNet101 model with a batch size of 128 (per GPU).
This causes out-of-memory errors on at least two GPU software stacks (ROCm and CUDA) with 16 GB of GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.75 GiB total capacity; 14.53 GiB already allocated; 4.88 MiB free; 230.75 MiB cached)
RuntimeError: HIP out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.98 GiB total capacity; 14.82 GiB already allocated; 782.00 MiB free; 14.97 GiB reserved in total by PyTorch)
The invocation:
(1, ['CUDA_VISIBLE_DEVICES=0', '/root/anaconda3/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '128', '--number', '10', '--network', 'resnet101'])
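A quick sanity check might be to re-run the same micro_bench.py command with the batch size halved; 64 is only a guess and may still be too large (or smaller than needed) for 16 GB cards. All flags below are copied from the invocation above, with only --batch-size changed; use whichever python belongs to your benchmark environment:

CUDA_VISIBLE_DEVICES=0 python -u ./image_classification/scaling/pytorch/micro_bench.py --distributed_dataparallel --rank 0 --world-size 1 --dist-backend nccl --dist-url tcp://localhost:8181 --batch-size 64 --number 10 --network resnet101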