mila-iqia / training

Is the scaling benchmark expected to work with GPUs w/16GB memory? #20

Open · aurotripathy opened this issue 4 years ago

aurotripathy commented 4 years ago

Per the log, the scaling benchmark uses a ResNet101 model with a batch-size of 128 (per GPU).

This causes out-of-memory errors on at least two GPU software stacks (ROCm and CUDA) with 16 GB of GPU memory.

RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.75 GiB total capacity; 14.53 GiB already allocated; 4.88 MiB free; 230.75 MiB cached)

RuntimeError: HIP out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.98 GiB total capacity; 14.82 GiB already allocated; 782.00 MiB free; 14.97 GiB reserved in total by PyTorch)

The invocation:

(1, ['CUDA_VISIBLE_DEVICES=0', '/root/anaconda3/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '128', '--number', '10', '--network', 'resnet101'])

breuleux commented 4 years ago

That's a good point; I had tested this configuration on a machine with 8 x 32 GB GPUs, so it is indeed possible that the batch size is too large for 16 GB. You can modify the batch size in profiles/baselines.json, or copy it into a different configuration (profiles/16go.json does not include the scaling test).
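
For example, a copy of the scaling entry with a smaller batch size might look like the sketch below (the value 64 is only a guess for 16 GB cards, not a tested setting; everything else matches the existing entry):

  {
    "name": "scaling",
    "cmd": "./image_classification/scaling/pytorch/run.sh",
    "args": {
      "--repeat": 10,
      "--number": 10,
      "--network": "resnet101",
      "--devices": 0,
      "--batch-size": 64,
      "--report": "$OUTPUT_DIRECTORY"
    },
    "cgroup": "all"
  }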

aurotripathy commented 4 years ago

Thank you for pointing me to the configuration without the scaling test. On a different note, is it MILA's intention to leave the scaling benchmark as-is (given that it appears to need 32 GB of GPU memory)?

Delaunay commented 4 years ago

So, just throwing this out there, but it might be interesting to be consistent with the batch sizes, since convnet already uses 64 for fp32 and 128 for fp16. That way you can compare 1 x 8 GPUs (DataParallel) against 8 x 1 GPU (8 separate experiments).

Should the scaling benchmark be fp16 only? There are only 2 fp16 benchmarks, which might be something to keep in mind. For reference, here are the current entries; a possible fp16 scaling entry is sketched after them.

  {
    "name": "convnet",
    "cmd": "./image_classification/convnets/pytorch/run.sh",
    "args":{
      "$DATA_DIRECTORY/ImageNet/train": "",
      "--repeat": 15,
      "--number": 5,
      "--batch-size": 64,
      "--arch": "resnet101",
      "--report": "$OUTPUT_DIRECTORY",
      "--workers": 8
    },
    "cgroup": "student"
  },
  {
    "name": "convnet_fp16",
    "cmd": "./image_classification/convnets/pytorch/run.sh",
    "args":{
      "$DATA_DIRECTORY/ImageNet/train": "",
      "--repeat": 15,
      "--number": 5,
      "--batch-size": 128,
      "--arch": "resnet101",
      "--report": "$OUTPUT_DIRECTORY",
      "--workers": 8,
      "--cuda": "",
      "--half": ""
    },
    "cgroup": "student"
  },
  {
    "name": "scaling",
    "cmd": "./image_classification/scaling/pytorch/run.sh",
    "args": {
      "--repeat": 10,
      "--number": 10,
      "--network": "resnet101",
      "--devices": 0,
      "--batch-size": 128,
      "--report": "$OUTPUT_DIRECTORY"
    },
    "cgroup": "all"
  }
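
For illustration only, a hypothetical scaling_fp16 entry mirroring convnet_fp16 might look like the sketch below; note that the --half flag is borrowed from the convnet script and is an assumption here, since micro_bench.py may not accept it:

  {
    "name": "scaling_fp16",
    "cmd": "./image_classification/scaling/pytorch/run.sh",
    "args": {
      "--repeat": 10,
      "--number": 10,
      "--network": "resnet101",
      "--devices": 0,
      "--batch-size": 128,
      "--half": "",
      "--report": "$OUTPUT_DIRECTORY"
    },
    "cgroup": "all"
  }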
aurotripathy commented 4 years ago

If I understood your question ("Should the scaling be just fp16?") correctly, I say yes.

I'm just extrapolating from the convnet scenario (above), where the batch-size is 64 for the fp32 case and 128 for the fp16 case (which fits in 16 GB).

I would also like (more like, wish) to see the "baseline" benchmarks work with a 16 GB GPU. There could be "baseline_plus" sets for bigger models/batch-sizes.