mila-iqia / training

Multi-GPU benchmarks fail #18

Closed nileshnegi closed 4 years ago

nileshnegi commented 4 years ago

Out of the 19 benchmarks, convnet_distributed, convnet_distributed_fp16, and dcgan_all fail to run. The default parameters are used here.

*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
Traceback (most recent call last):
  File "run.py", line 162, in run_job_def
    run_job(cmd, config, group, definition['name'])
  File "run.py", line 105, in run_job
    subprocess.check_call(f"{prefix} {cmd} {config} --seed {opt.uid + device_count}", shell=True, env=env)
  File "/opt/conda/envs/mlperf/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cgexec -g memory:all nocache ./image_classification/convnets/pytorch/run_distributed.sh --cuda --report $OUTPUT_DIRECTORY/$JOB_FILE --arch resnet101 --workers 8 --batch-size 64 --number 5 --repeat 15 $DATA_DIRECTORY/ImageNet/train --seed 8' returned non-zero exit status 134.
./image_classification/convnets/pytorch/run_distributed.sh
[17/19][ / ] FAILED | 0.88 MIN | ./image_classification/convnets/pytorch/run_distributed.sh

Total Time 52.82 s

Delaunay commented 4 years ago

Can you try the scaling branch and run the scaling benchmark?

$ git checkout scaling
$ ./image_classification/scaling/pytorch/run.sh --repeat 10 --number 5 --network resnet101 --batch-size 64

If it fails as well, there is probably a mismatch between your PyTorch version and its dependencies.
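
For what it is worth, something like the sketch below can confirm which versions and GPU backend are actually being picked up (not part of the benchmark suite, just a sanity check run inside the mlperf conda env):

# Sanity check of the installed stack; run inside the mlperf conda env.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("GPU available:", torch.cuda.is_available())
# torch.version.cuda is None on ROCm builds; recent builds expose torch.version.hip instead.
print("cuda:", getattr(torch.version, "cuda", None), "hip:", getattr(torch.version, "hip", None))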

nileshnegi commented 4 years ago

For my previous run, which gave the 'stack smashing' error, I was using the latest versions of torch (1.3.1) and torchvision (0.4.2), installed with conda after installing all Python dependencies.

I then tried a run with the scaling branch, where I installed only torch (1.3.0) after all Python dependencies were in place. I get the following error related to torchvision:

Traceback (most recent call last):
  File "./image_classification/scaling/pytorch/micro_bench.py", line 213, in <module>
    main()
  File "./image_classification/scaling/pytorch/micro_bench.py", line 174, in main
    distributed_parameters)
  File "./image_classification/scaling/pytorch/micro_bench.py", line 69, in run_benchmarking
    network = get_network(net)
  File "./image_classification/scaling/pytorch/micro_bench.py", line 30, in get_network
    segmentation_models = torchvision.models.segmentation.__dict__
AttributeError: module 'torchvision.models' has no attribute 'segmentation'
(1, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
  File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
    main()
  File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
    assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script

requirements.txt pins the torchvision version to 0.2.2.

Using torch 1.1.0 and torchvision 0.3.0 gives the stack smashing error as well:

--------------------------------------------------------------------------------
                    batch_size: 64
                          cuda: True
                       workers: 0
                          seed: 0
                       devices: 1
                        repeat: 10
                        number: 5
                        report: None
                         jr_id: 0
                           vcd: 0
                     cpu_cores: 10
--------------------------------------------------------------------------------
Initializing process group...
Rendezvous complete. Created process group...
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
(134, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
  File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
    main()
  File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
    assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script

Delaunay commented 4 years ago

Thanks, I updated the scaling branch to ignore segmentation models when they are absent. Can you retry with 1.3.0?

Yes, torchvision needed to be 0.2.2 because they added some CUDA code after that version, and it is not yet compatible with ROCm. The version will be bumped only once torchvision starts supporting ROCm.
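
(For reference, the guard in get_network amounts to something like the sketch below; the actual commit may differ in the details.)

# Sketch of the fallback in get_network (micro_bench.py); the real change may differ.
import torchvision

classification_models = torchvision.models.__dict__
# torchvision 0.2.2 has no torchvision.models.segmentation, so fall back to an
# empty dict instead of raising AttributeError on older releases.
segmentation_module = getattr(torchvision.models, "segmentation", None)
segmentation_models = segmentation_module.__dict__ if segmentation_module else {}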

nileshnegi commented 4 years ago

Unfortunately, I am still running into the stack smashing error, currently using torch 1.3.0:

--------------------------------------------------------------------------------
                    batch_size: 64
                          cuda: True
                       workers: 0
                          seed: 0
                       devices: 1
                        repeat: 10
                        number: 5
                        report: None
                         jr_id: 0
                           vcd: 0
                     cpu_cores: 10
--------------------------------------------------------------------------------
Initializing process group...
Rendezvous complete. Created process group...
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
(134, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
  File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
    main()
  File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
    assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script

If I use requirements.txt as is, I get this message

ERROR: torchvision 0.2.2 has requirement tqdm==4.19.9, but you'll have tqdm 4.31.1 which is incompatible.

Replacing tqdm with v4.19.9 also leads to the same stack smashing error.

Delaunay commented 4 years ago

tqdm is only used to show progress bars. Are you running on a standard x86_64 machine?

obilaniu commented 4 years ago

A stack smashing error is an indication of a buffer overflow in low-level C & C++ code, and cannot normally be provoked from Python code. Further, the examples that are failing are specifically the "distributed" ones.

That suggests the problem lies outside this Python repo, and specifically with the distributed module code. It could be that some library related to distributed computing protocols like MPI is miscompiled, or the wrong version of it is loaded, or a wrong implementation entirely is loaded. This part is very easy to screw up. Make sure the right libraries are found and loaded. One version or implementation might assume that a data structure is bigger than another assumes it to be, and that disagreement can lead to corruption of the stack or worse.

I would urge you to wrap your failing command with gdb --args and get a backtrace at the point of the abort, or else to do gdb path/to/corefile and get a backtrace. info proc mappings will tell you all the libraries that are currently loaded.
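
As a lighter-weight complement to gdb, a rough Linux-only sketch like the one below lists the communication and GPU runtime libraries the Python process has actually mapped (the name filters here are only illustrative):

# Rough Linux-only sketch: list shared libraries mapped into this process.
# Importing torch pulls in its bundled backends; the distributed backend may
# only be fully loaded after init_process_group.
import torch  # noqa: F401

libs = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        parts = line.split()
        if len(parts) >= 6 and parts[-1].startswith("/") and ".so" in parts[-1]:
            libs.add(parts[-1])

for lib in sorted(libs):
    if any(key in lib.lower() for key in ("nccl", "rccl", "mpi", "gloo", "hip", "cuda")):
        print(lib)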

nileshnegi commented 4 years ago

Thank you @obilaniu and @Delaunay. The problem was resolved by using a different base image.