nileshnegi closed this issue 4 years ago.
Can you try the scaling branch and run the scaling benchmark?
$ git checkout scaling
$ ./image_classification/scaling/pytorch/run.sh --repeat 10 --number 5 --network resnet101 --batch-size 64
If it fails as well, there is probably a mismatch between your PyTorch version and its dependencies.
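One quick way to confirm which versions the environment is actually picking up (assuming the same Python environment you use to run the benchmark is active):
$ python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"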
For my previous run, which gave the 'stack smashing' error, I was using the latest versions of torch (1.3.1) and torchvision (0.4.2) installed with conda, after installing all Python dependencies.
Now I tried a run on the scaling branch where I installed only torch (1.3.0) after all Python dependencies were installed. I get the following error related to torchvision:
Traceback (most recent call last):
File "./image_classification/scaling/pytorch/micro_bench.py", line 213, in <module>
main()
File "./image_classification/scaling/pytorch/micro_bench.py", line 174, in main
distributed_parameters)
File "./image_classification/scaling/pytorch/micro_bench.py", line 69, in run_benchmarking
network = get_network(net)
File "./image_classification/scaling/pytorch/micro_bench.py", line 30, in get_network
segmentation_models = torchvision.models.segmentation.__dict__
AttributeError: module 'torchvision.models' has no attribute 'segmentation'
(1, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
main()
File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script
requirements.txt lists torchvision version as 0.2.2.
Using torch 1.1.0 and torchvision 0.3.0 gives the stack smashing error as well:
--------------------------------------------------------------------------------
batch_size: 64
cuda: True
workers: 0
seed: 0
devices: 1
repeat: 10
number: 5
report: None
jr_id: 0
vcd: 0
cpu_cores: 10
--------------------------------------------------------------------------------
Initializing process group...
Rendezvous complete. Created process group...
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
(134, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
main()
File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script
Thanks, I updated the scaling branch to ignore segmentation models if they are absent. Can you retry with 1.3.0?
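For reference, a minimal sketch of the kind of guard that handles torchvision builds without the segmentation submodule; the actual change on the scaling branch may look different:

import torchvision

def get_segmentation_models():
    # torchvision < 0.3.0 does not ship torchvision.models.segmentation,
    # so fall back to an empty dict instead of raising AttributeError.
    seg = getattr(torchvision.models, "segmentation", None)
    return seg.__dict__ if seg is not None else {}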
Yes, torchvision needs to stay at 0.2.2, because CUDA code was added after that version and it is not yet compatible with ROCm. The pinned version will be bumped only once torchvision starts supporting ROCm.
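If you need to reinstall, a minimal sketch of the pins matching requirements.txt (the torch wheel itself still has to come from your ROCm/conda setup):
$ pip install torchvision==0.2.2 tqdm==4.19.9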
Unfortunately I am still running into the stack smashing error. I am currently using torch 1.3.0:
--------------------------------------------------------------------------------
batch_size: 64
cuda: True
workers: 0
seed: 0
devices: 1
repeat: 10
number: 5
report: None
jr_id: 0
vcd: 0
cpu_cores: 10
--------------------------------------------------------------------------------
Initializing process group...
Rendezvous complete. Created process group...
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
(134, ['CUDA_VISIBLE_DEVICES=0', '/opt/conda/envs/mlperf/bin/python', '-u', './image_classification/scaling/pytorch/micro_bench.py', '--distributed_dataparallel', '--rank', '0', '--world-size', '1', '--dist-backend', 'nccl', '--dist-url', 'tcp://localhost:8181', '--batch-size', '64', '--number', '5', '--network', 'resnet101'])
Traceback (most recent call last):
File "./image_classification/scaling/pytorch/scaling.py", line 120, in <module>
main()
File "./image_classification/scaling/pytorch/scaling.py", line 74, in main
assert rc == 0, 'Failed to run distributed script'
AssertionError: Failed to run distributed script
If I use requirements.txt as is, I get this message:
ERROR: torchvision 0.2.2 has requirement tqdm==4.19.9, but you'll have tqdm 4.31.1 which is incompatible.
Replacing tqdm with v4.19.9 also leads to the same stack smashing error.
tqdm is only used to show progress bars. Are you running on a standard x86_64 machine?
A stack smashing error is an indication of a buffer overflow in low-level C & C++ code, and cannot normally be provoked from Python code. Further, the examples that are failing are specifically the "distributed" ones.
That suggests the problem lies outside this Python repo, and specifically with the distributed module code. It could be that some library related to distributed computing protocols like MPI is miscompiled, or the wrong version of it is loaded, or a wrong implementation entirely is loaded. This part is very easy to screw up. Make sure the right libraries are found and loaded. One version or implementation might assume that a data structure is bigger than another assumes it to be, and that disagreement can lead to corruption of the stack or worse.
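As a rough check of which shared libraries the PyTorch extension actually resolves to (assuming a Linux system with ldd available):
$ ldd "$(python -c 'import torch; print(torch._C.__file__)')" | grep -Ei 'mpi|nccl|rccl|gloo'
An unexpected MPI or NCCL/RCCL path in that output would point to exactly the kind of mismatch described above.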
I would urge you to wrap your failing command with gdb --args and get a backtrace at the point of the abort, or else to run gdb path/to/corefile and get a backtrace. info proc mappings will tell you all the libraries that are currently loaded.
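Concretely, using the failing command from the log above, that would look roughly like this:
$ export CUDA_VISIBLE_DEVICES=0
$ gdb --args /opt/conda/envs/mlperf/bin/python -u ./image_classification/scaling/pytorch/micro_bench.py \
      --distributed_dataparallel --rank 0 --world-size 1 --dist-backend nccl \
      --dist-url tcp://localhost:8181 --batch-size 64 --number 5 --network resnet101
(gdb) run
(gdb) bt
(gdb) info proc mappings
run reproduces the crash under the debugger, bt prints the backtrace at the point of the abort, and info proc mappings lists every library mapped into the process.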
Thank you @obilaniu and @Delaunay. The problem was resolved by using a different base image.
Out of the 19 benchmarks, convnet_distributed, convnet_distributed_fp16, and dcgan_all fail to run. The default parameters are used here.
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
Traceback (most recent call last):
File "run.py", line 162, in run_job_def
run_job(cmd, config, group, definition['name'])
File "run.py", line 105, in run_job
subprocess.check_call(f"{prefix} {cmd} {config} --seed {opt.uid + device_count}", shell=True, env=env)
File "/opt/conda/envs/mlperf/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cgexec -g memory:all nocache ./image_classification/convnets/pytorch/run_distributed.sh --cuda --report $OUTPUT_DIRECTORY/$JOB_FILE --arch resnet101 --workers 8 --batch-size 64 --number 5 --repeat 15 $DATA_DIRECTORY/ImageNet/train --seed 8' returned non-zero exit status 134.
./image_classification/convnets/pytorch/run_distributed.sh
[17/19][ / ] FAILED | 0.88 MIN | ./image_classification/convnets/pytorch/run_distributed.sh
Total Time 52.82 s