microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

default gpu_burn test fails with cp error #565

Closed ecidon closed 1 year ago

ecidon commented 1 year ago

What's the issue, what's expected?: While running the default tests on an 8xA100 node that I want to benchmark, I get the following error:

[0]: cp: '/opt/superbench/bin/compare.ptx' and './compare.ptx' are the same file

How to reproduce it?: sb run -f inventory.ini -c superbench/config/default.yaml

Log message or snapshot?:

[2023-10-12 15:31:03,273 control_node:38601][ansible.py:80][INFO] Run succeed, return code 0.
[2023-10-12 15:31:03,275 control_node:38601][runner.py:446][INFO] Runner is going to run gpu-burn in local mode, proc rank 0.
[2023-10-12 15:31:03,275 control_node:38601][ansible.py:110][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'PROC_RANK=0 sb exec --output-dir outputs/2023-10-12_15-30-19 -c sb.config.yaml -C superbench.enable=gpu-burn' on remote ...
[2023-10-12 15:31:03,275 control_node:38601][ansible.py:74][INFO] Run as sudo ...
202.78.161.59 | CHANGED | rc=0 >>
[2023-10-12 22:31:08,089 test_node1:5503][monitor.py:118][INFO] Start monitoring.
[2023-10-12 22:31:08,088 test_node1:5490][executor.py:248][INFO] Executor is going to execute gpu-burn.
[2023-10-12 22:31:12,137 test_node1:5490][micro_base.py:177][INFO] Execute command - round: 0, benchmark: gpu-burn, command: cp /opt/superbench/bin/compare.ptx ./ && /opt/superbench/bin/gpu_burn -d -tc 300  && rm compare.ptx.
[0]: cp: '/opt/superbench/bin/compare.ptx' and './compare.ptx' are the same file
[2023-10-12 22:31:12,156 test_node1:5490][micro_base.py:186][ERROR] Microbenchmark execution failed - round: 0, benchmark: gpu-burn, error message: cp: '/opt/superbench/bin/compare.ptx' and './compare.ptx' are the same file
.
[2023-10-12 22:31:12,156 test_node1:5490][executor.py:133][INFO] benchmark: gpu-burn, return code: 32, result: {'return_code': [32]}.
[2023-10-12 22:31:12,156 test_node1:5490][executor.py:140][ERROR] Executor failed in gpu-burn.
202.78.161.242 | CHANGED | rc=0 >>
[2023-10-12 22:31:08,150 test_node2:6872][executor.py:248][INFO] Executor is going to execute gpu-burn.
[2023-10-12 22:31:08,150 test_node2:6887][monitor.py:118][INFO] Start monitoring.
[2023-10-12 22:31:12,209 test_node2:6872][micro_base.py:177][INFO] Execute command - round: 0, benchmark: gpu-burn, command: cp /opt/superbench/bin/compare.ptx ./ && /opt/superbench/bin/gpu_burn -d -tc 300  && rm compare.ptx.
[0]: cp: '/opt/superbench/bin/compare.ptx' and './compare.ptx' are the same file
[2023-10-12 22:31:12,229 test_node2:6872][micro_base.py:186][ERROR] Microbenchmark execution failed - round: 0, benchmark: gpu-burn, error message: cp: '/opt/superbench/bin/compare.ptx' and './compare.ptx' are the same file
.
[2023-10-12 22:31:12,229 test_node2:6872][executor.py:133][INFO] benchmark: gpu-burn, return code: 32, result: {'return_code': [32]}.
[2023-10-12 22:31:12,230 test_node2:6872][executor.py:140][ERROR] Executor failed in gpu-burn.
[2023-10-12 15:31:19,076 control_node:38601][ansible.py:80][INFO] Run succeed, return code 0.
[2023-10-12 15:31:19,077 control_node:38601][ansible.py:128][INFO] Run playbook fetch_results.yaml ...
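
The failing step in the log is `cp /opt/superbench/bin/compare.ptx ./`. `cp` refuses to copy a file onto itself, so the "are the same file" message suggests the benchmark's working directory was already `/opt/superbench/bin` (an inference from the log, not something confirmed in this thread). Below is a minimal bash sketch of a guarded copy that skips the copy in that case. The function name `safe_copy` is hypothetical, and this is not necessarily how the project fixed the problem.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of SuperBench): copy a file into a directory
# unless the file already lives there.
safe_copy() {
    local src="$1" dest_dir="$2"
    local dest
    dest="${dest_dir%/}/$(basename "$src")"

    # -ef is true when both paths refer to the same device and inode,
    # which is exactly the situation in which cp reports "are the same file".
    if [ "$src" -ef "$dest" ]; then
        echo "skip: $src already in place"
        return 0
    fi

    cp "$src" "$dest"
}
```

With such a guard, `safe_copy /opt/superbench/bin/compare.ptx ./` would succeed from any working directory. A cleanup step such as `rm compare.ptx` would need the same check, because in the same-directory case it would delete the original file.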

Additional information:
sb version: v0.9.0
sb image: superbench:v0.9.0-cuda12.1
test node info: Ubuntu 22.04, CUDA version 12.2, Docker version 24.0.6, nvidia-container-toolkit version 1.14.2-1

cp5555 commented 1 year ago

We created a PR to fix it. Could you please check https://github.com/microsoft/superbenchmark/pull/567?

ecidon commented 1 year ago

@cp5555 That worked, thank you!