mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

non-zero exit status 139 for rnnt #10

Closed: mahmoodn closed this issue 1 year ago

mahmoodn commented 2 years ago

Hi, is there any clue about the following error in RNNT?

[I] Starting running actual test.
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 303, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 264, in run_harness
    output = run_command(cmd, get_output=True, custom_env=self.env_vars)
  File "/work/code/common/__init__.py", line 64, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.19-12.45.14/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"' returned non-zero exit status 139.
Traceback (most recent call last):
  File "code/main.py", line 772, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "code/main.py", line 744, in main
    dispatch_action(main_args, config_dict, workload_id, equiv_engine_setting=equiv_engine_setting)
  File "code/main.py", line 574, in dispatch_action
    handle_run_harness(benchmark_conf, need_gpu, need_dla, profile, power)
  File "code/main.py", line 312, in handle_run_harness
    raise RuntimeError("Run harness failed!")
RuntimeError: Run harness failed!

I have no idea how to proceed further to narrow down the problem.

nv-ananjappa commented 2 years ago

@nv-etcheng Could you have a look at this?

mahmoodn commented 2 years ago

It seems that this error occurs when I run the benchmark with nsys. I modified the harness target like this:

set -o pipefail && /opt/nsight-systems-2022.2.1/bin/nsys profile -t cuda,cudnn,nvtx -o nsys_rnnt $(PYTHON3_CMD) code/main.py $(RUN_ARGS) --action="run_harness" 2>&1 | tee $(LOG_DIR)/stdout.txt;

I checked and there is no problem with the free memory on the system.

nv-alicheng commented 2 years ago

Ahh, I see. You are trying to do an nsys profile on the harness. The issue here is that code/main.py is a Python wrapper that constructs the command for the actual harness binary and launches it as a subprocess. What you want to do is run nsys on that subprocess command itself.

When you run the make run_harness target, it will print something like:

Running command: ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.19-12.45.14/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" ....

You will want to copy that entire command and run nsys profile on that command.

edit: The command is also printed in the error message:

subprocess.CalledProcessError: Command './build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.19-12.45.14/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"' returned non-zero exit status 139.
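
To make the structure concrete, here is a simplified sketch (an assumption about the rough shape, not the actual code in code/common/) of what the wrapper does: main.py assembles the full harness_rnnt command line as a string and hands it to a run_command helper, which launches it as a subprocess. The Python traceback above is just that helper reporting the child's exit status; the segmentation fault itself happens inside the C++ harness binary (139 = 128 + SIGSEGV).

    import subprocess

    def run_command(cmd, get_output=False, custom_env=None):
        # Sketch only: launch the harness binary (e.g. ./build/bin/harness_rnnt)
        # in a subshell; all GPU work happens in that child process.
        proc = subprocess.Popen(
            cmd, shell=True, env=custom_env,
            stdout=subprocess.PIPE if get_output else None)
        out, _ = proc.communicate()
        if proc.returncode != 0:
            # A segfault in the child surfaces here as a non-zero exit status.
            raise subprocess.CalledProcessError(proc.returncode, cmd)
        return out

Profiling the harness binary directly, as described above, sidesteps the wrapper entirely.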
mahmoodn commented 2 years ago

I understand that, but I couldn't find the place where the command is built. I assume there are a lot of variables in the scripts that are put together to form the final command, which is then printed on the screen as you mentioned. But the scripts and Makefiles are nested, and I am not able to find the exact location.

mahmoodn commented 2 years ago

@nv-etcheng Do you mean to change

    def _get_harness_executable(self):
        return "./build/bin/harness_rnnt"

to

    def _get_harness_executable(self):
        return "nsys profile ./build/bin/harness_rnnt"

? I mean in harness.py.

nv-alicheng commented 2 years ago

Hi @mahmoodn, you don't need to modify any part of the code. You can just copy the command that the Python harness builds and prints, and then run nsys profile <that command>. I.e., if the command printed is ./build/bin/harness_foo --some_flag, you will want to do nsys profile ./build/bin/harness_foo --some_flag.

mahmoodn commented 2 years ago

OK, that helps, but I wonder why no gputrace report is generated. Please see the complete output:

(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ /home/mnaderan/nsight-systems-2022.2.1/bin/nsys profile -t cuda,cudnn,nvtx -o nsys_rnnt ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.21-09.39.05/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
&&&& RUNNING RNN-T_Harness # /work/./build/bin/harness_rnnt
I0722 08:20:07.714337   762 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +531, GPU +0, now: CPU 966, GPU 2704 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1239, GPU +348, now: CPU 2388, GPU 3054 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +178, GPU +56, now: CPU 2566, GPU 3110 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2593, GPU 3170 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2593, GPU 3178 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +260, now: CPU 0, GPU 260 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2593, GPU 3440 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3448 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2599, GPU 3458 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 260 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3462 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2600, GPU 3470 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 262 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3470 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 263 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3472 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3482 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3482 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3482 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3490 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3490 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3490 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3498 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2601, GPU 3506 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3498 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2601, GPU 3506 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3506 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 263 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +16, now: CPU 0, GPU 279 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking 
cudaMemcpy blocking 
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 128
[I] Finished creating SUT.
[I] Starting warming up SUT.
[I] Finished warming up SUT.
[I] Starting running actual test.
Generating '/tmp/nsys-report-701f.qdstrm'
[1/1] [========================100%] nsys_rnnt.nsys-rep
Generated:
    /work/nsys_rnnt.nsys-rep
(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ /home/mnaderan/nsight-systems-2022.2.1/bin/nsys stats --report gputrace --format csv --output . nsys_rnnt.nsys-rep 
Generating SQLite file nsys_rnnt.sqlite from nsys_rnnt.nsys-rep
Exporting 7415602 events: [================================================100%]
Using nsys_rnnt.sqlite for SQL queries.
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/gputrace.py nsys_rnnt.sqlite] to [nsys_rnnt_gputrace.csv]... SKIPPED: nsys_rnnt.sqlite does not contain GPU trace data.

I used cuda,cudnn,nvtx in the profile options, similar to the other workloads. Any thoughts on that?