nvidia-riva / python-clients

Riva Python client API and CLI utils

Low GPU utilization of riva_streaming_asr_client.py #3

Closed · Squire-tomsk closed 2 years ago

Squire-tomsk commented 2 years ago

Hi, I am using the nvidia/riva/rmir_asr_conformer_en_us_str:1.9.0-beta model. I have run both the C++ and Python streaming clients and tracked GPU utilization with nvidia-smi dmon -i 1 -s mu -d 5 -o TD (GPU index 1, memory and utilization metrics, a 5-second sampling interval, and time/date columns in the output).

For the Python client:

python riva_streaming_asr_client.py \
  --num-iterations 1000 \
  --num-clients 10 \
  --input-file "<filepath>"

#Date       Time        gpu    fb  bar1    sm   mem   enc   dec
#YYYYMMDD   HH:MM:SS    Idx    MB    MB     %     %     %     %
 20220624   13:19:28      1  6257     4    61    22     0     0
 20220624   13:19:33      1  6257     4    74    25     0     0
 20220624   13:19:38      1  6257     4    76    25     0     0
 20220624   13:19:43      1  6257     4    39    13     0     0
 20220624   13:19:48      1  6257     4    68    23     0     0
 20220624   13:19:53      1  6257     4    64    23     0     0
 20220624   13:19:58      1  6257     4    89    29     0     0
 20220624   13:20:03      1  6257     4    63    21     0     0
 20220624   13:20:08      1  6257     4    70    24     0     0
 20220624   13:20:13      1  6257     4    89    29     0     0
 20220624   13:20:18      1  6257     4    90    34     0     0

For the C++ client:

riva_streaming_asr_client \
--chunk_duration_ms=1600 \
--simulate_realtime=false \
--automatic_punctuation=false \
--num_parallel_requests=10 \
--print_transcripts=false \
--interim_results=false \
--num_iterations=1000 \
--audio_file="<filepath>"

#Date       Time        gpu    fb  bar1    sm   mem   enc   dec
#YYYYMMDD   HH:MM:SS    Idx    MB    MB     %     %     %     %
20220624   13:25:58      1  6257     4    93    51     0     0
20220624   13:26:03      1  6257     4    93    52     0     0
20220624   13:26:08      1  6257     4    92    50     0     0
20220624   13:26:13      1  6257     4    93    51     0     0
20220624   13:26:18      1  6257     4    93    50     0     0
20220624   13:26:23      1  6257     4    93    49     0     0
20220624   13:26:28      1  6257     4    92    46     0     0
20220624   13:26:33      1  6257     4    92    50     0     0
20220624   13:26:38      1  6257     4    93    50     0     0
20220624   13:26:43      1  6257     4    92    50     0     0

Same file, same settings, but the C++ client uses the GPU much better. Could you explain how to improve GPU utilization for the Python client? I've tried increasing --num-clients to larger values, but it doesn't help. The audio duration is 30 s.

PeganovAnton commented 2 years ago

Hi @Squire-tomsk! I suspect the problem is the Python GIL (in fact, audio is read and sent in one thread). I think this can be fixed by replacing threading with multiprocessing. I will look into the problem more closely tomorrow.
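
For illustration, a minimal sketch of the swap under consideration; stream_one_client here is a hypothetical stand-in for the per-client work in riva_streaming_asr_client.py:

# Threading vs multiprocessing sketch. Under ThreadPoolExecutor all workers
# share one GIL, so CPU-bound audio preparation serializes; under
# ProcessPoolExecutor each worker gets its own interpreter and GIL.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def stream_one_client(path: str) -> str:
    # Hypothetical stand-in: the real worker opens a gRPC stream and
    # sends audio chunks to the Riva server.
    return path

if __name__ == "__main__":
    paths = ["<filepath>"] * 10
    with ThreadPoolExecutor(max_workers=10) as pool:   # current approach
        list(pool.map(stream_one_client, paths))
    with ProcessPoolExecutor(max_workers=10) as pool:  # proposed approach
        list(pool.map(stream_one_client, paths))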

Could you please describe your use case?

PeganovAnton commented 2 years ago

Another possible reason is that the audio file is read again on every iteration. https://github.com/nvidia-riva/python-clients/blob/928c63273176a939500e01ce176c463f1606a1ff/scripts/asr/riva_streaming_asr_client.py#L68
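
For reference, a sketch of the preloading alternative (the placeholder path and send_chunks helper are hypothetical):

# Read the audio file once up front and reuse the raw bytes on every
# iteration, instead of reopening the file inside the loop.
import wave

def send_chunks(audio: bytes) -> None:
    pass  # hypothetical: stream `audio` to the server in chunks

with wave.open("<filepath>", "rb") as f:
    audio = f.readframes(f.getnframes())

for _ in range(1000):
    send_chunks(audio)  # no disk I/O inside the loop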

Squire-tomsk commented 2 years ago

I have written code where all the preparation happens outside the ThreadPool. The effect is the same.

https://gist.github.com/Squire-tomsk/dd23a5f3079a4f9a84adf41538442f8c

The current use case is just a load test, but later the service will be used by different clients.

PeganovAnton commented 2 years ago

I calculated the average utilization using:

n=100
# Average the `sm` column over n dmon samples; with -F ' *' and dmon's
# leading space, $3 is the sm field, and the regex skips the '#' header lines.
nvidia-smi dmon -s u -i 0 -c "${n}" | awk \
  -F ' *' \
  -v n="${n}" \
  '$3 ~ /^[0-9]/ {count+=$3} END {printf "Average : %f\n", count / n}'

on the file https://github.com/nvidia-riva/python-clients/blob/main/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav. The client and server were run on the same machine.

C++ client:

riva_streaming_asr_client \
  -audio_file en-US_AntiBERTa_for_word_boosting_testing.wav \
  -num_iterations 2000 \
  -num_parallel_requests 100 1>output.txt

Average utilization: 81%

Python client:

python scripts/asr/riva_streaming_asr_client.py \
  --input-file examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
  --num-clients 100 \
  --num-iterations 2000

Average utilization: 70%

Squire-tomsk commented 2 years ago

Could you measure the standard deviation too? I suspect it will be noticeably higher for the Python client.

PeganovAnton commented 2 years ago

So far, neither multiprocessing nor preloading the audio file has given a significant improvement.

You probably misunderstood the --file-streaming-chunk parameter. Unlike -chunk_duration_ms, it specifies a number of audio frames, not milliseconds.
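
A quick conversion sketch (the 48 kHz sample rate is inferred from the numbers quoted below):

# Convert a chunk duration in milliseconds to the frame count that
# --file-streaming-chunk expects, given the file's sample rate.
def chunk_frames(chunk_duration_ms: int, sample_rate_hz: int) -> int:
    return sample_rate_hz * chunk_duration_ms // 1000

print(chunk_frames(100, 48000))  # 4800 frames, the C++ client's 100 ms default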

I ran the same commands with --num-iterations increased to 20000 and the number of utilization samples set to n=1000.

n=1000
# Mean, stddev, and standard error of the mean over n `sm` samples.
nvidia-smi dmon -s u -i 0 -c "${n}" | awk \
  -F ' *' \
  -v n="${n}" \
  '$3 ~ /^[0-9]/ {sum_+=$3; sum_squares+=$3*$3} END \
  {\
    mean=sum_/n; \
    printf "Average: %.1f\n", mean; \
    stddev=sqrt(sum_squares/n - mean*mean); \
    printf "stddev : %.1f\n", stddev; \
    printf "mean stddev : %.1f\n", stddev / sqrt(n)}'

C++

riva_streaming_asr_client \
  -audio_file en-US_AntiBERTa_for_word_boosting_testing.wav \
  -num_iterations 20000 \
  -num_parallel_requests 100 \
  1>output.txt

Mean utilization: 77.5%, utilization stddev: 15%, mean stddev: 0.5%

Python

python scripts/asr/riva_streaming_asr_client.py \
  --input-file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
  --num-clients 100 \
  --num-iterations 20000 \
  --file-streaming-chunk 4800

Mean utilization: 69.5%, utilization stddev: 17%, mean stddev: 0.5%

The default -chunk_duration_ms in riva_streaming_asr_client is 100 ms, which corresponds to 4800 frames for the file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav.

For the branch https://github.com/nvidia-riva/python-clients/tree/audio_chunk_streaming, where all audio frames are loaded before iterating over them:

Mean utilization: 72.6%, utilization stddev: 16%, mean stddev: 0.5%

I adapted your gist to test threading vs multiprocessing.

threading

python load_test.py \
  --input-path python-clients/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
  --num-iterations=20000 \
  --min-connections 100 \
  --max-connections 100

Mean utilization: 70.8%, utilization stddev: 15%, mean stddev: 0.5%

multiprocessing

python load_test.py \
  --input-path python-clients/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
  --num-iterations=20000 \
  --min-connections 100 \
  --max-connections 100 \
  --multiprocessing

Mean utilization: 72.3%, utilization stddev: 15%, mean stddev: 0.5%

I will investigate further.

PeganovAnton commented 2 years ago

Updated comment above: added stddev.

PeganovAnton commented 2 years ago

I updated riva_streaming_asr_client.py in the audio_chunk_streaming branch with a --no-threads parameter (no threads are created in the script), then ran the script under cProfile:

python -m cProfile -o profile.out scripts/asr/riva_streaming_asr_client.py \
  --input-file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
  --no-threads \
  --num-iterations 20 \
  --file-streaming-chunk 4800

The results:

import pstats
from pstats import SortKey
p = pstats.Stats(r'python-clients\profile.out')
p.sort_stats(SortKey.CUMULATIVE).print_stats()
Sun Jun 26 11:12:47 2022    python-clients\profile.out

         205323 function calls (203421 primitive calls) in 9.293 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    147/1    0.002    0.000    9.293    9.293 {built-in method builtins.exec}
        1    0.000    0.000    9.293    9.293 scripts/asr/riva_streaming_asr_client.py:4(<module>)
        1    0.000    0.000    9.083    9.083 scripts/asr/riva_streaming_asr_client.py:94(main)
        1    0.000    0.000    9.078    9.078 scripts/asr/riva_streaming_asr_client.py:49(streaming_transcription_worker)
        1    0.051    0.051    9.068    9.068 C:\Users\apeganov\python-clients\riva_api\asr.py:162(print_streaming)
     1847    0.003    0.000    8.997    0.005 C:\Users\apeganov\python-clients\riva_api\asr.py:327(streaming_response_generator)
     1847    0.003    0.000    8.990    0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_channel.py:425(__next__)
     1847    0.020    0.000    8.986    0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_channel.py:795(_next)
     1847    0.013    0.000    8.909    0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_common.py:111(wait)
     7221    0.028    0.000    8.884    0.001 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_common.py:105(_wait_once)
     7223    0.034    0.000    8.859    0.001 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\threading.py:270(wait)
    14452    8.592    0.001    8.592    0.001 {method 'acquire' of '_thread.lock' objects}
    182/7    0.001    0.000    0.216    0.031 <frozen importlib._bootstrap>:986(_find_and_load)

It looks like the Python implementation of gRPC uses threads internally, and most of the time is spent waiting on locks (the _thread.lock.acquire row above).

PeganovAnton commented 2 years ago

I updated the gist test_load_multiprocessing.py so that the channel is not recreated on every iteration. This did not make any difference in multiprocessing mode.
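
For reference, a sketch of that change (the address and run_one_iteration are hypothetical placeholders):

# Create the gRPC channel once and share it across iterations, rather than
# opening a new connection inside the loop.
import grpc

def run_one_iteration(channel: grpc.Channel) -> None:
    pass  # hypothetical: build a stub from `channel` and stream audio

channel = grpc.insecure_channel("localhost:50051")
for _ in range(1000):
    run_one_iteration(channel)  # reuse the same channel
channel.close()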

I will ask colleagues how to fix it.

PeganovAnton commented 2 years ago

This is what I found out:

  1. The Python clients aren't really intended to achieve maximum performance on their own; we have the C++ clients for those kinds of benchmarking purposes.
  2. If you run enough instances of the Python client, you can saturate the server (that is to say, the expected use case for Riva is serving multiple clients).
  3. Multiprocessing isn't going to help because of the way gRPC is implemented for Python; you'll need to do async calls. This is pretty trivial for unary requests, since you can just return a future, but it is harder for streaming mode (see the sketch below).
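
As a sketch of the async-unary idea in point 3 (the stub and its Recognize method are hypothetical placeholders, not the actual Riva API):

# On a unary-unary gRPC stub method, .future(request) returns a grpc.Future
# immediately instead of blocking, so one thread can keep many requests in
# flight and collect the responses afterwards.
def transcribe_all(stub, requests):
    futures = [stub.Recognize.future(req) for req in requests]  # non-blocking
    return [f.result() for f in futures]  # block only while gathering results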

When I ran 2 instances of the Python client script, I got 74% instead of 72%.

Squire-tomsk commented 2 years ago

Excuse me, could you explain the difference between points 2 and 3? Both approaches lead to separate Python processes, so the GIL shouldn't be a factor.

ryanleary commented 2 years ago

The Python clients are meant to serve as minimal examples of how you can perform streaming inference. If you're looking to saturate the server, we recommend using the C++ clients.