Closed. Squire-tomsk closed this issue 2 years ago.
Hi @Squire-tomsk! I suspect that the problem is the Python GIL (in fact, audio is read and sent in a single thread). I think this can be fixed by replacing threading with multiprocessing. I will look into the problem more closely tomorrow.
Could you please describe your use case?
Another possible reason is that the audio file is read again on every iteration. https://github.com/nvidia-riva/python-clients/blob/928c63273176a939500e01ce176c463f1606a1ff/scripts/asr/riva_streaming_asr_client.py#L68
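A minimal sketch of the threading-to-multiprocessing swap I have in mind (the worker body is hypothetical; the actual script has its own argument parsing and Riva API calls):

from concurrent.futures import ProcessPoolExecutor

def transcription_worker(worker_id: int) -> None:
    # Hypothetical worker: it would open its own gRPC channel to the Riva
    # server and stream preloaded audio chunks. Each process has its own GIL,
    # so reading and sending audio is no longer serialized across workers.
    ...

if __name__ == "__main__":
    num_clients = 100
    # The pool-side change is just ThreadPoolExecutor -> ProcessPoolExecutor;
    # gRPC channels must not be shared between processes.
    with ProcessPoolExecutor(max_workers=num_clients) as pool:
        list(pool.map(transcription_worker, range(num_clients)))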
I have written code where all the preparation is done outside the ThreadPool. The effect is the same.
https://gist.github.com/Squire-tomsk/dd23a5f3079a4f9a84adf41538442f8c
The current use case is just a load test, but later the service will be used by different clients.
Calculated the average utilization using:
n=100
nvidia-smi dmon -s u -i 0 -c "${n}" | awk \
-F ' *' \
-v n="${n}" \
'$3 ~ /^[0-9]/ {count+=$3}END{printf "Average : %f\n",count / n}'
on the file https://github.com/nvidia-riva/python-clients/blob/main/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav. The client and server were run on the same machine.
riva_streaming_asr_client \
-audio_file en-US_AntiBERTa_for_word_boosting_testing.wav \
-num_iterations 2000 \
-num_parallel_requests 100 1>output.txt
Average utilization: 81%
python scripts/asr/riva_streaming_asr_client.py \
--input-file examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
--num-clients 100 \
--num-iterations 2000
Average utilization: 70%
Could you measure the standard deviation too? I suppose it will be noticeably higher for the python client.
So far, neither multiprocessing nor preloading the audio file gave a significant improvement.
You probably misunderstood the --file-streaming-chunk parameter. Unlike -chunk_duration_ms, it specifies the number of frames, not milliseconds.
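In other words, frames = sample_rate * chunk_duration_ms / 1000. A quick sketch of the conversion (the 48 kHz sample rate is an assumption, consistent with the 4800-frame figure used later in this thread):

def chunk_frames(sample_rate_hz: int, chunk_duration_ms: int) -> int:
    # Number of audio frames that cover chunk_duration_ms at the given rate.
    return sample_rate_hz * chunk_duration_ms // 1000

print(chunk_frames(48000, 100))  # -> 4800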
I ran the same commands with --num-iterations increased to 20000 and the number of utilization samples increased to n=1000.
n=1000
nvidia-smi dmon -s u -i 0 -c "${n}" | awk \
-F ' *' \
-v n="${n}" \
'$3 ~ /^[0-9]/ {sum_+=$3;sum_squares+=$3*$3} END \
{\
mean=sum_/n; \
printf "Average: %.1f\n", mean; \
stddev=sqrt(sum_squares/n - mean*mean); \
printf "stddev : %.1f\n", stddev; \
printf "mean stddev : %.1f\n", stddev / sqrt(n)}'
riva_streaming_asr_client \
-audio_file en-US_AntiBERTa_for_word_boosting_testing.wav \
-num_iterations 20000 \
-num_parallel_requests 100 \
1>output.txt
Mean utilization: 77.5%
Utilization stddev: 15%
Mean stddev: 0.5%
python scripts/asr/riva_streaming_asr_client.py \
--input-file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
--num-clients 100 \
--num-iterations 20000 \
--file-streaming-chunk 4800
Mean utilization: 69.5%
Utilization stddev: 17%
Mean stddev: 0.5%
The default -chunk_duration_ms in riva_streaming_asr_client is 100 ms, which corresponds to 4800 frames for the file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav.
All audio frames are loaded before iterating over them (roughly as in the sketch after the numbers below).
Mean utilization: 72.6%
Utilization stddev: 16%
Mean stddev: 0.5%
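A sketch of the preloading, using the standard wave module (the actual reading code in the script may differ; the chunk size assumes the 48 kHz / 100 ms figures above):

import wave

CHUNK_FRAMES = 4800  # 100 ms at the assumed 48 kHz sample rate

# Read the whole file once, split it into fixed-size chunks, and reuse the
# chunk list on every iteration instead of re-reading the file each time.
with wave.open("data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav", "rb") as wf:
    bytes_per_frame = wf.getsampwidth() * wf.getnchannels()
    data = wf.readframes(wf.getnframes())

step = CHUNK_FRAMES * bytes_per_frame
chunks = [data[i:i + step] for i in range(0, len(data), step)]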
threading vs multiprocessing
python load_test.py \
--input-path python-clients/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
--num-iterations=20000 \
--min-connections 100 \
--max-connections 100
Mean utilization: 70.8%
Utilization stddev: 15%
Mean stddev: 0.5%
python load_test.py \
--input-path python-clients/data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
--num-iterations=20000 \
--min-connections 100 \
--max-connections 100 \
--multiprocessing
Mean utilization: 72.3%
Utilization stddev: 15%
Mean stddev: 0.5%
I will investigate further.
Updated the comment above: added stddev.
Updated riva_streaming_asr_client.py in the audio_chunk_streaming branch with a --no-threads parameter (no threads are created in the script). Then ran the script with cProfile:
python -m cProfile -o profile.out scripts/asr/riva_streaming_asr_client.py \
--input-file data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav \
--no-threads \
--num-iterations 20 \
--file-streaming-chunk 4800
The results:
import pstats
from pstats import SortKey
p = pstats.Stats(r'python-clients\profile.out')
p.sort_stats(SortKey.CUMULATIVE).print_stats()
Sun Jun 26 11:12:47 2022 python-clients\profile.out
205323 function calls (203421 primitive calls) in 9.293 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
147/1 0.002 0.000 9.293 9.293 {built-in method builtins.exec}
1 0.000 0.000 9.293 9.293 scripts/asr/riva_streaming_asr_client.py:4(<module>)
1 0.000 0.000 9.083 9.083 scripts/asr/riva_streaming_asr_client.py:94(main)
1 0.000 0.000 9.078 9.078 scripts/asr/riva_streaming_asr_client.py:49(streaming_transcription_worker)
1 0.051 0.051 9.068 9.068 C:\Users\apeganov\python-clients\riva_api\asr.py:162(print_streaming)
1847 0.003 0.000 8.997 0.005 C:\Users\apeganov\python-clients\riva_api\asr.py:327(streaming_response_generator)
1847 0.003 0.000 8.990 0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_channel.py:425(__next__)
1847 0.020 0.000 8.986 0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_channel.py:795(_next)
1847 0.013 0.000 8.909 0.005 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_common.py:111(wait)
7221 0.028 0.000 8.884 0.001 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\site-packages\grpc\_common.py:105(_wait_once)
7223 0.034 0.000 8.859 0.001 C:\Users\apeganov\Anaconda3\envs\riva-python-clients\lib\threading.py:270(wait)
14452 8.592 0.001 8.592 0.001 {method 'acquire' of '_thread.lock' objects}
182/7 0.001 0.000 0.216 0.031 <frozen importlib._bootstrap>:986(_find_and_load)
It looks like the Python implementation of gRPC uses threads internally, and most of the time is spent waiting on a lock ('acquire' of _thread.lock in the profile above).
Updated the gist test_load_multiprocessing.py so that the channel is not recreated on every iteration. This did not make any difference in multiprocessing mode.
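For reference, the change is roughly the following (a sketch; the gist's actual structure and the localhost:50051 address are assumptions):

import grpc

_CHANNEL = None

def get_channel(server: str = "localhost:50051") -> grpc.Channel:
    # Create the gRPC channel once per process and reuse it on every
    # iteration instead of reconnecting each time.
    global _CHANNEL
    if _CHANNEL is None:
        _CHANNEL = grpc.insecure_channel(server)
    return _CHANNEL

def run_iteration() -> None:
    channel = get_channel()
    # ... build the streaming request generator and call the ASR stub on this
    # channel; the stub can be cached alongside the channel as well ...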
I will ask colleagues how to fix it.
Here is what I found out.
When I ran 2 scripts, I got 74% instead of 72%.
Excuse me, could you explain the difference between 2 and 3? Both approaches lead to separate Python processes, so the GIL shouldn't be a factor.
The python clients are meant to serve as minimal examples of how you can perform streaming inference. If you're looking to try to saturate the server, we recommend using the cpp clients.
Hi, I am using the nvidia/riva/rmir_asr_conformer_en_us_str:1.9.0-beta model. I have used the cpp and python streaming clients and tracked GPU utilization with the command nvidia-smi dmon -i 1 -s mu -d 5 -o TD.
For the python client:
For the cpp client:
Same file, same settings, but the cpp client uses the GPU much better. Could you explain how to improve GPU utilization for the python client? I've tried changing --num-clients to larger values, but it doesn't help. The audio duration is 30 s.