opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service
Apache License 2.0

Latency with ASR STT #391

Closed endomorphosis closed 1 week ago

endomorphosis commented 1 month ago

Whisper web takes 2 seconds to transcribe one second of audio in my browser.

ASR on Habana takes 2 seconds to transcribe one second of audio through the API.

Can we change or optimize the Whisper model so that the latency is reduced?
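
As a point of comparison, here is a minimal sketch for measuring the real-time factor (RTF) locally with the Hugging Face transformers pipeline; the model ID, audio file, and clip length are assumptions, not the OPEA service configuration:

import time
from transformers import pipeline

# Rough local benchmark: time one transcription and compute the real-time
# factor (processing time / audio length). "sample_1s.wav" is a placeholder
# for a one-second test clip.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

start = time.time()
result = asr("sample_1s.wav")
elapsed = time.time() - start

audio_seconds = 1.0  # assumed length of the test clip
print(result["text"])
print(f"latency: {elapsed:.2f}s, RTF: {elapsed / audio_seconds:.2f}")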

feng-intel commented 1 month ago

2 seconds is not expected. I will check it.

endomorphosis commented 1 month ago
import requests
import json
import time
from pydub import AudioSegment
import io
import base64

def process_audio(audio_path):
    start_time = time.time()

    # Load and convert audio to 16000 Hz mono
    waveform = AudioSegment.from_file(audio_path).set_frame_rate(16000).set_channels(1)

    # Export the AudioSegment to in-memory WAV bytes
    byte_io = io.BytesIO()
    waveform.export(byte_io, format="wav")
    audio_bytes = byte_io.getvalue()

    # Encode audio bytes to base64
    audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')

    # Prepare the API request
    url = "http://159.69.148.218/v1/audio/transcriptions"
    headers = {'Content-Type': 'application/json'}
    data = {
        "byte_str": audio_base64
    }

    # Send the request to the API
    response = requests.post(url, headers=headers, data=json.dumps(data))

    # Check if the request was successful
    if response.status_code == 200:
        transcript = response.json().get('text', '')
    else:
        transcript = f"Error: {response.status_code} - {response.text}"

    end_time = time.time()
    execution_time = end_time - start_time
    return transcript, execution_time
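
For context, a hypothetical call site for the helper above (the file name is a placeholder):

if __name__ == "__main__":
    # Example invocation of process_audio() against the endpoint hard-coded above.
    transcript, seconds = process_audio("one_second.wav")
    print(f"transcript: {transcript}")
    print(f"round-trip time: {seconds:.2f}s")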


endomorphosis commented 1 month ago


I suggest https://huggingface.co/distil-whisper/distil-large-v3
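
For illustration, a minimal sketch of trying that checkpoint through the transformers pipeline (the dtype/device handling here is an assumption, not the OPEA whisper service code):

import torch
from transformers import pipeline

# Hypothetical local test of distil-large-v3; the OPEA service would instead
# switch its model ID to "distil-whisper/distil-large-v3".
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
print(asr("sample_1s.wav")["text"])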

endomorphosis commented 1 month ago


The issue is that Intel DevCloud won't let me stage a Kubernetes cluster, so I am using a reverse proxy via DigitalOcean in Seattle, and I am in Portland, Oregon. The reverse proxy is adding some latency overhead, and the user (Christoph Schuhmann) was in Germany, so I think the rest of the latency is a result of his location and his Python environment.


I tried using Distil-Whisper, but this did not improve the latency on the HPU (presumably because it has enough memory bandwidth that the model is not the limiting factor, compared to the CPU/network latency).

endomorphosis commented 1 month ago

Tested with a one-second WAV file.

Groq: [screenshot]

OpenAI: [screenshot]

Cloudflare (tested with the bytestring example): [screenshot]

Xeon, localhost: [screenshot]

Reverse proxy: DevCloud -> Seattle (DigitalOcean) -> Intel DevCloud -> Seattle (DigitalOcean) -> DevCloud: [screenshot]

HPU (top: reverse proxy, bottom: localhost): [screenshot]

feng-intel commented 1 month ago

If you care about performance, I suggest first not using a server/client setup, because of network latency. There are some test cases and profiling data here: https://github.com/huggingface/optimum-habana/tree/main/examples/speech-recognition and https://developer.habana.ai/get-started/habana-models-performance/
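
One hedged way to separate network overhead from inference time, assuming the host answers a trivial request at all (that is an assumption about the deployment):

import time
import requests

# Compare a bare round trip to the host with a full transcription request
# (see the client snippets above) to estimate how much of the 2 seconds is
# network/proxy overhead rather than Whisper inference.
host = "http://159.69.148.218"

t0 = time.time()
try:
    requests.get(host, timeout=10)  # network + proxy overhead only
except requests.RequestException as exc:
    print(f"bare request failed: {exc}")
print(f"bare round-trip: {time.time() - t0:.3f}s")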

endomorphosis commented 1 month ago

Yes, I agree, but I cannot always rely on the end-user environment being able to run ONNX models performantly.

I have also told him that he should be developing this "classroom buddy" as a JavaScript frontend with an ONNX backend and not as a Python application, but unfortunately the people doing most of the development are college students, and I do not pay them, so I cannot boss them around.

I can only try to control the chaos by forcing them to use OPEA and keeping the UI / application / inference logic separate.

endomorphosis commented 1 month ago


Christoph reports that even with the provided ASR example the response takes too long, and he was wondering whether it would be possible to do 8x Gaudi with batched inference for the ASR service to make it as fast as Groq (or even the OpenAI API).

Time to first byte (from Germany to Intel DevCloud region 1 / region 2): [screenshot]

import base64
import json
import requests
import time

# Read and encode the file
with open("temp_speech.wav", "rb") as f:
    test_audio_base64_str = base64.b64encode(f.read()).decode("utf-8")

endpoint = "http://62.146.169.111/v1/audio/transcriptions"
inputs = {"byte_str": test_audio_base64_str}

# Measure time for the API call
start_time = time.time()

response = requests.post(
    url=endpoint, 
    data=json.dumps(inputs), 
    headers={'Content-Type': 'application/json'},
    proxies={"http": None}
)

end_time = time.time()

# Calculate the API call duration
api_call_duration = end_time - start_time

# Print the JSON response
print("API Response:")
try:
    print(response.json())
except json.JSONDecodeError:
    print("Failed to decode JSON. Raw response:")
    print(response.text)

# Print the time taken
print(f"\nAPI call took {api_call_duration:.2f} seconds")

# Print additional response information
print(f"\nStatus Code: {response.status_code}")
print(f"Response Headers: {dict(response.headers)}")

@feng-intel

endomorphosis commented 1 month ago

https://github.com/Vaibhavs10/insanely-fast-whisper

This is an example of an Optimum (CUDA) based batch inference solution.
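
In transformers terms, that repo's approach corresponds roughly to chunked, batched pipeline inference; a sketch under that assumption (model ID, chunk length, and batch size are illustrative):

import torch
from transformers import pipeline

# Chunked + batched Whisper inference in the spirit of insanely-fast-whisper.
# A Gaudi port would swap the CUDA device/dtype handling for optimum-habana's.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)
out = asr("long_audio.wav", chunk_length_s=30, batch_size=24)
print(out["text"])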

feng-intel commented 1 month ago

Christoph reports that even with the provided ASR example the response takes too long

Who is Christoph? What's your platform? Which server/Dockerfile did you use? This? So that I can reproduce the issue on my side. How fast do you expect it to be?

endomorphosis commented 1 month ago

Christoph is part of an Intel Center of Excellence.

https://www.linkedin.com/in/christoph-schuhmann-59a740235/?originalSubdomain=de https://www.bloomberg.com/news/features/2023-04-24/a-high-school-teacher-s-free-image-database-powers-ai-unicorns

He is working on a project called Bud-E https://youtu.be/fTpVsN6dUNM

The goal is to get the latency down for larger models and retrieval, but he would like to keep the Whisper transcription part under 500 ms, so as to keep the time to produce the first audio bytes of output below 2 seconds for the entire pipeline, including the LLM and retrieval.

feng-intel commented 1 month ago

I can only try to control the chaos by forcing them to use OPEA and keeping the UI / application / inference logic separate

Yes. OPEA supports not only Intel GPUs, Gaudi, and CPUs, but also Nvidia and AMD platforms.

On my Gaudi2 platform:

+-----------------------------------------------------------------------------+
| HL-SMI Version:                               hl-1.16.2-rc-fw-50.1.2.0      |
| Driver Version:                                      1.16.2-f195ec4         |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A  29C   N/A    98W / 600W  |   768MiB / 98304MiB  |     0%           N/A |

  1. Follow the steps to start the Docker container:
     $ docker run -it -p 7066:7066 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/whisper-gaudi:latest bash
     ... Downloading model: openai/whisper-small
  2. Check the whisper server:
     $ python check_whisper_server.py
     time: 0.17714595794677734
     {'asr_result': 'who is pat gelsinger'}

If you still have problems, please follow these steps and share your performance data.
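
For reference, a minimal localhost client along the lines of check_whisper_server.py; the /v1/asr route and the "audio" payload field are my recollection and should be treated as assumptions, the script in the repo is authoritative:

import base64
import json
import time
import requests

# Hypothetical localhost check of the whisper server started in step 1.
# Route and payload field are assumptions; see check_whisper_server.py.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

start = time.time()
resp = requests.post(
    "http://localhost:7066/v1/asr",
    data=json.dumps({"audio": audio_b64}),
    headers={"Content-Type": "application/json"},
)
print(f"time: {time.time() - start}")
print(resp.json())  # expected shape: {'asr_result': '...'}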

endomorphosis commented 1 month ago

I guess this is more of a feature request than a bug fix. The request is that the ASR STT service itself support batch inference, and that the "insanely fast whisper" approach substitute Optimum Habana (Gaudi) for Optimum (CUDA), because the batch inference solution that package provides can perform 150 minutes of transcription in 1 min 18 sec (9,000 seconds of audio in 78 seconds, roughly 115x real-time) versus about 6x real-time for the current implementation.

feng-intel commented 1 month ago

How do you get "6x real time speed"? Can you provide your steps?

"150 minutes of transcription in 1 min 18 sec", how or where do you get this data?

endomorphosis commented 1 month ago

How do you get "6x real time speed"? Can you provide your steps?

I assumed your audio is one second long and divided: 1 / 0.17714595794677734 ≈ 5.6, which I rounded to about 6x real-time.

"150 minutes of transcription in 1 min 18 sec", how or where do you get this data?

I assume that a Gaudi GPU is about as fast as an A100. https://github.com/Vaibhavs10/insanely-fast-whisper


endomorphosis commented 1 month ago

https://x.com/reach_vb/status/1820560137892835369 [screenshot]

feng-intel commented 1 month ago

Thanks for your info and suggestions. Currently the main application scenario of OPEA is Q&A (take this as an example), so it does not support batching right now.

You can ask about static batching/continuous batching in tgi-gaudi. I believe Gaudi2's performance benchmark is close to the A100's (115x real-time speed).

endomorphosis commented 1 month ago

Yes, but the number of concurrent users per microservice and the latency could be vastly improved. I understand that I'm not your boss, but this is a feature that I would like.

I am working on getting Llama 405B working as a feature, so that I can work on a GraphRAG microservice.

feng-intel commented 1 month ago

What's your pipeline, and which models are in your microservices besides the Llama 405B microservice? Which company are you from? Or is it a personal project?

endomorphosis commented 1 month ago

I have a company in Oregon that is trying to do legal automation with a virtual avatar, because of an acute lawyer shortage in Oregon. I used to work as a cloud infrastructure engineer at Jones Farm.

I volunteer with LAION.AI (an Intel COE) to publish open source 3D animated avatars and to finetune audio tokens into Llama 3 to improve latency. I volunteer with libp2p / IPFS / Protocol Labs / Filecoin Foundation to create a new open source peer-to-peer MLOps system from scratch, focused on privacy-preserving edge compute. I volunteer with the Stanford CodeX computational law research paper reading group, and I was trying to have Llama 405B FP8 / OPEA ready to demo for the Stanford Law x LLM hackathon. I volunteer with the Yannic Kilcher machine learning Discord weekly/daily research paper reading, along with Desmond Greely at Intel Liftoff.

My cofounder works for Edgerunner.ai (Intel COE), where he has a license to my edge focused mlops project

The other cofounder runs a cryptocurrency exchange, and a video game product.

Spycsh commented 1 month ago

Hi @endomorphosis, thanks for raising this issue. The original data you showed was 2 seconds for HPU inference of one second of audio. Since you found that was due to network latency, let's move on to the case where it took ~3 seconds to transcribe the audio Christoph provided. This can happen because Habana currently needs a warmup before doing inference. I use a 15-second input audio for the warmup, so any input audio shorter than 15 seconds (the bound may not be exact, but it is close) gets full use of the HPU graph. Maybe Christoph's input audio is longer than 15 seconds, which would explain why you see ~3 seconds for a single inference. One simple check: run the same input audio multiple times and see whether the latency drops a lot.
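
A quick way to run that check, reusing the process_audio helper from earlier in this thread (the file name is a placeholder):

# Run the same clip several times; if the first call is much slower than the
# rest, the gap is warmup (HPU graph compilation) rather than steady-state
# inference latency.
for i in range(5):
    text, secs = process_audio("christoph_sample.wav")
    print(f"run {i}: {secs:.2f}s")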

In parallel, I will try to reproduce this, figure out a proper way to narrow down the Whisper inference time on HPU for a single instance, and fix any issues I encounter.

Apart from the HPU issue, ASR latency can be improved in a few other areas that we may contribute to:

  1. inference parallelism (as you said, multiple requests with different audio lengths can be batched together into one inference; see the sketch after this list)
  2. splitting long audio into multiple segments and batching the segments for one inference (as the repo you mentioned shows; correct me if I'm wrong!)
  3. updating models (whisper-large, whisper-small, whisper-tiny, distil-whisper, etc.)
  4. continuous batching of the Whisper model (using inference frameworks such as vLLM, TGI, etc.)
  5. low precision (bf16, fp16, fp8, int8, ...)
  6. scaling out more microservice instances for serving
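
As a concrete illustration of item 1, a sketch of batching several independent requests' audio through one pipeline call (file names, model, and batch size are placeholders; this is not the current OPEA implementation):

from transformers import pipeline

# Hypothetical batching of audio from concurrent requests into one call.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
files = ["req_a.wav", "req_b.wav", "req_c.wav"]  # audio collected from queued requests
results = asr(files, batch_size=len(files))
for name, res in zip(files, results):
    print(name, "->", res["text"])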

Regarding 4 and 5, I believe there is already some work on these; we will track the latest open-source work and integrate it ASAP. Regarding 3, I previously also tried Distil-Whisper but did not see a clear latency drop, so you can choose whichever model best fits your needs. Regarding 1, I believe it is an alternative to 4, so we will measure which approach is better for Whisper. Regarding 2: yes, for OPEA we currently mainly target the TalkingBot daily-conversation-with-LLMs scenario, which assumes the user does not provide very long speech as input, but it would be good to support that. Regarding 6: yes, you can try scaling out multiple ASR instances on multiple Gaudi cards.

I'm also currently doing related performance tuning for AudioQnA, trying to improve latency and throughput for ASR/TTS. Could you share your scenario so I can take your scenario and feature requests into account in the overall design and implementation?

Tell me if you have any other questions!