mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Speech Recognition Wishlist #762

Closed galv closed 2 years ago

galv commented 3 years ago

For those who don't know, I wrote the inference benchmark in ASR for v0.7.

It had two main issues: 

(1) LoadGen, the test harness for measuring performance, did not support streaming models (so the only valid "scenario" was the Offline one, where only throughput counts).

(2) There is no external language model, which means that the benchmark is not really reflective of industry use.

Regarding (1), we essentially need to be able, for a single waveform, to feed only N audio samples at a time instead of the whole thing at once, in order to reflect the "single stream" and "server" scenarios correctly. In production, we normally send at least 10ms or 30ms of new audio (or some integer multiple of those) at a time. Secondly, we need to account for time usage correctly within MLPerf while still allowing reasonable performance optimizations (for example, in SingleStream mode, if you finish processing 30ms of data in 10ms, the thread shouldn't sleep or busy-wait for 20ms; the next 30ms of data should simply be fed in). I imagine this is quite hard to do right.
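As a rough illustration (this is not loadgen code; the 30 ms chunk size, the sample rate, and the `sut_process_chunk` callback are assumptions for the sketch), feeding a waveform chunk by chunk while accounting latency against real-time arrival, without sleeping between chunks, could look like:

```python
import time

SAMPLE_RATE = 16000
CHUNK_MS = 30                                    # hypothetical streaming granularity
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000

def feed_stream(waveform, sut_process_chunk):
    """Feed one waveform to a streaming SUT in 30 ms chunks, back to back.

    No sleeping or busy-waiting between chunks: each chunk's latency is instead
    accounted against the wall-clock time at which that chunk would have been
    available in a real-time stream.
    """
    t0 = time.monotonic()
    latencies = []
    for i in range(0, len(waveform), CHUNK_SAMPLES):
        chunk = waveform[i:i + CHUNK_SAMPLES]
        # Time at which the last sample of this chunk exists in real time.
        available_at = t0 + (i + len(chunk)) / SAMPLE_RATE
        sut_process_chunk(chunk)
        # Negative values mean the SUT is running ahead of real time.
        latencies.append(time.monotonic() - available_at)
    return latencies
```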

Regarding (2), work on The People's Speech Dataset here has built a TensorFlow-based CTC model with an n-gram external language model here, but it is suitable only for the offline use case. I'm actually not 100% sure how to integrate an RNN-T model with an external language model (ask me if you're curious about why this is not straightforward). It is my personal belief that CTC models with n-gram external language models are more reflective of industry use cases than RNN-T (note that I showed up after the decision to move forward with RNN-T was made), so I would seriously recommend dropping RNN-T in favor of CTC if no one figures out how to integrate RNN-T with an external language model.
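For context, the usual way a CTC model is combined with an external n-gram LM is shallow fusion during beam search: whenever a hypothesis completes a word, its acoustic score gets a weighted LM term plus a word-insertion bonus. A minimal sketch of that scoring (the weights are illustrative, not values from the People's Speech work):

```python
def fused_score(ctc_log_prob, lm_log_prob, num_words,
                lm_weight=0.5, word_bonus=1.0):
    """Shallow-fusion score for one beam-search hypothesis.

    ctc_log_prob : log-probability of the hypothesis under the CTC model
    lm_log_prob  : log-probability of the hypothesis' word sequence under the
                   external n-gram LM (e.g. a KenLM model)
    num_words    : number of words in the hypothesis
    """
    return ctc_log_prob + lm_weight * lm_log_prob + word_bonus * num_words
```

A beam-search decoder recomputes this score for every hypothesis at each word boundary.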

Overall, help wanted to make the ASR benchmark truly reflective of industry for v1.0.

TzurV commented 3 years ago

I’m not sure there is an industry standard for benchmarking streaming ASR. There are four measurements I have used in the past:

  1. Final WER – the same as currently used, treating the system as offline.
  2. Word Latency – the latency is measured from the time the last frame of a word is provided to the recognition engine until a word (correct or incorrect) is reported in the output. Reference word timings (which can be created by a reference alignment) are needed. Maximum and average latency can be reported here.
  3. Output Stability – in streaming mode the ASR output words can be altered, especially when an LM is part of the recognizer. Stability measures how often previous words change when a new word is added to the output.
  4. Turnaround Time – the time from the moment the engine completes the recognition until it is ready to accept the next audio. In this period buffers and history are cleared. This impacts the real throughput of an ASR solution.

Clearly, measuring #2 and #3 is not as trivial as Final WER (#1). The basis for the analysis is extra logging from the ASR engine that prints a line for every change in the output (last word changed or a word added), for example:

    WordA
    WordA WordB
    WordA WordC WordD
    WordA WordC WordE
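A minimal sketch of how #2 and #3 could be computed, assuming the engine log above has already been parsed into timestamped partial hypotheses (the parsing and the data layout are assumptions, not an existing tool):

```python
from typing import List, Tuple

def output_stability(partials: List[List[str]]) -> float:
    """#3: fraction of previously emitted word positions that change between
    successive partial hypotheses (0.0 means a perfectly stable output)."""
    changed = total = 0
    for prev, cur in zip(partials, partials[1:]):
        overlap = min(len(prev), len(cur))
        changed += sum(1 for i in range(overlap) if prev[i] != cur[i])
        total += overlap
    return changed / total if total else 0.0

def word_latency_stats(ref_word_end_times: List[float],
                       first_emit_times: List[float]) -> Tuple[float, float]:
    """#2: per-position latency from the time the last audio frame of the
    reference word was fed to the engine until any word (correct or not) was
    first reported at that position. Returns (maximum, average)."""
    lat = [emit - end for end, emit in zip(ref_word_end_times, first_emit_times)]
    return max(lat), sum(lat) / len(lat)
```

For the example log above, `partials` would be `[["WordA"], ["WordA", "WordB"], ["WordA", "WordC", "WordD"], ["WordA", "WordC", "WordE"]]`, giving a stability score of 2/6.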

Clearly, for some ASR solutions, such as Connectionist Temporal Classification (CTC) with simple decoding, measures like Output Stability or Turnaround Time are less relevant.

DilipSequeira commented 3 years ago

From discussion this morning... I think for 1.0 we could reasonably consider changes to loadgen to facilitate streaming inference.

If we're doing that, we need to consider how to benchmark streaming inference. One option is "how many streams can you sustain?", which would lead us towards Multistream as the best match for streaming inference among the available scenarios (every 10ms you send N 10ms slices, and the maximum N for which the SUT can keep up with real time is the metric).
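A rough sketch of that metric (the `sut_process_slices` callback, the tick handling, and the search bound are assumptions; real loadgen support would look different):

```python
import time

TICK_S = 0.010   # new slices are issued every 10 ms

def can_sustain(sut_process_slices, n_streams, audio_slices, ticks=3000):
    """True if the SUT keeps up with real time when, every 10 ms, it is handed
    one new 10 ms slice for each of n_streams streams."""
    deadline = time.monotonic()
    for t in range(ticks):
        deadline += TICK_S
        sut_process_slices([audio_slices[t % len(audio_slices)]] * n_streams)
        if time.monotonic() > deadline:
            return False          # fell behind real time on this tick
    return True

def max_sustainable_streams(sut_process_slices, audio_slices, upper=1024):
    """Largest N for which the SUT keeps up, found by doubling then bisecting."""
    lo, hi = 0, 1
    while hi <= upper and can_sustain(sut_process_slices, hi, audio_slices):
        lo, hi = hi, hi * 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if can_sustain(sut_process_slices, mid, audio_slices):
            lo = mid
        else:
            hi = mid
    return lo
```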

Running one stream at a time might make sense in mobile, but I think even among the edge devices there's almost nothing for which running a single RNN-T stream makes sense.

TzurV commented 3 years ago

The task can be simplified by assuming that the N streams running in parallel on the same machine are all equal and that WER (#1 above) performance is the same for each.
For the suggestion below I assume that the test audio files are short, up to several minutes each. The test should be long enough, at least half an hour.
What is needed then is:
  a. Randomize the file list; each stream should process the files in a different order.
  b. Make sure that all N jobs are busy all the time.
  c. Measure performance (#1–#4 above) on just one stream process.
  d. Repeat the test.
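A minimal sketch of that procedure (the `recognize_file` callback and its returned metrics are placeholders, and Python threads stand in for however the SUT actually runs its streams):

```python
import random
import threading

def run_parallel_stream_test(files, n_streams, recognize_file, measured_stream=0):
    """N identical streams on one machine, each with its own shuffled copy of
    the file list, all kept busy; detailed metrics (#1-#4) are collected from
    just one of the streams."""
    results = []

    def stream_worker(stream_id):
        order = list(files)
        random.shuffle(order)             # (a) each stream gets a different order
        for path in order:                # (b) the worker never idles between files
            metrics = recognize_file(path)
            if stream_id == measured_stream:
                results.append(metrics)   # (c) measure one stream only
        # (d) the whole test would then be repeated with fresh shuffles

    threads = [threading.Thread(target=stream_worker, args=(i,))
               for i in range(n_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```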

galv commented 3 years ago

Here's a dump of an old document that some people in MLPerf collaborated on about a year ago, when discussions around ASR first began, for future reference:

Latency

The purpose of this section is to define “latency” for the MLPerf Inference suite, which will naturally lead to the definition of sample and query. There are at least three different possible latency metrics. Once a decision has been made, suitable latency constraints and tail latencies need to be decided upon for the different MLPerf Inference scenarios.

Metrics

Per-chunk Latency (PCL)

Preprocessing is applied to the audio input. The resulting sequence is divided into fixed-size “chunks”. Depending on the scenario, a query to the SUT will contain one or more chunks (from one or more different audio inputs). The latency is defined as the time taken to process each query.

Pros

Cons

Per-word Latency (PWL)

Defined as the latency of a word being output after all audio for that word has been input/transmitted to the system under test (SUT).

Pros

Cons

Real-time Factor (RTF)

Pros

Cons
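The dump above never spells RTF out; the conventional definition is just processing time divided by audio duration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding divided by the duration of the
    audio decoded. RTF < 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds
```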

christ1ne commented 3 years ago

Shall we decide on what datacenter/edge use case(s) we want to target, then find a suitable measurement methodology? E.g. https://www.ibm.com/cloud/learn/speech-recognition#toc-speech-rec-GaPrtSxU

christ1ne commented 3 years ago

Dilip will be the PoC to see if people on this topic would like to have a more focused discussion

christ1ne commented 3 years ago

Overall I still don't understand why the server measurement is not adequate for speech recognition. I'm assuming we are targeting a use case like Alexa. Currently Alexa speech samples are uploaded to the servers to process. Each sample is a sentence like 'Alexa, turn on the living room light', which roughly matches the current sample definition of a <15 sec speech sample used in RNNT. The server measurement seems fine to me. Can someone elaborate on the proposed change to the measurement methodology?

christ1ne commented 3 years ago

still WIP

christ1ne commented 3 years ago

moving to backlog for v1.1