mpc001 / Visual_Speech_Recognition_for_Multiple_Languages

Visual Speech Recognition for Multiple Languages

Can lipreading support streaming or online? #9

Closed: zyjcsf closed this issue 1 year ago

zyjcsf commented 1 year ago

Hi, I'm a beginner in lipreading. How low can the latency of lipreading be? Is there any way to reduce the delay? Thank you very much.

mpc001 commented 1 year ago

Hi, @zyjcsf, it takes around 20 minutes on a GPU to finish the whole evaluation on the LRS3 test set (0.9 hours of video), so the real-time factor (RTF) of our model is roughly 0.37, i.e. faster than real time. Because our lipreading models are trained in a non-streaming setting, real-time or near-real-time prediction requires modifying either the model (training it in a streaming fashion) or the evaluation process. The following steps outline how to use an offline lipreading model for streaming prediction.
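The RTF figure is simply processing time divided by the duration of the input. A minimal sketch using the approximate numbers from the comment above. The 20 minutes covers the whole evaluation run, so treat the result as a rough estimate. Note that RTF measures throughput, not latency: an offline model still has to wait for the full utterance before it can output anything.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """RTF = processing time / duration of the input; RTF < 1 means faster than real time."""
    if media_seconds <= 0:
        raise ValueError("media duration must be positive")
    return processing_seconds / media_seconds


# ~20 minutes of processing for the ~0.9-hour LRS3 test set (approximate figures).
rtf = real_time_factor(20 * 60, 0.9 * 3600)
print(f"RTF ~ {rtf:.2f}")  # RTF ~ 0.37
```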

  1. Chunk the image sequences. Split the input into small chunks (segments). The right chunk size depends on the model and on the latency requirements of your application.
  2. Infer each chunk. Transcribe the chunks sequentially with the offline lipreading model.
  3. Post-process. Apply any necessary post-processing, such as merging the per-chunk transcriptions and handling overlaps between chunks.
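The three steps above can be sketched as follows. This is a minimal, hypothetical sketch, not code from this repository: `transcribe_chunk` stands in for whatever wrapper you write around the offline model (it is not an API of the repo), and the chunk size and overlap values are illustrative assumptions. Consecutive chunks share `overlap` frames, and overlaps are resolved at the word level by dropping the longest prefix of the new chunk's words that repeats the end of the transcript so far. This naive heuristic can fail when words near chunk boundaries are misrecognised or when a word legitimately repeats.

```python
from typing import Callable, Iterator, List, Sequence


def chunk_frames(frames: Sequence, chunk_size: int, overlap: int) -> List[Sequence]:
    """Step 1: split a frame sequence into chunks; consecutive chunks share `overlap` frames."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if len(frames) == 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(frames[start:start + chunk_size])
        if start + chunk_size >= len(frames):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks


def merge_words(merged: List[str], new: List[str]) -> List[str]:
    """Step 3: append `new`, skipping its longest prefix that repeats a suffix of `merged`."""
    for k in range(min(len(merged), len(new)), 0, -1):
        if merged[-k:] == new[:k]:
            return merged + new[k:]
    return merged + new


def stream_transcribe(
    frames: Sequence,
    transcribe_chunk: Callable[[Sequence], str],  # hypothetical wrapper around the offline model
    chunk_size: int = 50,
    overlap: int = 10,
) -> Iterator[str]:
    """Step 2: transcribe chunks one by one, yielding the merged partial transcript after each."""
    words: List[str] = []
    for chunk in chunk_frames(frames, chunk_size, overlap):
        words = merge_words(words, transcribe_chunk(chunk).split())
        yield " ".join(words)
```

As a usage example, `for partial in stream_transcribe(video_frames, my_model_fn): print(partial)` prints a growing transcript, where `video_frames` and `my_model_fn` are placeholders for your own data and model wrapper. The chunk size sets a lower bound on latency: assuming 25 fps video, a 50-frame chunk means roughly 2 seconds of buffering before each partial output, plus the inference time for that chunk. Smaller chunks lower this delay but give the model less context, which usually hurts accuracy.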