ossrs / oryx

Oryx(SRS Stack) is an all-in-one, out-of-the-box, and open-source video solution for creating online video services, including live streaming and WebRTC, on the cloud or through self-hosting.
https://ossrs.io/oryx
MIT License
519 stars 111 forks

Transcript: Support distribute crowded words in timeline #163

Closed by winlinvip 6 months ago

winlinvip commented 6 months ago

Dwayne:

Image

Winlin:

This is not a bug in FFmpeg. Rather, the issue arises because Whisper recognized too many words in a single segment and did not distribute them evenly along the timeline, causing them all to appear at once.

Reproduce this issue by this video: https://youtu.be/NONRDS7Rpjg

Image

A 15-second segment that reproduces this issue:

https://github.com/ossrs/srs-stack/assets/2777660/c2d88c66-cdb3-41ec-8ccf-8948fe327c7b

This type of interview program is quite common: multiple people speaking without pauses can lead the AI to recognize the audio as continuous speech lasting more than ten seconds.

winlinvip commented 6 months ago

First of all, SRS Stack writes LF (line breaks) when a subtitle is too long. For example, if the OpenAI Whisper response is:

0
00:00:00,550 --> 00:00:15,839
For today's tech check. So tell us about the details of this report. I know Huawei is obviously a very big competitor. Yeah. And that's small but growing. Let's get it that way. But the headline here is counterpart research looked at the first six weeks of smartphone sales in China compared it to a

SRS Stack will convert to:

0
00:00:00,550 --> 00:00:15,839
For today's tech check. So tell us about the
details of this report. I know Huawei is
obviously a very big competitor. Yeah. And
that's small but growing. Let's get it that
way. But the headline here is counterpart
research looked at the first six weeks of
smartphone sales in China compared it to a
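The wrapping step can be sketched as follows. This is only a minimal illustration using Python's textwrap; the 45-character width is an assumption for this example, not the actual SRS Stack value:

```python
import textwrap

def wrap_subtitle(text: str, width: int = 45) -> str:
    """Insert LFs so that no subtitle line exceeds `width` characters.

    textwrap.wrap breaks on whitespace and never splits a word,
    which matches how the subtitle text above is wrapped.
    """
    return "\n".join(textwrap.wrap(text, width=width))
```

For the long Whisper segment shown above, this produces the multi-line block that SRS Stack embeds into the SRT file.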

This makes the subtitle block very tall; below is the result:

https://github.com/ossrs/srs-stack/assets/2777660/bb688868-8623-49f1-9045-37badb7fd855

Actually, FFmpeg's libass will do the wrapping for us, so we only need to use the output of Whisper directly. Below is the example:

https://github.com/ossrs/srs-stack/assets/2777660/732f1ad1-5c80-4f20-b35f-48fb3dd4e23b

I think it should fix almost all common cases.

winlinvip commented 6 months ago

Input file:

https://github.com/ossrs/srs-stack/assets/2777660/c2d88c66-cdb3-41ec-8ccf-8948fe327c7b

By FFmpeg:

ffmpeg -i input.mp4 -vf "subtitles=input.srt:force_style='Alignment=2,MarginV=20'" \
    -vcodec libx264 -profile:v main -preset:v medium -tune zerolatency  -bf 0  \
    -acodec aac -copyts -y output.mp4
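For reference, the command above can also be assembled programmatically as an argument list, which avoids shell quoting issues when invoking FFmpeg from code. This is a sketch; the file names are placeholders:

```python
def build_burn_in_cmd(video: str, srt: str, output: str) -> list:
    """Build the FFmpeg argument list that burns `srt` into `video`,
    mirroring the command-line invocation above."""
    vf = f"subtitles={srt}:force_style='Alignment=2,MarginV=20'"
    return [
        "ffmpeg", "-i", video,
        "-vf", vf,
        "-vcodec", "libx264", "-profile:v", "main",
        "-preset:v", "medium", "-tune", "zerolatency", "-bf", "0",
        "-acodec", "aac", "-copyts", "-y", output,
    ]
```

The list can then be passed to `subprocess.run` without going through a shell.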

Sometimes, OpenAI Whisper responds with:

0
00:00:00,550 --> 00:00:15,839
For today's tech check. So tell us about the details of this report. I know Huawei is obviously a very big competitor. Yeah. And that's small but growing. Let's get it that way. But the headline here is counterpart research looked at the first six weeks of smartphone sales in China compared it to a

The result is below:

https://github.com/ossrs/srs-stack/assets/2777660/732f1ad1-5c80-4f20-b35f-48fb3dd4e23b

Sometimes, it responds with:

0
00:00:00,550 --> 00:00:06,629
For today's tech check. So tell us about the details of this report. I know Huawei is obviously a very big competitor. Yeah. And that's

1
00:00:07,350 --> 00:00:13,829
small but growing. Let's get it that way. But the headline here is counterpart research looked at the first six weeks of smartphone

2
00:00:13,829 --> 00:00:15,789
sales in China compared it to. 

The result is below:

https://github.com/ossrs/srs-stack/assets/2777660/30612986-f3b6-4fc7-9a44-4f5b969afd57

In most situations, OpenAI Whisper generates multiple subtitle segments. If it doesn't, we might have to split them ourselves, which could be risky due to the potential for introducing bugs. Therefore, I would avoid doing this unless absolutely necessary.
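If we did ever have to split an overly long segment ourselves, a minimal sketch of distributing words evenly along the timeline might look like the following. This is hypothetical, not the Oryx code; timings are linearly interpolated by word count, and the 6-second cap is an assumption:

```python
import math

def split_segment(start: float, end: float, text: str, max_dur: float = 6.0):
    """Split the segment [start, end] into pieces no longer than
    max_dur seconds, distributing words proportionally by duration.

    Returns a list of (start, end, text) tuples.
    """
    words = text.split()
    total = end - start
    if total <= max_dur or len(words) < 2:
        return [(start, end, text)]
    n = math.ceil(total / max_dur)          # number of pieces needed
    per = len(words) / n                    # words per piece (fractional)
    pieces = []
    for i in range(n):
        lo, hi = round(i * per), round((i + 1) * per)
        if lo >= hi:
            continue
        # Interpolate timestamps assuming a uniform speaking rate.
        t0 = start + total * lo / len(words)
        t1 = start + total * hi / len(words)
        pieces.append((t0, t1, " ".join(words[lo:hi])))
    return pieces
```

The uniform-rate assumption is exactly why this is risky: real speech is not evenly paced, so letting Whisper produce the segments is preferable.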

winlinvip commented 6 months ago

Also added a Segments parameter to the Fix Queue:

image

Users can clear the subtitle if it is too long.

image

Also show the data in the overlay queue.