ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

subtitle line remains stuck for 30 mins | awk script > 2mins length #975

Open mrfragger opened 1 year ago

mrfragger commented 1 year ago
05:14:41.490 --> 05:14:42.820
any possible chance

05:14:42.820 --> 05:46:50.320
this text remains for 32 mins in subs

05:46:50.320 --> 05:14:50.420
which is correct

05:14:50.420 --> 05:14:55.710
and transcription is correct

05:14:55.710 --> 05:14:57.460
but it runs over 

05:14:57.460 --> 05:15:03.590
so gotta come up with a sed or awk script

05:15:03.590 --> 05:15:05.300
to detect if say subtitle duration

05:15:05.300 --> 05:15:11.220
exceeds 2 mins let's say

Had this happen yesterday also; I think it was in a 48-hour audiobook I was doing. This one just happened again today with a 10-hour audiobook. So what happens is:

the line "this text remains for 32 mins in subs" stays on screen constantly for 32 minutes, and whatever new subtitles can fit show just above it.

Obviously, to correct this, just change

05:14:42.820 --> 05:46:50.320 to 05:14:42.820 --> 05:14:50.320
05:46:50.320 --> 05:14:50.420 to 05:14:50.320 --> 05:14:50.420

in both cases just changing xx:46:xx.xxx to xx:14:xx.xxx.

My current command (pipe to wav, max length 78, split on word):

for f in *.opus ; do
    ffmpeg -i "$f" -f wav -ar 16000 -ac 1 - |
        ~/whisper/whisper.cpp/./main -m ~/whisper/whisper.cpp/models/ggml-medium.en.bin - -ovtt -of "$f" -l en -ml 78 -sow -t 8
    for f in *.vtt ; do
        sed -r -i .bak -e 's|Yellow|yellow|g' -e 's|blue|Blue|g' -e 's|Pink|pink|g' "$f"
    done &&
    for i in *opus.vtt ; do
        mv -i -- "$i" "$(printf '%s\n' "$i" | sed '1s/.opus.vtt/.vtt/')"
        mkdir vttsubs/
        mv *.vtt vttsubs/
    done &&
    rm *.bak
done
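(A slightly tidier variant of the same pipeline, untested: vttsubs/ is created once up front with mkdir -p, the rename loop uses its own variable instead of shadowing $f, and the .opus.vtt suffix is stripped with shell parameter expansion instead of an unanchored sed pattern.)

mkdir -p vttsubs/
for f in *.opus ; do
    ffmpeg -i "$f" -f wav -ar 16000 -ac 1 - |
        ~/whisper/whisper.cpp/./main -m ~/whisper/whisper.cpp/models/ggml-medium.en.bin - -ovtt -of "$f" -l en -ml 78 -sow -t 8
done
for v in *.opus.vtt ; do
    sed -r -i .bak -e 's|Yellow|yellow|g' -e 's|blue|Blue|g' -e 's|Pink|pink|g' "$v"
    mv -i -- "$v" vttsubs/"${v%.opus.vtt}.vtt"
done
rm -f *.bak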

I'll try to figure out an awk script that can automatically check whether a subtitle cue's duration exceeds, say, 2 minutes.
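Something like this might do it. A minimal, untested sketch (saved as, say, checkcues.awk; the filename and the 120-second threshold are just placeholders), assuming cue lines in the HH:MM:SS.mmm --> HH:MM:SS.mmm form whisper.cpp writes:

# checkcues.awk -- flag cues whose duration exceeds 2 minutes or is negative
/-->/ {
    split($1, a, /[:.]/)    # start time -> HH MM SS mmm
    split($3, b, /[:.]/)    # end time   -> HH MM SS mmm
    start = a[1]*3600 + a[2]*60 + a[3] + a[4]/1000
    stop  = b[1]*3600 + b[2]*60 + b[3] + b[4]/1000
    d = stop - start
    if (d > 120 || d < 0)
        printf "%s Line# %d: %s (%.1f s)\n", FILENAME, FNR, $0, d
}

Run it as awk -f checkcues.awk vttsubs/*.vtt. The d < 0 case also catches end-before-start cues like the 05:46:50.320 --> 05:14:50.420 one above.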

mrfragger commented 1 year ago

I'm in the process of correcting these, and here's an instance where "the sea and moon, more together than" stays on screen for an hour, hiding the new subtitles. No music or anything, so no idea what caused it.

02:42:11.810 --> 02:42:13.840 here, with nothing but

02:42:13.840 --> 03:25:12.370 the sea and moon, more together than

03:25:12.370 --> 03:33:12.370 in that crowd, or even in my rooms.

03:33:12.370 --> 02:42:20.800 Don't you understand that?"

02:42:21.600 --> 02:42:23.430 "I don't understand anything," she

02:42:23.430 --> 02:42:25.820 said with decision, determined to

To fix, I changed it to: 02:42:11.810 --> 02:42:13.840 here, with nothing but

02:42:13.840 --> 02:42:15.370 the sea and moon, more together than

02:42:15.370 --> 02:42:18.370 in that crowd, or even in my rooms.

02:42:18.370 --> 02:42:20.800 Don't you understand that?"

02:42:21.600 --> 02:42:23.430 "I don't understand anything," she

mrfragger commented 1 year ago

OK, I tried many combinations with max length, with split on word and without split on word.

It definitely is some calculation bug with max-length. I'll have to keep using it, though, as otherwise lines sometimes go way over 100 characters, so I'll just have to scan for the problem cues and fix them manually. Also, it only occurs in about 2% of the files I've done, so it's hard to say what exactly is the culprit. Once it was music, but many other problematic ones didn't have music at all. It seems that without max-length it defaults to around 100 characters per line.

maxlength80sow-example.vtt
  Line# 3126: 00:56 55
  Line# 8754: 02:34 41
  Line# 8757: 03:41 34
maxlength90sow-example.vtt
  Line# 6450: 02:34 41
  Line# 6453: 03:41 34
nomaxlength-example.vtt (this one had no timing issues with subs)
nosplitonword-example.vtt
  Line# 3195: 00:56 55
  Line# 8916: 02:34 41
  Line# 8919: 03:41 34
splitonword-example.vtt
  Line# 3162: 00:56 55
  Line# 8829: 02:34 41
  Line# 8832: 03:41 34
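A report in that shape (file, cue line number, duration) is roughly what the checkcues.awk sketch from the first comment would print, e.g.:

awk -f checkcues.awk *-example.vtt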

I tried looking at some of the code but most of it is way over my head. This bit is whisper.cpp's timestamp formatter:

//  500 -> 00:05.000
// 6000 -> 01:00.000
static std::string to_timestamp(int64_t t, bool comma = false) {
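    // note: t is in units of 10 ms (centiseconds), so 500 -> 5 s and
    // 6000 -> 1 min, matching the examples above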
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);

    return std::string(buf);
}

This part is from openai-whisper and seems to indicate word timestamps are required when using max line width. However, I don't want vtt or srt subtitles with word timestamps, as they significantly increase file size. Definitely useful for karaoke or language learning, I suppose.

parser.add_argument("--max_line_width", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of characters in a line before breaking the line")

parser.add_argument("--max_line_count", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of lines in a segment")

Here's an example of one I corrected; the corrected timecodes are in ( ).

02:34:48.430 --> 02:34:49.280 xxxxxxxx xx xxx

02:34:49.280 --> 03:41:15.260 (02:34:54.260) xxxxxxx xxxxx xxxxxxx xx xxxx xxxxxx xxxxx xx xxx xxxxxxxxx xxxx xxx xxxxxx

(02:34:54.260) 03:41:15.260 --> 02:34:55.120 xxx xxxxx xx xxxxxxxxx

02:34:55.120 --> 02:34:59.840 xx xxxxx x xxxxxx xxxxxxxxxx xxxxxxxx xxxx xx xxx xxxxxx

mrfragger commented 1 year ago

These are just random notes on code I was looking at, but like I said... over my ability.

I think this one is from openai-whisper, if I remember correctly:

condition_on_previous_text: bool
        if True, the previous output of the model is provided as a prompt for the next window;
        disabling may make the text inconsistent across windows, but the model becomes less prone to
        getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
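whisper.cpp's main appears to expose the same switch as -nc / --no-context (see ./main --help), so re-running one of the problem files with context disabled might be worth a try; problem.opus here just stands in for an affected file:

ffmpeg -i problem.opus -f wav -ar 16000 -ac 1 - |
    ~/whisper/whisper.cpp/./main -m ~/whisper/whisper.cpp/models/ggml-medium.en.bin - -ovtt -of problem -l en -ml 78 -sow -t 8 -nc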

https://github.com/openai/whisper/blob/main/whisper/transcribe.py


            consecutive = torch.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0]
            consecutive.add_(1)
            if len(consecutive) > 0:
                # if the output contains two consecutive timestamp tokens
                slices = consecutive.tolist()
                if single_timestamp_ending:
                    slices.append(len(tokens))

                last_slice = 0
                for current_slice in slices:
                    sliced_tokens = tokens[last_slice:current_slice]
                    start_timestamp_pos = (
                        sliced_tokens[0].item() - tokenizer.timestamp_begin
                    )
                    end_timestamp_pos = (
                        sliced_tokens[-1].item() - tokenizer.timestamp_begin
                    )
                    current_segments.append(
                        new_segment(
                            start=time_offset + start_timestamp_pos * time_precision,
                            end=time_offset + end_timestamp_pos * time_precision,
                            tokens=sliced_tokens,
                            result=result,
                        )
                    )
                    last_slice = current_slice

                if single_timestamp_ending:
                    # single timestamp at the end means no speech after the last timestamp.
                    seek += segment_size
                else:
                    # otherwise, ignore the unfinished segment and seek to the last timestamp
                    last_timestamp_pos = (
                        tokens[last_slice - 1].item() - tokenizer.timestamp_begin
                    )
                    seek += last_timestamp_pos * input_stride
            else:
                duration = segment_duration
                timestamps = tokens[timestamp_tokens.nonzero().flatten()]
                if (
                    len(timestamps) > 0
                    and timestamps[-1].item() != tokenizer.timestamp_begin
                ):
                    # no consecutive timestamps but it has a timestamp; use the last one.
                    last_timestamp_pos = (
                        timestamps[-1].item() - tokenizer.timestamp_begin
                    )
                    duration = last_timestamp_pos * time_precision

                current_segments.append(
                    new_segment(
                        start=time_offset,
                        end=time_offset + duration,
                        tokens=tokens,
                        result=result,
                    )
                )
                seek += segment_size

https://github.com/openai/whisper/blob/main/whisper/decoding.py

class ApplyTimestampRules(LogitFilter):
    def __init__(
        self, tokenizer: Tokenizer, sample_begin: int, max_initial_timestamp_index: Optional[int]
    ):
        self.tokenizer = tokenizer
        self.sample_begin = sample_begin
        self.max_initial_timestamp_index = max_initial_timestamp_index

    def apply(self, logits: Tensor, tokens: Tensor):
        # suppress <|notimestamps|> which is handled by without_timestamps
        if self.tokenizer.no_timestamps is not None:
            logits[:, self.tokenizer.no_timestamps] = -np.inf

        # timestamps have to appear in pairs, except directly before EOT; mask logits accordingly
        for k in range(tokens.shape[0]):
            seq = [t for t in tokens[k, self.sample_begin :].tolist()]
            last_was_timestamp = len(seq) >= 1 and seq[-1] >= self.tokenizer.timestamp_begin
            penultimate_was_timestamp = len(seq) < 2 or seq[-2] >= self.tokenizer.timestamp_begin

            if last_was_timestamp:
                if penultimate_was_timestamp:  # has to be non-timestamp
                    logits[k, self.tokenizer.timestamp_begin :] = -np.inf
                else:  # cannot be normal text tokens
                    logits[k, : self.tokenizer.eot] = -np.inf

        # apply the `max_initial_timestamp` option
        if tokens.shape[1] == self.sample_begin and self.max_initial_timestamp_index is not None:
            last_allowed = self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
            logits[:, last_allowed + 1 :] = -np.inf

        # if sum of probability over timestamps is above any other token, sample timestamp
        logprobs = F.log_softmax(logits.float(), dim=-1)
        for k in range(tokens.shape[0]):
            timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(dim=-1)
            max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
            if timestamp_logprob > max_text_token_logprob:
                logits[k, : self.tokenizer.timestamp_begin] = -np.inf