Open mrfragger opened 1 year ago
I'm in the process of correcting the bad ones, and here's an instance where "the sea and moon, more together than" stays on screen for about 43 minutes (02:42:13 to 03:25:12), hiding the new subtitles. There's no music or anything at that point, so I have no idea what caused it.
02:42:11.810 --> 02:42:13.840 here, with nothing but
02:42:13.840 --> 03:25:12.370
the sea and moon, more together than
03:25:12.370 --> 03:33:12.370
in that crowd, or even in my rooms.
03:33:12.370 --> 02:42:20.800
Don't you understand that?"
02:42:21.600 --> 02:42:23.430 "I don't understand anything," she
02:42:23.430 --> 02:42:25.820 said with decision, determined to
To fix it, I changed it to:
02:42:11.810 --> 02:42:13.840 here, with nothing but
02:42:13.840 --> 02:42:15.370
the sea and moon, more together than
02:42:15.370 --> 02:42:18.370
in that crowd, or even in my rooms.
02:42:18.370 --> 02:42:20.800
Don't you understand that?"
02:42:21.600 --> 02:42:23.430 "I don't understand anything," she
OK, I tried many combinations: different max lengths, with split on word, and without split on word.
It definitely looks like some calculation bug with max-length. I have to use it anyway, because lines sometimes go way over 100 characters, so I just have to scan for the problem cues and fix them manually. It also only occurs in about 2% of the ones I've done, so it's hard to say what exactly the culprit is. Once it was music, but many other problematic ones had no music at all. Without max-length, lines seem to come out at around 100 characters.
maxlength80sow-example.vtt
Line# 3126: 00:56 55
Line# 8754: 02:34 41
Line# 8757: 03:41 34
maxlength90sow-example.vtt
Line# 6450: 02:34 41
Line# 6453: 03:41 34
nomaxlength-example.vtt (this one had no timing issues with subs)
nosplitonword-example.vtt
Line# 3195: 00:56 55
Line# 8916: 02:34 41
Line# 8919: 03:41 34
splitonword-example.vtt
Line# 3162: 00:56 55
Line# 8829: 02:34 41
Line# 8832: 03:41 34
I tried looking at so much code, but most of it is way over my head.
// 500 -> 00:05.000
// 6000 -> 01:00.000
static std::string to_timestamp(int64_t t, bool comma = false) {
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);

    return std::string(buf);
}
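As a rough aid for reading the function above, here is a Python port. It is only a sketch for this thread, not part of whisper.cpp; the one thing it assumes is what the C++ shows, namely that `t` counts 10 ms ticks. Because the formatting is plain division and remainder, a cue that ends an hour late would already have had a wrong `t` before this function ran, which hints (but does not prove) that the problem sits upstream of the timestamp formatting.

```python
def to_timestamp(t: int, comma: bool = False) -> str:
    """Format a time given in 10 ms ticks as HH:MM:SS.mmm (port of the C++ above)."""
    msec = t * 10
    hr, msec = divmod(msec, 1000 * 60 * 60)
    minute, msec = divmod(msec, 1000 * 60)
    sec, msec = divmod(msec, 1000)
    separator = "," if comma else "."
    return f"{hr:02d}:{minute:02d}:{sec:02d}{separator}{msec:03d}"


print(to_timestamp(500))      # 00:00:05.000
print(to_timestamp(6000))     # 00:01:00.000
print(to_timestamp(1231237))  # 03:25:12.370, the bad end time from the first example
```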
This part is from openai-whisper and seems to indicate word timestamps are required when using max length. However I don't want vtt or srt subtitles with word-timestamps as it significantly increases file size. Definitely useful for karaoke or language learning I suppose.
parser.add_argument("--max_line_width", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of characters in a line before breaking the line")
parser.add_argument("--max_line_count", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of lines in a segment")
Here's an example of one; the corrected timecodes are in ( ).
02:34:48.430 --> 02:34:49.280 xxxxxxxx xx xxx
02:34:49.280 --> 03:41:15.260 (02:34:54.260) xxxxxxx xxxxx xxxxxxx xx xxxx xxxxxx xxxxx xx xxx xxxxxxxxx xxxx xxx xxxxxx
(02:34:54.260) 03:41:15.260 --> 02:34:55.120 xxx xxxxx xx xxxxxxxxx
02:34:55.120 --> 02:34:59.840 xx xxxxx x xxxxxx xxxxxxxxxx xxxxxxxx xxxx xx xxx xxxxxx
These are just random notes on code I was looking at, but like I said, it's beyond my ability.
I think this one is from openai-whisper, if I remember correctly:
condition_on_previous_text: bool
if True, the previous output of the model is provided as a prompt for the next window;
disabling may make the text inconsistent across windows, but the model becomes less prone to
getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
https://github.com/openai/whisper/blob/main/whisper/transcribe.py
consecutive = torch.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0]
consecutive.add_(1)
if len(consecutive) > 0:
    # if the output contains two consecutive timestamp tokens
    slices = consecutive.tolist()
    if single_timestamp_ending:
        slices.append(len(tokens))

    last_slice = 0
    for current_slice in slices:
        sliced_tokens = tokens[last_slice:current_slice]
        start_timestamp_pos = (
            sliced_tokens[0].item() - tokenizer.timestamp_begin
        )
        end_timestamp_pos = (
            sliced_tokens[-1].item() - tokenizer.timestamp_begin
        )
        current_segments.append(
            new_segment(
                start=time_offset + start_timestamp_pos * time_precision,
                end=time_offset + end_timestamp_pos * time_precision,
                tokens=sliced_tokens,
                result=result,
            )
        )
        last_slice = current_slice

    if single_timestamp_ending:
        # single timestamp at the end means no speech after the last timestamp.
        seek += segment_size
    else:
        # otherwise, ignore the unfinished segment and seek to the last timestamp
        last_timestamp_pos = (
            tokens[last_slice - 1].item() - tokenizer.timestamp_begin
        )
        seek += last_timestamp_pos * input_stride
else:
    duration = segment_duration
    timestamps = tokens[timestamp_tokens.nonzero().flatten()]
    if (
        len(timestamps) > 0
        and timestamps[-1].item() != tokenizer.timestamp_begin
    ):
        # no consecutive timestamps but it has a timestamp; use the last one.
        last_timestamp_pos = (
            timestamps[-1].item() - tokenizer.timestamp_begin
        )
        duration = last_timestamp_pos * time_precision

    current_segments.append(
        new_segment(
            start=time_offset,
            end=time_offset + duration,
            tokens=tokens,
            result=result,
        )
    )
    seek += segment_size
https://github.com/openai/whisper/blob/main/whisper/decoding.py
class ApplyTimestampRules(LogitFilter):
    def __init__(
        self, tokenizer: Tokenizer, sample_begin: int, max_initial_timestamp_index: Optional[int]
    ):
        self.tokenizer = tokenizer
        self.sample_begin = sample_begin
        self.max_initial_timestamp_index = max_initial_timestamp_index

    def apply(self, logits: Tensor, tokens: Tensor):
        # suppress <|notimestamps|> which is handled by without_timestamps
        if self.tokenizer.no_timestamps is not None:
            logits[:, self.tokenizer.no_timestamps] = -np.inf

        # timestamps have to appear in pairs, except directly before EOT; mask logits accordingly
        for k in range(tokens.shape[0]):
            seq = [t for t in tokens[k, self.sample_begin :].tolist()]
            last_was_timestamp = len(seq) >= 1 and seq[-1] >= self.tokenizer.timestamp_begin
            penultimate_was_timestamp = len(seq) < 2 or seq[-2] >= self.tokenizer.timestamp_begin

            if last_was_timestamp:
                if penultimate_was_timestamp:  # has to be non-timestamp
                    logits[k, self.tokenizer.timestamp_begin :] = -np.inf
                else:  # cannot be normal text tokens
                    logits[k, : self.tokenizer.eot] = -np.inf

        # apply the `max_initial_timestamp` option
        if tokens.shape[1] == self.sample_begin and self.max_initial_timestamp_index is not None:
            last_allowed = self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
            logits[:, last_allowed + 1 :] = -np.inf

        # if sum of probability over timestamps is above any other token, sample timestamp
        logprobs = F.log_softmax(logits.float(), dim=-1)
        for k in range(tokens.shape[0]):
            timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(dim=-1)
            max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
            if timestamp_logprob > max_text_token_logprob:
                logits[k, : self.tokenizer.timestamp_begin] = -np.inf
I had this happen yesterday too; I think it was in a 48-hour audiobook I was doing. It just happened again today with a 10-hour audiobook. What happens is that one line of text stays on screen constantly for 32 minutes, and the new subtitles (or whatever part of them fits) show just above it.
Obviously, to correct this, just change
05:14:42.820 --> 05:46:50.320
to 05:14:42.820 --> 05:14:50.320
05:46:50.320 --> 05:14:50.420
to 05:14:50.320 --> 05:14:50.420
In both cases it's just a matter of changing xx:46:xx.xxx to xx:14:xx.xxx.
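The hand edit above could be roughed out in code. This is only a sketch under stated assumptions: cues are already parsed into `(start_ms, end_ms, text)` tuples, a broken run starts at a cue longer than `max_duration_ms` and ends at the first following cue whose end is earlier than its start, and the boundaries inside the run are re-spaced in proportion to text length. The proportional spacing is a guess and not what was done by hand above, and `repair_cues`, `max_duration_ms` and `max_run` are made-up names.

```python
def repair_cues(cues, max_duration_ms=120_000, max_run=10):
    """Return a copy of cues with runaway timestamps re-spaced.

    cues is a list of (start_ms, end_ms, text) tuples. A run is repaired only
    when it matches the pattern seen in this thread; anything else is left
    untouched for manual review.
    """
    fixed = list(cues)
    i = 0
    while i < len(fixed):
        start, end, _ = fixed[i]
        if end - start <= max_duration_ms:
            i += 1
            continue
        # Look ahead for the cue that jumps back in time (its end < its start).
        j = next(
            (k for k in range(i + 1, min(i + max_run, len(fixed)))
             if fixed[k][1] < fixed[k][0]),
            None,
        )
        if j is None or not (0 < fixed[j][1] - start <= max_duration_ms):
            i += 1  # pattern not recognised, leave this cue alone
            continue
        run = fixed[i:j + 1]
        run_end = fixed[j][1]
        span = run_end - start
        total = sum(max(len(text), 1) for _, _, text in run)
        cursor, seen = start, 0
        for offset, (_, _, text) in enumerate(run):
            seen += max(len(text), 1)
            is_last = offset == len(run) - 1
            # Assumption: share the span between cues by character count.
            new_end = run_end if is_last else start + span * seen // total
            fixed[i + offset] = (cursor, new_end, text)
            cursor = new_end
        i = j + 1
    return fixed
```

The first good start and the last good end of each run are kept as they are, so only the boundaries that were obviously wrong get moved. Results would still need a quick look, since the real boundary could be anywhere inside the span.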
My current command, which pipes WAV into whisper.cpp with max length 78 and split on word:
for f in *.opus ; do ffmpeg -i "$f" -f wav -ar 16000 -ac 1 - | ~/whisper/whisper.cpp/./main -m ~/whisper/whisper.cpp/models/ggml-medium.en.bin - -ovtt -of "$f" -l en -ml 78 -sow -t 8 ; for f in *.vtt ; do sed -r -i .bak -e 's|Yellow|yellow|g' -e 's|blue|Blue|g' -e 's|Pink|pink|g' "$f" ; done && for i in *opus.vtt ; do mv -i -- "$i" "$(printf '%s\n' "$i" | sed '1s/.opus.vtt/.vtt/')" ; mkdir vttsubs/ ; mv *.vtt vttsubs/ ; done && rm *.bak ; done
I'll try to figure out an awk script that can automatically flag any subtitle line whose duration exceeds, say, 2 minutes.
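A check like that could look roughly like the following. It is a sketch in Python rather than awk, and it assumes the timing lines look like the ones quoted in this thread (`HH:MM:SS.mmm --> HH:MM:SS.mmm`, optionally followed by text). The names `find_suspect_cues` and `max_duration_ms` are made up for illustration. Besides cues longer than the limit, it also flags cues that end before they start, since those show up right after each runaway cue in the examples.

```python
import re
import sys

TIMESTAMP = r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})"
CUE_RE = re.compile(rf"{TIMESTAMP}\s+-->\s+{TIMESTAMP}")


def to_ms(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to integer milliseconds."""
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def find_suspect_cues(lines, max_duration_ms=120_000):
    """Yield (line_number, timing_line, duration_ms) for cues that run longer
    than max_duration_ms or that end before they start."""
    for number, line in enumerate(lines, start=1):
        match = CUE_RE.match(line.strip())
        if not match:
            continue
        parts = match.groups()
        duration = to_ms(*parts[4:]) - to_ms(*parts[:4])
        if duration > max_duration_ms or duration < 0:
            yield number, line.strip(), duration


if __name__ == "__main__":
    # Usage: python this_script.py vttsubs/*.vtt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for number, timing, duration in find_suspect_cues(handle):
                print(f"{path}: line {number}: {timing} ({duration / 1000:+.3f}s)")
```

Run over the folder of finished .vtt files, it prints one line per suspicious cue with the file name and line number, so only those spots need to be opened and fixed by hand.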