Closed brennanmceachran closed 2 weeks ago
Even more reduced transcripts are below.
I believe the {0,4} in the regex pattern is the root cause of the issue. It introduces a problem with short, common words like "a." Instead of matching the first instance of "a" in remainingText, the regex can skip over the first "a" and match the second instance if it falls within the first few characters. This misalignment leads to incorrect truncation, as remainingText is sliced at the wrong location.
For example, in remainingText = "a man", when looking for the word "a", the regex matches the second "a" (in "man"), leaving remainingText = "n", which breaks the expected sequence and triggers a parsing error in the function.
const transcript = {
  task: "transcribe",
  text: "a man",
  words: [
    {
      start: 1,
      end: 2,
      word: "a",
    },
    {
      start: 2,
      end: 3,
      word: "man",
    },
  ],
  duration: 3,
  language: "english",
};
Or
const transcript = {
  task: "transcribe",
  text: "i mint",
  words: [
    {
      start: 1,
      end: 2,
      word: "i",
    },
    {
      start: 2,
      end: 3,
      word: "mint",
    },
  ],
  duration: 3,
  language: "english",
};
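To illustrate the skipping behavior described above, here is a minimal sketch. The library's actual pattern is not shown in this thread, so `consumeWord` and the shape of its regex (`^.{0,4}` followed by the word) are assumptions made purely for illustration. A greedy `{0,4}` consumes as many leading characters as it can before matching the word, so it lands on the later occurrence. A lazy `{0,4}?` stops at the first one.

```typescript
// Hypothetical helper, NOT the library's implementation.
// It removes `word` from the front of `remainingText`, tolerating up to
// four leading characters (for example whitespace or punctuation).
function consumeWord(
  remainingText: string,
  word: string,
  lazy: boolean,
): string {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Greedy `{0,4}` prefers the longest prefix; lazy `{0,4}?` the shortest.
  const quantifier = lazy ? "{0,4}?" : "{0,4}";
  const match = new RegExp("^." + quantifier + escaped).exec(remainingText);
  if (match === null) {
    throw new Error(`Could not find "${word}" in "${remainingText}"`);
  }
  // Slice off everything up to and including the matched word.
  return remainingText.slice(match[0].length);
}

// Greedy: the prefix swallows "a m", so the "a" inside "man" is matched.
const greedyRest = consumeWord("a man", "a", false); // "n"
// Lazy: the prefix is empty, so the first "a" is matched.
const lazyRest = consumeWord("a man", "a", true); // " man"
```

With the greedy variant, the next lookup for "man" runs against "n" and fails, which is consistent with the error described in this issue. Whether a lazy quantifier is the right fix for the real function is a separate question; this sketch only demonstrates the mechanism.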
Thanks for reporting!
🙏 thanks @JonnyBurger
Bug Report 🐛
While using openAiWhisperApiToCaptions, an error is thrown:
The error seems to result from remainingText containing a truncated version of the expected word, leading to mismatches during regex matching.
Reproduction
Reproducible with a truncated transcript received from OpenAI.