Closed: boxabirds closed this issue 1 year ago.
Alternative invocation below also fails with the same error. I'm assuming this is because there's repo setup / config that I'm missing.
result = whisper.transcribe_timestamped(model, audio, language="en")
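For reference, a minimal end-to-end call following the usage shown in the whisper_timestamped README looks roughly like this (the audio path is just a placeholder); transcribe() expects a loaded model and loaded audio, not file paths:

import whisper_timestamped as whisper

# load the audio and the model first, then pass both objects to transcribe()
audio = whisper.load_audio("AUDIO.wav")           # placeholder path
model = whisper.load_model("tiny", device="cpu")
result = whisper.transcribe(model, audio, language="en")
print(result["text"])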
It's hard to understand what's happening there. Is this what is failing?
import whisper
whisper.load_model("tiny")
Apologies that I wasn't clear. The line that throws this error is
result = whisper.transcribe(model, audio, language="en")
Attached is a screenshot of the error
whisper.load_model("tiny")
returns the following model:
Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (1): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (2): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (3): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TextDecoder(
    (token_embedding): Embedding(51865, 384)
    (blocks): ModuleList(
      (0): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (cross_attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (cross_attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (1): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (cross_attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (cross_attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (2): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (cross_attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (cross_attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
      (3): ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (cross_attn): MultiHeadAttention(
          (query): Linear(in_features=384, out_features=384, bias=True)
          (key): Linear(in_features=384, out_features=384, bias=False)
          (value): Linear(in_features=384, out_features=384, bias=True)
          (out): Linear(in_features=384, out_features=384, bias=True)
        )
        (cross_attn_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=1536, out_features=384, bias=True)
        )
        (mlp_ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
  )
)
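That dump is consistent with the tiny multilingual checkpoint (4 encoder and 4 decoder blocks, 384-dimensional states, a 51865-token vocabulary). Assuming the object is the standard openai-whisper Whisper module, a quick way to confirm which checkpoint was actually loaded is:

import whisper

model = whisper.load_model("tiny")
print(model.dims)             # ModelDimensions(..., n_audio_state=384, n_vocab=51865, ...)
print(model.is_multilingual)  # True for "tiny"; the "*.en" checkpoints are English-only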
Having the error stack is a bit more informative, but I don't see what the function transcribe_audio on line 31 of poc.py is doing. It's strange that the stack trace doesn't go any deeper, down to the real culprit.
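If the screenshot is cutting the traceback short, one generic way to surface the full chain of frames (a sketch that reuses the model and audio objects from above, not anything specific to this repo) is to wrap the failing call and print the exception explicitly:

import traceback

try:
    result = whisper.transcribe(model, audio, language="en")
except Exception:
    # prints every frame, including the library internals, so the real culprit is visible
    traceback.print_exc()
    raise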
@boxabirds any news on this?
What happens if you use the package whisper instead of whisper_timestamped?
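For reference, the equivalent check with the upstream openai-whisper package would look something like this (upstream, transcribe is also available as a method on the model and accepts a file path directly; the path below is a placeholder):

import whisper  # the openai-whisper package, not whisper_timestamped

model = whisper.load_model("tiny")
result = model.transcribe("audio.wav", language="en")  # placeholder path
print(result["text"])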
Apologies, I’m travelling — I’ll try to get back to you in the next few days.
Closing. Feel free to re-open if a problem specific to whisper_timestamped (i.e. one that does not occur with whisper) arises.
Hi, the sample code below produces the subsequent error -- what am I doing wrong?