k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

A few domain-related terms are not transcribed correctly in whisper-triton. #649

Open krishnardt opened 1 month ago

krishnardt commented 1 month ago

Hi,

The model is not able to transcribe a few words properly even though they are spelled normally. For example, "Atomberg" is transcribed as "Atombuck". Is there any way to correct such cases while transcribing through whisper-triton?

I tried to add custom tokens to the tokenizer (tiktoken) by modifying its tokenizer.py code as in the screenshot below, without disturbing the rest of the flow, but I am getting worse output than without the custom tokens.

(screenshot of the modified tokenizer.py code)
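For reference, the extension pattern documented in the tiktoken README is to build a new `Encoding` with extra `special_tokens` entries rather than editing tokenizer.py in place. A minimal sketch follows; the base encoding name and the hotword are illustrative, and note that the Whisper checkpoint has no trained embedding for a newly added token id, which is a plausible reason the output got worse:

```python
import tiktoken

# Start from an existing encoding. Whisper ships its own vocabulary;
# "cl100k_base" here is purely illustrative.
base = tiktoken.get_encoding("cl100k_base")

# The documented tiktoken extension pattern: a new Encoding that reuses
# the base merge ranks and adds extra special tokens on top.
extended = tiktoken.Encoding(
    name="base_with_hotword",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        # A brand-new token id: the Whisper decoder was never trained on it,
        # so it has no meaningful embedding and decoding quality will suffer.
        "Atomberg": base.n_vocab,
    },
)

# The hotword now encodes to a single (untrained) token id.
print(extended.encode("Atomberg", allowed_special={"Atomberg"}))
```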

I followed K2 Sherpa's approach to generate the model and ran the Triton server.

Can someone guide me on how to resolve this issue?

csukuangfj commented 1 month ago

@yuekaizhang Could you have a look?

yuekaizhang commented 1 month ago

Hi @krishnardt, whisper-triton is an accelerated serving solution; it can't improve Whisper's accuracy. If you can't get correct results using the PyTorch Whisper implementation, whisper-triton can't help either.

yuekaizhang commented 1 month ago

Try <|startofprev|>Hotwords: Atomberg<|startoftranscript|><|en|><|transcribe|><|notimestamps|> as the text prefix to see if it works.
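For anyone landing here, a minimal sketch of assembling such a prefix for a list of hotwords (plain string building; the helper name is illustrative, and how the resulting string is passed to the Triton server depends on your client code, so that part is not shown):

```python
# Assemble a Whisper text prefix that lists hotwords in the
# <|startofprev|> prompt segment, followed by the usual task tokens.
def build_text_prefix(hotwords, language="en"):
    return (
        "<|startofprev|>Hotwords: " + ", ".join(hotwords)
        + f"<|startoftranscript|><|{language}|><|transcribe|><|notimestamps|>"
    )

print(build_text_prefix(["Atomberg"]))
# <|startofprev|>Hotwords: Atomberg<|startoftranscript|><|en|><|transcribe|><|notimestamps|>
```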

krishnardt commented 1 month ago

@yuekaizhang I had edited in the wrong place.

Now I get the correct output.

I have a few other hotwords and added them as comma-separated values. It is working fine.

But won't it increase the latency? Is there any way to add hotwords when starting the server instead of at inference time? Currently, the model accepts only 30 seconds of audio per request.

I tried with 2 minutes of audio, which means 4 requests; the prefix would be added to each of them, and if the hotwords list is bigger, it may increase the latency.

This is what I am thinking. Please correct me if I am wrong.
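A back-of-the-envelope sketch of that concern (the token counts are illustrative; also note that, at least in the openai-whisper implementation, the decoder's text context is 448 tokens and the <|startofprev|> prompt is trimmed to roughly half of it, so a very long hotword list risks truncation as well as latency):

```python
import math

# Illustrative numbers only: every 30 s chunk is a separate request that
# re-sends the same text prefix, so prefix tokens scale with
# (number of chunks) x (hotword-list length).
CHUNK_SECONDS = 30

def prefix_cost(audio_seconds, hotword_tokens, fixed_prefix_tokens=10):
    """Rough count of extra decoder tokens spent on the prefix overall."""
    n_requests = math.ceil(audio_seconds / CHUNK_SECONDS)
    per_request = fixed_prefix_tokens + hotword_tokens
    return n_requests, n_requests * per_request

# 2 minutes of audio -> 4 requests, each carrying the hotword prefix.
print(prefix_cost(audio_seconds=120, hotword_tokens=50))  # (4, 240)
```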