meronym / speaker-transcription

Transcription with speaker diarization pipeline
MIT License

Allow specifying initial_prompt for transcription #2

Closed johnislarry closed 1 year ago

johnislarry commented 1 year ago

The initial_prompt passed to whisper for each audio segment is hardcoded to None:

https://github.com/meronym/speaker-transcription/blob/master/predict.py#L105

It would be really great if we could provide a prompt via the Replicate API input to use for all transcriptions.

I've seen that transcribing audio with proper nouns (in particular names) doesn't work well unless a prompt is provided that spells them out.

What do you think? If you're busy I could put up a PR, though I don't have a good setup to test an actual inference.
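
For illustration, here's a rough sketch of what exposing the prompt through the predictor could look like, assuming the pipeline uses the cog framework and openai-whisper's `transcribe()` (this is not the actual predict.py or the PR, and the diarization step is omitted):

```python
# Hypothetical sketch only -- not the repo's predict.py or the PR.
# Assumes the cog framework and openai-whisper; diarization is omitted.
from cog import BasePredictor, Input, Path
import whisper


class Predictor(BasePredictor):
    def setup(self):
        # Model size is illustrative; the pipeline may load a different checkpoint.
        self.model = whisper.load_model("medium")

    def predict(
        self,
        audio: Path = Input(description="Audio file to transcribe"),
        initial_prompt: str = Input(
            description="Optional text used to bias the transcription, "
            "e.g. the spelling of proper nouns",
            default=None,
        ),
    ) -> str:
        # openai-whisper uses initial_prompt as context for the first window
        # of each transcribe() call.
        result = self.model.transcribe(str(audio), initial_prompt=initial_prompt)
        return result["text"]
```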

meronym commented 1 year ago

@johnislarry a PR would be great! I'll test it and push an update to Replicate over the weekend.

johnislarry commented 1 year ago

@meronym just put up a PR!

https://github.com/meronym/speaker-transcription/pull/3

Let me know what you think when you get a chance

arnab commented 1 year ago

I just came to ask for the same feature (mechanism to pass an input_prompt to whisper). Thanks for creating the issue and the PR, @johnislarry.

meronym commented 1 year ago

Merged #3 :tada:

I'll add a note here FYI @johnislarry @arnab - the mechanism by which Whisper handles the input prompt internally is to use it as context for the first transcription frame. This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

If this becomes a limitation, we could look into splitting large segments in smaller chunks (to make sure the input prompt is injected often enough), but this might impact in-paragraph coherence. Feel free to open another issue if you find a better way to deal with the input_prompt.
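
For reference, the chunking idea could look roughly like the sketch below. This is only an illustration under assumptions: openai-whisper, a 16 kHz mono waveform array, and made-up function/parameter names, not this pipeline's actual code.

```python
# Hypothetical sketch of re-injecting the prompt by splitting long segments
# into smaller chunks. Assumes openai-whisper and a 16 kHz mono waveform array;
# function and parameter names are illustrative, not this repo's code.
import whisper

SAMPLE_RATE = whisper.audio.SAMPLE_RATE  # 16 kHz


def transcribe_segment_chunked(model, audio, seg_start, seg_end,
                               initial_prompt=None, max_chunk_s=30.0):
    """Transcribe the [seg_start, seg_end] span in chunks of at most
    max_chunk_s seconds, passing initial_prompt to every chunk."""
    texts = []
    t = seg_start
    while t < seg_end:
        chunk_end = min(t + max_chunk_s, seg_end)
        chunk = audio[int(t * SAMPLE_RATE):int(chunk_end * SAMPLE_RATE)]
        result = model.transcribe(chunk, initial_prompt=initial_prompt)
        texts.append(result["text"].strip())
        t = chunk_end
    # Joining chunk texts is what may hurt in-paragraph coherence, as noted above.
    return " ".join(texts)
```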

In any case, I suspect most out-of-the-box Whisper implementations probably suffer from the same 'vanishing attention' problem with the prompt.

arnab commented 1 year ago

> This pipeline runs an independent transcription for each detected speaker segment, so for long compact speech segments I would expect the attention paid by the model to this prompt to diminish quite a bit by the end of the segment.

Thanks for the details. Does this pipeline feed the previous segment's transcript as context/prompt for the subsequent ones? If not, do you think doing something like that would improve the transcription quality of long-form audio?
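
For concreteness, a rough sketch of what I have in mind (made-up names, assuming openai-whisper and one waveform array per diarized segment):

```python
# Hypothetical sketch of the "previous segment as context" idea -- names and
# the segment structure are made up; assumes openai-whisper.
def transcribe_with_rolling_context(model, segments, user_prompt=None,
                                    max_context_chars=400):
    """segments: iterable of audio arrays, one per diarized speaker segment."""
    context = user_prompt or ""
    outputs = []
    for segment_audio in segments:
        result = model.transcribe(segment_audio, initial_prompt=context or None)
        text = result["text"].strip()
        outputs.append(text)
        # Keep only the tail of the accumulated transcript so the prompt stays
        # short; Whisper only conditions on roughly the last 224 prompt tokens.
        context = ((user_prompt or "") + " " + text)[-max_context_chars:]
    return outputs
```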