Thresher12 opened 1 year ago
Oh, I didn't think of token suppression, that's big brain, nice idea. You can pass in `suppress_tokens` when loading the model:
```python
import whisperx

model = whisperx.load_model("medium.en", asr_options={"suppress_tokens": number_tokens})
model.transcribe...
```
> Oh, I didn't think of token suppression, that's big brain, nice idea. You can pass in `suppress_tokens` when loading the model:
>
> ```python
> import whisperx
> model = whisperx.load_model("medium.en", asr_options={"suppress_tokens": number_tokens})
> model.transcribe...
> ```
So uh... can you dumb it down for me a little bit? How do you represent `number_tokens`? My previous whisper definition does not work, and the code below does not seem to work either.
```python
import os
import whisperx

device = "cuda"
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
number_tokens = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# list files in directory
filelist = os.listdir('./Audio/')

model = whisperx.load_model("large-v2", device, compute_type=compute_type, language='en',
                            asr_options={"suppress_tokens": number_tokens})
for entry in filelist:
    # SpeechTranscriber is a user-defined helper (not shown)
    SpeechTranscriber.SpeechTranscriber(entry, model)
```
EDIT: So I dug through the code, and maybe my python-fu isn't that great, but while I can see the `suppress_tokens` argument, I don't see where the functionality is. You just pass a value to it, and I don't see where it does anything.
@Thresher12 I think this has been accidentally disabled in WhisperX. I was considering creating a pull request for it, but if you want to hack this behaviour in yourself, you need to modify the code in the following places. First, change the construction of `default_asr_options` to something like the below. Note this is just a hacky example: I'm also stripping other punctuation, because characters like `$` and `%` often end up in the transcript as `$$$` or similar even though the verbalization in the audio might be "fifty five dollars". Also note that the below is only applied when the language (and tokenizer) have been specified; you'd need to make some other changes to make it more general:
```python
suppress = []
if language is not None:
    tokenizer = faster_whisper.tokenizer.Tokenizer(
        model.hf_tokenizer,
        model.model.is_multilingual,
        task=task, language=language,
    )
    # Suppress every token whose decoded text consists solely of
    # digits and common symbols/punctuation.
    suppress = [
        i for i in range(tokenizer.eot)
        if all(c in "0123456789@#%&*+=_$:-.,?!"
               for c in tokenizer.decode([i]).removeprefix(" "))
    ]
else:
    print("No language specified, language will be detected for each audio file first (increases inference time).")
    tokenizer = None
```
```python
default_asr_options = {
    "beam_size": 5,
    "best_of": 5,
    "patience": 1,
    "length_penalty": 1,
    "temperatures": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "compression_ratio_threshold": 2.4,
    "log_prob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "condition_on_previous_text": False,
    "initial_prompt": None,
    "prefix": None,
    "suppress_blank": True,
    "suppress_tokens": [-1] + suppress,  # <-------- Important!!!
    "without_timestamps": True,
    "max_initial_timestamp": 0.0,
    "word_timestamps": False,
    "prepend_punctuations": "",
    "append_punctuations": "",
}
```
Finally, you will also need to uncomment the `suppress_tokens` argument in this call:
```python
result = self.model.generate(
    encoder_output,
    [prompt] * batch_size,
    suppress_tokens=options.suppress_tokens,  # <--- Needs to be included, not commented out!
    # length_penalty=options.length_penalty,
    # max_length=self.max_length,
    # return_scores=True,
    # return_no_speech_prob=True,
    # suppress_blank=options.suppress_blank,
    # suppress_tokens=options.suppress_tokens,
    # max_initial_timestamp_index=max_initial_timestamp_index,
)
```
After making these changes, all transcriptions should come out with appropriate verbalization. I found this extremely useful for creating new corpora for alternative system training or adaptation. It can also be used to train a dedicated 'denormalization' model if you decode a large enough corpus in both modes (see the sketch below).
I did not notice any significant degradation in accuracy, and if you use this with the word alignment module as a 'check', you can filter out low-confidence results as well. Hope it helps.
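If you want to try the denormalization-corpus idea, the pairing step is simple once you have both decodes; a minimal sketch, assuming two whisperx-style result dicts with `segments` lists (one from a suppression-patched model, one from a stock model):

```python
def make_denorm_pairs(spoken_result, written_result):
    """Pair verbalized and written transcripts segment-by-segment.

    Assumes both decodes segmented the audio identically; in practice,
    align the two by timestamps rather than relying on zip().
    Each pair is (verbalized, written), e.g. ("fifty five dollars", "$55"),
    i.e. training data for a denormalization model.
    """
    return [
        (s["text"].strip(), w["text"].strip())
        for s, w in zip(spoken_result["segments"], written_result["segments"])
    ]
```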
I think the modification of `default_asr_options` is not required. Passing `asr_options` works because line 74 takes the passed dict.
The second patch, on the other hand, was the solution for me! Thanks for the hint @AdolfVonKleist.
Maybe it would be sufficient to uncomment all lines referring to `options`, i.e. `length_penalty`, `suppress_blank`, `suppress_tokens`, and `max_initial_timestamp_index` (calculated from an optional value)?
@byteneumann great!
> I think the modification of `default_asr_options` is not required. Passing `asr_options` works because line 74 takes the passed dict.
Are you sure about this? I probably missed something here, but I couldn't find where the `suppress_tokens` key-value pair is actually passed to the `asr_options` dict. It seems to be omitted.
Also, if you use the suppress tokens option directly, I think you have to provide/compute the token IDs rather than just the target characters themselves; e.g., something like the sketch below.
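To make that concrete: the character list from earlier (`['0', '1', ...]`) can't work as-is; you have to turn the characters into token IDs first. A minimal sketch, following the same faster-whisper tokenizer construction as in the patch above (`model` is assumed to be the loaded model, with the attribute names used there):

```python
import faster_whisper.tokenizer

# Build the tokenizer as in the patch above; the attribute names assume
# access to the underlying faster-whisper model, as in whisperx's code.
tokenizer = faster_whisper.tokenizer.Tokenizer(
    model.hf_tokenizer,
    model.model.is_multilingual,
    task="transcribe",
    language="en",
)

# Map characters to token IDs: keep every token whose decoded text
# consists solely of digits. These IDs (not the raw characters!) are
# what suppress_tokens expects.
number_token_ids = []
for i in range(tokenizer.eot):
    text = tokenizer.decode([i]).removeprefix(" ")
    if text and all(c in "0123456789" for c in text):
        number_token_ids.append(i)
```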
I agree that maybe all of these options could be re-exposed, but maybe there was a reason for commenting them out initially?
Added this feature in the newest commit (https://github.com/m-bain/whisperX/commit/0c84c26d9272fd8dc41af7945b8e9ee6a8820462); it works well. I changed the logic to this, though:
```python
def find_numeral_symbol_tokens(tokenizer):
    numeral_symbol_tokens = []
    for i in range(tokenizer.eot):
        token = tokenizer.decode([i]).removeprefix(" ")
        has_numeral_symbol = any(c in "0123456789%$£" for c in token)
        if has_numeral_symbol:
            numeral_symbol_tokens.append(i)
    return numeral_symbol_tokens
```
Please play around with different suppression logic for numbers / symbols and let me know what you find to be most successful.
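For reference, a minimal usage sketch of the new feature, assuming it is exposed through `asr_options` as `suppress_numerals` (the audio path is hypothetical):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("./Audio/example.wav")  # hypothetical path

# Default behaviour: numerals come out as digits/symbols ("$55").
model = whisperx.load_model("large-v2", device, compute_type="float16")
written = model.transcribe(audio, batch_size=16)

# With suppression: numeral/symbol tokens are excluded during decoding,
# so the same amount should come out verbalized ("fifty five dollars").
model_verbal = whisperx.load_model(
    "large-v2", device, compute_type="float16",
    asr_options={"suppress_numerals": True},
)
spoken = model_verbal.transcribe(audio, batch_size=16)
```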
@m-bain awesome! My only suggestion would be to make the character selection `"0123456789%$£"` available to the user directly, so that users can easily feed in a list of characters to suppress. My use case might be a bit different, but for instance, I am suppressing most common punctuation as well (including `.`, since it makes a mess of things like spoken email addresses and has too many ambiguous verbalizations for my purposes), and then resegmenting based on the alignment results. This results in overlong segments for general usage, since sentence segmentation is left with little information, but maybe it would be handy to just be able to supply that list of characters, e.g. as sketched below? In any case, thanks for the quick update and addition!
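For example, a hypothetical generalization of `find_numeral_symbol_tokens` that takes the character set as a parameter (just a sketch, not the actual whisperX API):

```python
def find_tokens_with_chars(tokenizer, chars="0123456789%$£"):
    # Same logic as find_numeral_symbol_tokens above, but with a
    # user-supplied character set, so punctuation like "." or "@"
    # can be suppressed as well.
    matching_tokens = []
    for i in range(tokenizer.eot):
        token = tokenizer.decode([i]).removeprefix(" ")
        if any(c in chars for c in token):
            matching_tokens.append(i)
    return matching_tokens
```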
@AdolfVonKleist makes sense, I'll add the custom character suppression. I tried suppressing with `.`, but then it tends to not punctuate the text, which proves difficult for splitting the text into sentences and such.
> then it tends to not punctuate the text
@m-bain yeah, this is what I meant by "...results in overlong segments". However, my use case is focused on automatic generation of training corpora from un-annotated recordings. I'm using the word confidence scores to extract "reliable" 2s-20s segments from long stereo conversations, to quickly generate traditional training corpora. So I don't use the default segmentation, but rather create my own from a combination of the alignment-generated confidence scores and inter-word silences.
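For anyone curious, the extraction step looks roughly like this; a sketch with made-up thresholds, assuming whisperx-style alignment output where each entry in `word_segments` has `word`, `start`, `end`, and a confidence `score` (unaligned words lack timestamps):

```python
def extract_reliable_segments(word_segments, min_len=2.0, max_len=20.0,
                              min_score=0.8, max_gap=0.5):
    """Greedily grow segments from aligned words; cut on long inter-word
    silences, low-confidence words, or the length cap. Thresholds are
    illustrative, not the actual values used."""
    segments, current = [], []

    def flush(current):
        # Keep the accumulated words only if the span is long enough.
        if current and current[-1]["end"] - current[0]["start"] >= min_len:
            segments.append((current[0]["start"], current[-1]["end"],
                             " ".join(w["word"] for w in current)))
        return []

    for w in word_segments:
        # Unaligned or low-confidence words act as segment boundaries.
        if "start" not in w or w.get("score", 0.0) < min_score:
            current = flush(current)
            continue
        # Cut on long silence gaps, or when adding the word would
        # push the segment past max_len.
        if current and (w["start"] - current[-1]["end"] > max_gap
                        or w["end"] - current[0]["start"] > max_len):
            current = flush(current)
        current.append(w)
    flush(current)
    return segments
```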
Two further anecdotal observations, which are probably already obvious but were not so obvious to me at first:
- Long recordings benefit considerably from `--batch_size=32` and `--compute_type=float16`.
- However, short utterances, including when they are provided as a list, don't benefit from this. Running a bunch of 15s IVR utterances through the system is still slow, since the chunks are not currently batched.
With this recent suppress_numerals commit, the number of hallucinations has significantly improved in my testing.
What happens if I actually want numbers transcribed as digits instead of words?
> With this recent suppress_numerals commit, the number of hallucinations has significantly improved in my testing.
Did you mean the number of hallucinations decreased or increased?
Sorry, I meant increased.
Looks like this was addressed via https://github.com/m-bain/whisperX/pull/303
I am doing some dataset collection for voice cloning work, and as I understand it (correct me if I'm wrong), numbers in the transcript should be transcribed as words rather than as the digits whisperx produces by default.
In regular whisper I can do this via this code:
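The exact snippet isn't reproduced above, but the widely shared vanilla-whisper version looks roughly like this; a sketch of that approach, not necessarily the exact code referred to:

```python
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("medium.en")
tokenizer = get_tokenizer(multilingual=False)  # True for multilingual models

# Collect every token that decodes to digits only, then suppress those IDs
# so the decoder is forced to verbalize numbers ("thirteenth", not "13th").
number_tokens = []
for i in range(tokenizer.eot):
    text = tokenizer.decode([i]).removeprefix(" ")
    if text and all(c in "0123456789" for c in text):
        number_tokens.append(i)

result = model.transcribe("audio.wav", suppress_tokens=[-1] + number_tokens)
```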
But this doesn't work for whisperx. Is there a way to do this in whisperx? Note that a straightforward post-transcription conversion package is not optimal, since it will not convert ordinals like "13th" properly, and may not convert a number like 2022 properly, which can be spoken in different ways, like "two thousand twenty two" or "twenty twenty two".