m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Transcribe numbers literally? #300

Open Thresher12 opened 1 year ago

Thresher12 commented 1 year ago

I am doing some dataset collection for voice cloning work, and as I understand it (correct me if I'm wrong), numbers in the transcript should be transcribed as words rather than as the digits whisperX produces by default.

In regular Whisper I can do this with the following code:

    from whisper.tokenizer import get_tokenizer

    # encourage the model to transcribe numbers as words rather than digits
    tokenizer = get_tokenizer(multilingual=True)  # use multilingual=True if using a multilingual model
    number_tokens = [
        i
        for i in range(tokenizer.eot)
        if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
    ]

    # model is a previously loaded whisper model; loadedAudioFile is the audio input
    result = model.transcribe(loadedAudioFile, suppress_tokens=[-1] + number_tokens)

But this doesn't work for whisperX. Is there a way to do this in whisperX? Note that a straightforward post-transcription conversion package is not optimal, since it will not convert ordinals like "13th" properly, and may not correctly convert a number like 2022, which can be spoken in different ways such as "two thousand twenty two" or "twenty twenty two".

m-bain commented 1 year ago

Oh, I didn't think of token suppression, that's big brain, nice idea. You can pass in suppress_tokens when loading the model:

import whisperx
model = whisperx.load_model("medium.en", asr_options={"suppress_tokens": number_tokens})
model.transcribe...
Thresher12 commented 1 year ago

> Oh, I didn't think of token suppression, that's big brain, nice idea. You can pass in suppress_tokens when loading the model:
>
> import whisperx
> model = whisperx.load_model("medium.en", asr_options={"suppress_tokens": number_tokens})
> model.transcribe...

So uh... can you dumb it down for me a little bit? How do you represent number_tokens? My previous whisper definition does not work, and the below does not seem to work either:


import os
import whisperx

device = "cuda"
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
number_tokens = ['0','1','2','3','4','5','6','7','8','9']  # characters, not token IDs

# list files in directory
filelist = os.listdir('./Audio/')
model = whisperx.load_model("large-v2", device, compute_type=compute_type, language='en', asr_options={"suppress_tokens": number_tokens})
for entry in filelist:
    SpeechTranscriber.SpeechTranscriber(entry, model)  # SpeechTranscriber is my own wrapper module

EDIT: So I dug through the code, and maybe my python-fu isn't that great, but while I can see the suppress_tokens argument, I don't see where the functionality is. You just pass a value to it, and I don't see where it does anything.

AdolfVonKleist commented 1 year ago

@Thresher12 I think this option has been accidentally suppressed in WhisperX. I was considering creating a pull request for it, but if you want to hack this behaviour in, you need to modify the code in the following places:

to something like the below. Note this is just a hacky example; I'm also stripping other punctuation, because things like $ and % are often transcribed as $$$ or similar even though the verbalization in the audio might be "fifty five dollars". Also note that the below is only applied when the language (and tokenizer) have been specified; you'd need to make some other changes to make it more general:

    suppress = []
    if language is not None:
        tokenizer = faster_whisper.tokenizer.Tokenizer(
            model.hf_tokenizer,
            model.model.is_multilingual,
            task=task, language=language
        )
        suppress = [
            i for i in range(tokenizer.eot)
            if all(c in "0123456789@#%&*+=_$:-.,?!"
                   for c in tokenizer.decode([i]).removeprefix(" "))
        ]
    else:
        print("No language specified, language will be first be detected for each audio file (increases inference time).")
        tokenizer = None

    default_asr_options =  {
        "beam_size": 5,
        "best_of": 5,
        "patience": 1,
        "length_penalty": 1,
        "temperatures": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
        "compression_ratio_threshold": 2.4,
        "log_prob_threshold": -1.0,
        "no_speech_threshold": 0.6,
        "condition_on_previous_text": False,
        "initial_prompt": None,
        "prefix": None,
        "suppress_blank": True,
        "suppress_tokens": [-1] + suppress,    # <-------- Important!!!
        "without_timestamps": True,
        "max_initial_timestamp": 0.0,
        "word_timestamps": False,
        "prepend_punctuations": "",
        "append_punctuations": ""
    }

Finally, you will also need to uncomment this line:

After making these changes, all transcriptions should provide alternative transcriptions with appropriate verbalization. I found this extremely useful for creating new corpora for alternative system training or adaptation. It can also be used to train a dedicated 'denormalization' model, if you decode a large enough corpus in both modes.

I did not notice any significant degradation in accuracy, and if you use this with the word alignment module as a 'check', you can filter out low-confidence results as well. Hope it helps.

byteneumann commented 1 year ago

I think the modification of default_asr_options is not required. Passing asr_options works because line 74 takes the passed dict.

The second patch, on the other hand, was the solution for me! Thanks for the hint, @AdolfVonKleist.

Maybe it would be sufficient to uncomment all lines referring to options, i.e. length_penalty, suppress_blank, suppress_tokens, and max_initial_timestamp_index (calculated from an optional value)?

AdolfVonKleist commented 1 year ago

@byteneumann great!

> I think the modification of default_asr_options is not required. Passing asr_options works because line 74 takes the passed dict.

Are you sure about this? I probably missed something here, but I couldn't find where the suppress_tokens key-val pair is actually passed to the asr_options dict. It seems to be omitted:

Also, if you use the suppress tokens option directly, I think you have to provide/compute the token IDs rather than just the target characters themselves.
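
For illustration, here is a minimal sketch of computing those token IDs with faster-whisper's tokenizer, following the attribute layout used in the patch above. The character set, model name, and device are placeholders, and this is not an official whisperX API:

from faster_whisper import WhisperModel
from faster_whisper.tokenizer import Tokenizer

# Sketch only: build a tokenizer for a faster-whisper model and collect the IDs
# of every token made up entirely of characters we want to suppress.
fw_model = WhisperModel("large-v2", device="cuda", compute_type="float16")
tokenizer = Tokenizer(
    fw_model.hf_tokenizer,
    fw_model.model.is_multilingual,
    task="transcribe",
    language="en",
)

chars_to_suppress = "0123456789%$£"  # placeholder character set
number_tokens = [
    i for i in range(tokenizer.eot)
    if all(c in chars_to_suppress for c in tokenizer.decode([i]).removeprefix(" "))
]

# suppress_tokens expects these IDs (usually with -1 prepended to keep the
# default non-speech suppression), e.g.:
# whisperx.load_model(..., asr_options={"suppress_tokens": [-1] + number_tokens})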

I agree that maybe all of these options could be re-exposed, but maybe there was a reason for commenting them out initially?

m-bain commented 1 year ago

Added this feature in the newest commit (https://github.com/m-bain/whisperX/commit/0c84c26d9272fd8dc41af7945b8e9ee6a8820462), and it works well. I changed the logic to this, though:

def find_numeral_symbol_tokens(tokenizer):
    # collect the IDs of all tokens whose decoded text contains a digit or %, $, £
    numeral_symbol_tokens = []
    for i in range(tokenizer.eot):
        token = tokenizer.decode([i]).removeprefix(" ")
        has_numeral_symbol = any(c in "0123456789%$£" for c in token)
        if has_numeral_symbol:
            numeral_symbol_tokens.append(i)
    return numeral_symbol_tokens

Please play around with different suppression logic for numbers/symbols and let me know what you find most successful.
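
For later readers, a minimal usage sketch of the new option, assuming the suppress_numerals flag from the commit above and a placeholder audio file name:

import whisperx

device = "cuda"
model = whisperx.load_model(
    "large-v2", device, compute_type="float16", language="en",
    asr_options={"suppress_numerals": True},  # option added in the commit above
)

audio = whisperx.load_audio("example.wav")  # placeholder file name
result = model.transcribe(audio)
print(result["segments"])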

AdolfVonKleist commented 1 year ago

@m-bain awesome! My only suggestion would be to make the character selection "0123456789%$£" available to the user directly, so that users can easily supply a list of characters to suppress. My use case might be a bit different, but for instance, I am suppressing most common punctuation as well (including ., since it makes a mess of things like spoken email addresses and has too many ambiguous verbalizations for my purposes), and then resegmenting based on the alignment results. This results in overlong segments for general usage, since sentence segmentation is left with little information, but maybe it would be handy to just be able to supply that list of characters? In any case, thanks for the quick update and addition!

m-bain commented 1 year ago

@AdolfVonKleist makes sense, I'll add the custom character suppression. I tried suppressing with . as well, but then it tends not to punctuate the text, which makes it difficult to split the text into sentences and such.

AdolfVonKleist commented 1 year ago

> then it tends not to punctuate the text

@m-bain yeah, this is what I meant by "...results in overlong segments". However, my use case is focused on automatic generation of training corpora from un-annotated recordings. I'm using the word confidence scores to extract "reliable" 2s-20s segments from long stereo conversations, to quickly generate traditional training corpora, so I don't use the default segmentation, but rather create my own from a combination of the alignment-generated confidence scores and inter-word silences.
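
As a rough sketch of that idea (not code from this thread): the word dicts produced by whisperx.align carry start/end times and a confidence score, which can be thresholded and split on long inter-word silences. The function name and all thresholds below are illustrative:

# Illustrative only: group aligned words into "reliable" segments, dropping
# low-confidence words and splitting on long inter-word silences. Assumes word
# dicts like {"word": ..., "start": ..., "end": ..., "score": ...} as produced
# by whisperx.align.
def extract_reliable_segments(words, min_score=0.8, max_gap=0.6,
                              min_len=2.0, max_len=20.0):
    runs, current = [], []
    for w in words:
        # words whisperx could not align may lack timestamps; treat them,
        # and low-confidence words, as segment boundaries
        if "start" not in w or w.get("score", 0.0) < min_score:
            if current:
                runs.append(current)
            current = []
            continue
        if current and w["start"] - current[-1]["end"] > max_gap:
            runs.append(current)  # long inter-word silence: close the run
            current = []
        current.append(w)
    if current:
        runs.append(current)
    # keep only runs whose duration falls in the 2s-20s target range
    return [r for r in runs if min_len <= r[-1]["end"] - r[0]["start"] <= max_len]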

Two further anecdotal observations, which are probably already obvious but were not so obvious to me at first:

arnavmehta7 commented 1 year ago

With this recent suppress_numerals commit, the number of hallucinations has significantly improved in my testing.

thisisandreeeee commented 1 year ago

What happens if I actually want numbers transcribed as digits instead of words?

RomanLeo2003 commented 5 months ago

> With this recent suppress_numerals commit, the number of hallucinations has significantly improved in my testing.

Did you mean the number of hallucinations decreased or increased?

arnavmehta7 commented 5 months ago

Sorry, I meant increased.

timotheecour commented 4 months ago

Looks like this was addressed via https://github.com/m-bain/whisperX/pull/303.