m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Can Hard Coded Hyperparameters be moved to a config file? #872

Closed morsczx closed 2 months ago

morsczx commented 2 months ago

Disclaimer: I'm new to GitHub and had a question. In whisperx/audio.py, the hyperparameters are hardcoded. Can or should we move these to a config file so that they can be edited as required, specifically `chunk_length`? If so, what file should be made? I can raise a PR based on suggestions.

```python
SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000 samples in a 30-second chunk
N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH)  # 3000 frames in a mel spectrogram input

N_SAMPLES_PER_TOKEN = HOP_LENGTH * 2  # the initial convolutions have stride 2
FRAMES_PER_SECOND = exact_div(SAMPLE_RATE, HOP_LENGTH)  # 10ms per audio frame
TOKENS_PER_SECOND = exact_div(SAMPLE_RATE, N_SAMPLES_PER_TOKEN)  # 20ms per audio token
```
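For context, a minimal self-contained sketch of how the chunk-dependent constants would follow from a configurable chunk length. The `exact_div` helper here mirrors the assumed behavior of the one used in whisperx/audio.py (integer division that asserts divisibility); `derived_constants` is a hypothetical helper, not part of whisperX:

```python
def exact_div(x, y):
    # Integer division that asserts there is no remainder
    # (assumed behavior of the helper used in whisperx/audio.py).
    assert x % y == 0
    return x // y

SAMPLE_RATE = 16000
HOP_LENGTH = 160

def derived_constants(chunk_length: int):
    """Recompute the chunk-dependent constants for a chunk length in seconds."""
    n_samples = chunk_length * SAMPLE_RATE
    n_frames = exact_div(n_samples, HOP_LENGTH)
    return n_samples, n_frames

# The default 30-second chunk reproduces the hardcoded values:
print(derived_constants(30))  # (480000, 3000)
# A hypothetical 10-second chunk, as in the shorter clips mentioned below:
print(derived_constants(10))  # (160000, 1000)
```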
kevdawg94 commented 2 months ago

@morsczx did you resolve this issue? I was also trying to fine-tune my segmentation performance and wondering whether chunk size would have an impact. Agree that these could be moved to a config file.

morsczx commented 2 months ago

@kevdawg94 - no, I did not resolve the issue; I thought I might have been wrong in the suggestion. I had a lot of audio clips with 10-second intervals of different languages, which were being incorrectly identified, hence I wanted control over the chunk size. Will take this up.

kevdawg94 commented 2 months ago

I found you can already adjust the chunk size as follows:

```python
transcription = model.transcribe(audio, batch_size=batch_size, chunk_size=chunk_size)
```
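To illustrate why a smaller `chunk_size` can help with short clips: a rough sketch of how many windows a clip is split into for a given chunk size. This is a simplified model (the real pipeline segments on VAD boundaries first), so `num_chunks` is a hypothetical helper for intuition only:

```python
import math

def num_chunks(duration_s: float, chunk_size_s: int) -> int:
    # Rough estimate: split the audio into fixed-length windows of
    # chunk_size_s seconds (simplified; ignores VAD-based segmentation).
    return math.ceil(duration_s / chunk_size_s)

# A 95-second clip with the default 30 s chunks vs 10 s chunks:
print(num_chunks(95, 30))  # 4
print(num_chunks(95, 10))  # 10
```

With 10-second chunks, each window is more likely to contain a single language, which is the behavior morsczx was after.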

morsczx commented 2 months ago

gotcha, thanks a lot!