hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0
817 stars 44 forks

hezarai/whisper-small-fa Works very well but just for the first 10 seconds #129

Closed rajabit closed 5 months ago

rajabit commented 7 months ago

Hi, I was trying to use the speech recognition on a 1-hour audio file. It was working and detecting very well, but only for the first 10 seconds of the audio; it just abandoned the rest of the file.

arxyzan commented 7 months ago

Hi @rajabit, thanks for your feedback.

What do you mean by abandoning the rest of the file? Does it raise an error, or can it just not recognize the rest of the audio file (i.e., the output text is only the transcript of the first 10 seconds)?

Note: Keep in mind that the maximum output length for Whisper (all variants) is 448 tokens, which would mean somewhere between 3-5 minutes of audio based on my approximation.

rajabit commented 7 months ago

I mean, it works without errors, but for any audio file I test it only gives me a short part of it. Maybe it's related to the 448-token limit, but I'm pretty sure the text wasn't that long.

Anyway, thanks for the library and your support.

arxyzan commented 7 months ago

@rajabit So this seems like a different kind of bug: the model appears to suffer from an early-stopping issue during generation, which is likely due to short samples in the training set. I'll take a look at the training script and parameters to see whether the model has seen long samples in the dataset or not. I'll let you know soon.

arxyzan commented 7 months ago

@rajabit Forgot to point out that, generally, it's better to feed small audio chunks to speech recognition models (Whisper, Wav2Vec, etc.). Currently, there is no utility to do that in Hezar, but I'll open an issue regarding this and hope that we can add it soon. If you have had a similar experience or are willing to help out, just let me know. Thanks.

arxyzan commented 7 months ago

@rajabit Alright, I just checked the stats and noticed that the maximum audio length in Common Voice is 10 seconds, so you'd have to chunk your audio into 10-second parts and feed them in batches to the model. The original Whisper model's recommended length is up to 30 seconds. I think it's safe to close this issue and track the chunking feature in a separate issue.
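
For reference, a minimal sketch of what that could look like, using pydub's make_chunks for fixed 10-second pieces and the Model.load / predict usage from Hezar's README (the exact return format of predict is an assumption here and may need adapting):

from pydub import AudioSegment
from pydub.utils import make_chunks
from hezar.models import Model

whisper = Model.load("hezarai/whisper-small-fa")

def transcribe_long_audio(file_path, chunk_length_ms=10_000):
    audio = AudioSegment.from_file(file_path)
    texts = []
    for i, chunk in enumerate(make_chunks(audio, chunk_length_ms)):
        chunk_path = f"chunk_{i}.mp3"
        chunk.export(chunk_path, format="mp3")
        result = whisper.predict(chunk_path)  # transcribe one 10-second piece at a time
        texts.append(str(result))  # assumption: adjust to predict's actual output structure
    return " ".join(texts)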

rajabit commented 7 months ago

@arxyzan The day I was testing it, I tried to chunk the audio file, but when I did that the result wasn't useful. I don't know why; maybe because when the file is split, the voices are no longer clear.

arxyzan commented 7 months ago

@rajabit How did you chunk the audio files? You have to make sure that the chunks don't cut words in half. You can tackle this with silence detection techniques using pydub.

rajabit commented 7 months ago

I tried this one, but I'll try your suggestion. Thank you.

from pydub import AudioSegment
from pydub.utils import make_chunks

def process_audio(file_name):
    myaudio = AudioSegment.from_file(file_name, "mp3")
    chunk_length_ms = 10000  # pydub works in milliseconds, so this is 10 seconds
    chunks = make_chunks(myaudio, chunk_length_ms)  # fixed-length 10-second chunks
    string = ""
    for i, chunk in enumerate(chunks):
        chunk_name = './chunked/' + file_name + "_{0}.mp3".format(i)  # requires ./chunked to exist
        print("exporting", chunk_name)
        chunk.export(chunk_name, format="mp3")
        string += (" " + getText(chunk_name))  # getText: my transcription helper
    return string

arxyzan commented 7 months ago

@rajabit This will probably do the job:

from pathlib import Path

from pydub import AudioSegment
from pydub.silence import split_on_silence

def chunk_audio(file_path):
    audio = AudioSegment.from_file(file_path)
    # Split on silences longer than 1 second, treating anything below -40 dBFS as silence
    chunks = split_on_silence(audio, silence_thresh=-40, min_silence_len=1000)
    max_chunk_length = 10000  # 10 seconds, in milliseconds

    for i, chunk in enumerate(chunks):
        chunk = chunk[:max_chunk_length] if len(chunk) > max_chunk_length else chunk
        output_filename = f"{Path(file_path).stem}_chunk_{i}.mp3"
        chunk.export(output_filename, format="mp3")

# Example usage:
file_path = "your_audio_file.mp3"
chunk_audio(file_path)
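
To actually get a transcript out of those chunks, one could then run each exported file through the model, again assuming the Model.load / predict usage from Hezar's README (the output format of predict is an assumption and may need adapting):

import glob
import re

from hezar.models import Model

whisper = Model.load("hezarai/whisper-small-fa")

# Collect the exported chunks in numeric order and join their transcripts
chunk_files = sorted(
    glob.glob("your_audio_file_chunk_*.mp3"),
    key=lambda p: int(re.search(r"_chunk_(\d+)", p).group(1)),
)
transcript = " ".join(str(whisper.predict(f)) for f in chunk_files)
print(transcript)

The silence_thresh and min_silence_len values usually need tuning per recording, since background noise levels and pause lengths vary a lot between files.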