pythonlessons / mltu

Machine Learning Training Utilities (for TensorFlow and PyTorch)
MIT License

Transcription has no stops between sentences. #38

Closed salman1851 closed 8 months ago

salman1851 commented 8 months ago

Hi! I trained the wav2vec2 model with perfect accuracy on my dataset. When I perform prediction on a full audio file, the transcriptions have no gaps between separate sentences. For example, in "transfer you to our new sales line please hold for a moment i will transfer you overthank you youre welcome stay in the linethank you for calling this call may be recorded", there should be gaps between 'over', 'thank', 'line' and 'thank'. I could just run this script for diarized segments of the original wav file, but I want to be able to transcribe the complete audio file in one go.

I am using 'mltu==1.1.7'. Here is the code for making predictions.

import numpy as np
import librosa

from mltu.inferenceModel import OnnxInferenceModel
from mltu.utils.text_utils import ctc_decoder


class Wav2vec2(OnnxInferenceModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def predict(self, audio: np.ndarray):
        # Add a batch dimension and cast to the dtype the ONNX model expects
        audio = np.expand_dims(audio, axis=0).astype(np.float32)

        # Run inference; the first output holds the per-frame character logits
        preds = self.model.run(None, {self.input_name: audio})[0]

        # Collapse repeated characters and blanks with the CTC decoder
        text = ctc_decoder(preds, self.metadata["vocab"])[0]

        return text


model = Wav2vec2(model_path="Models/10_wav2vec2_torch/202310311600/model.onnx")

audio_file_path = '/media/ee/New Volume/mltu/Tutorials/10_wav2vec2_torch/Datasets/comcast_xfinity_full_audios/1.wav'
audio, sr = librosa.load(audio_file_path, sr=16000)
prediction_text = model.predict(audio)

print('predicted transcript: ', prediction_text)
pythonlessons commented 8 months ago

While training, you used the following vocabulary:

vocab = [' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Yes, it doesn't have stops, so you should include a dot (or whatever separator character you want).
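For illustration, extending the vocabulary above with a '.' entry might look like this (a sketch only; the model must also be retrained with labels that actually contain the new character):

```python
# Original training vocabulary from the thread, with '.' added as a
# sentence separator (hypothetical extension, not the repo's default).
vocab = [' ', "'", '.',
         'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
         'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
```

The character only becomes usable at inference time if it appears in the training transcriptions, since the CTC head can only emit symbols it was trained on.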

salman1851 commented 8 months ago

I'm sorry - by stops I meant gaps, as in empty space between words. I'm not interested in full stops or periods (which are denoted by the "." character). I just want the transcription to have empty spaces between speaking events. As shown in the example above, it seems that there is no gap between successive sentences (which come from separate speaking events in a full audio file).

pythonlessons commented 8 months ago

Let me know how you separate these different sentences in your training data, then. If you don't train the model to separate speech into sentences, it can't do it.

salman1851 commented 8 months ago

I see. So, you're saying that since there is no blank space at the end of the sentence in each of my training instances, when I run the full audio through the model, there will be no blank spaces between sentences in the predicted text. It makes sense, because my dataset is defined in such a way that each training example is just one sentence. All I need to do now is add an extra blank space character at the end of each of my training example transcriptions, and we're good to go. Is that correct?
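The fix described above can be sketched as a small preprocessing step over the label list (the transcriptions and variable names here are hypothetical, not from the actual dataset):

```python
# Hypothetical training transcriptions, one sentence per example.
labels = [
    "please hold for a moment i will transfer you over",
    "thank you youre welcome stay on the line",
    "thank you for calling this call may be recorded",
]

# Append a trailing space to each label so the model learns to emit a
# separator at the end of every utterance.
labels_with_sep = [t if t.endswith(" ") else t + " " for t in labels]
```

Note this only teaches the model a separator at utterance boundaries it has seen during training; as discussed below in the thread, whether it generalizes to boundaries inside a long unseen recording is not guaranteed.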

pythonlessons commented 8 months ago

Well, in theory, yes, this should work. But I would use "." for the end of a sentence. It would also be great to have audio examples containing two different sentences separated by "." in the transcription. I hope you understand what I mean.
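One way to build such multi-sentence training examples is to concatenate two existing clips with a short silence gap and join their transcriptions with ". " (a sketch with synthetic placeholder audio; clip lengths, gap duration, and labels are assumptions):

```python
import numpy as np

sr = 16000  # sample rate used throughout the thread

# Placeholder one-second "clips"; in practice these would come from
# librosa.load() on two single-sentence training files.
clip_a = np.zeros(sr, dtype=np.float32)
clip_b = np.zeros(sr, dtype=np.float32)

# A 300 ms silence gap to mimic the pause between speaking events.
gap = np.zeros(int(0.3 * sr), dtype=np.float32)

combined_audio = np.concatenate([clip_a, gap, clip_b])
combined_label = "i will transfer you over" + ". " + "thank you youre welcome"
```

Training on pairs like `(combined_audio, combined_label)` gives the model direct evidence of what a sentence boundary sounds like, rather than relying on it to infer one from single-sentence examples.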

salman1851 commented 8 months ago

I know exactly what you mean.

pythonlessons commented 8 months ago

> I know exactly what you mean.

Please close this issue if you consider it resolved.

salman1851 commented 8 months ago

Okay. I'll make a new dataset with "." and examples containing more than one sentence. If I have any issues with that, I'll let you know.