rolczynski / Automatic-Speech-Recognition

🎧 Automatic Speech Recognition: DeepSpeech & Seq2Seq (TensorFlow)
GNU Affero General Public License v3.0

How to transcribe from wav audio? #25

Open wahyubram82 opened 4 years ago

wahyubram82 commented 4 years ago

Dear Mr. Rolczynski, I have tried to train with my language (Bahasa Indonesia). Because my language's characters are identical to the English alphabet, when I do training (with augmentation) I set the alphabet to English.

I just tried 5 epochs; the settings are the same as the example you gave me.

The WER and CER records for the 5 epochs are each similar:

917s 738ms/step - loss: -0.6931
2586s 2s/step - loss: -0.1466 - val_loss: -0.6931

Then when I run:

wer, cer = asr.evaluate.calculate_error_rates(pipeline, test_dataset)
print(f'WER: {wer}  CER: {cer}')

the result is: WER: 1.0 CER: 1.0

The training produces some files: 'alphabet.bin', 'decoder.bin', 'feature_extractor.bin', and 'model.h5'.

From the PyPI documentation, I understand that the way to transcribe is:

import automatic_speech_recognition as asr

file = 'to/test/sample.wav'  # sample rate 16 kHz, and 16 bit depth
sample = asr.utils.read_audio(file)
pipeline = asr.load('deepspeech2', lang='en')
pipeline.model.summary()     # TensorFlow model
sentences = pipeline.predict([sample])

But as far as I can see, the script tries to load the deepspeech2 model from TensorFlow, and I don't see any manual on how to use all the files produced by the training process. The error is:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-185fe2c59fa1> in <module>
      5 pipeline = asr.load('deepspeech2', lang='en',version=0.1)
      6 pipeline.model.summary()     # TensorFlow model
----> 7 sentences = pipeline.predict([sample])

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/pipeline/ctc_pipeline.py in predict(self, batch_audio, **kwargs)
     92     def predict(self, batch_audio: List[np.ndarray], **kwargs) -> List[str]:
     93         """ Get ready features, and make a prediction. """
---> 94         features = self._features_extractor(batch_audio)
     95         batch_logits = self._model.predict(features, **kwargs)
     96         decoded_labels = self._decoder(batch_logits)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/feature_extractor.py in __call__(self, batch_audio)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/feature_extractor.py in <listcomp>(.0)
      8     def __call__(self, batch_audio: List[np.ndarray]) -> np.ndarray:
      9         """ Extract features from the file list. """
---> 10         features = [self.make_features(audio) for audio in batch_audio]
     11         X = self.align(features)
     12         return X.astype(np.float16)

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/features/spectrogram.py in make_features(self, audio)
     28         audio = self.pad(audio) if self.pad_to else audio
     29         frames = python_speech_features.sigproc.framesig(
---> 30             audio, self.frame_len, self.frame_step, self.winfunc
     31         )
     32         features = python_speech_features.sigproc.logpowspec(

~/.local/lib/python3.6/site-packages/python_speech_features/sigproc.py in framesig(sig, frame_len, frame_step, winfunc)
     31 
     32     zeros = numpy.zeros((padlen - slen,))
---> 33     padsignal = numpy.concatenate((sig,zeros))
     34 
     35     indices = numpy.tile(numpy.arange(0,frame_len),(numframes,1)) + numpy.tile(numpy.arange(0,numframes*frame_step,frame_step),(frame_len,1)).T

ValueError: all the input arrays must have same number of dimensions
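(A side note: since the ValueError comes from numpy.concatenate inside framesig, I also tried a quick sanity check on the audio array itself. The shape check and the mono conversion below are only my own guess, not something from the package docs:)

import numpy as np
import automatic_speech_recognition as asr

sample = asr.utils.read_audio('to/test/sample.wav')
print(sample.shape, sample.dtype)   # I expect a 1-D float array here

# if the wav is stereo, the array is 2-D and framesig cannot concatenate it
# with the 1-D zero padding, so average the channels down to mono first
if sample.ndim > 1:
    sample = np.mean(sample, axis=1)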

I think the error is caused by TensorFlow not reading the model that was already built by the training process. I mean, there is no code that refers to / binds the trained model files... I already tried to understand your code and to reproduce it based on some of the code in your script, e.g.:

mymodel = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/model.h5')
alphabet = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/alphabet.bin')
decoder = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/decoder.bin')

and it still errors:


UnpicklingError                           Traceback (most recent call last)
<ipython-input-15-dc0c73ed9f58> in <module>
----> 1 mymodel = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/model.h5')
      2 alphabet = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/alphabet.bin')
      3 decoder = asr.utils.load('/home/bram/Documents/coding/speech/MyDeepSpeech/save_model/decoder.bin')

~/Documents/coding/speech/Automatic-Speech-Recognition/automatic_speech_recognition/utils/utils.py in load(file_path)
     16     """ Load arbitrary python objects from the pickled file. """
     17     with open(file_path, mode='rb') as file:
---> 18         return pickle.load(file)
     19 
     20 

UnpicklingError: invalid load key, 'H'.
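(If I had to guess, the three .bin files are pickles, so asr.utils.load should work on them, while model.h5 is a Keras HDF5 file, which is why pickle.load fails on the 'H' of the HDF5 header. This is what I imagine the loading should look like; the get_deepspeech2 arguments are just the ones from the training example, and everything here is my own assumption, not something I found in a manual:)

import tensorflow as tf
import automatic_speech_recognition as asr

save_dir = '/home/bram/Documents/coding/speech/MyDeepSpeech/save_model'

# the .bin files are pickled python objects, so asr.utils.load handles them
alphabet = asr.utils.load(f'{save_dir}/alphabet.bin')
decoder = asr.utils.load(f'{save_dir}/decoder.bin')
features_extractor = asr.utils.load(f'{save_dir}/feature_extractor.bin')

# model.h5 holds Keras weights, not a pickle: rebuild the same architecture
# that was used for training (rnn_units must match) and load only the weights
model = asr.model.get_deepspeech2(input_dim=160, output_dim=29, rnn_units=800)
model.load_weights(f'{save_dir}/model.h5')

# I assume the pipeline can be put back together from these parts
# (the optimizer is probably only needed for training)
optimizer = tf.optimizers.Adam(learning_rate=1e-4)
pipeline = asr.pipeline.CTCPipeline(alphabet, features_extractor, model,
                                    optimizer, decoder)

sample = asr.utils.read_audio('to/test/sample.wav')
print(pipeline.predict([sample]))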

Because I don't understand how to use the model, maybe I can ask some questions:

  1. I checked the functions in pipeline.model with dir(pipeline.model) after running: pipeline = asr.load('deepspeech2', lang='en', version=0.1)

     These are all the attributes inside it: 'activity_regularizer', 'add_loss', 'add_metric', 'add_update', 'add_variable', 'add_weight', 'apply', 'build', 'built', 'call', 'compile', 'compute_mask', 'compute_output_shape', 'compute_output_signature', 'count_params', 'dtype', 'dynamic', 'evaluate', 'evaluate_generator', 'fit', 'fit_generator', 'from_config', 'get_config', 'get_input_at', 'get_input_mask_at', 'get_input_shape_at', 'get_layer', 'get_losses_for', 'get_output_at', 'get_output_mask_at', 'get_output_shape_at', 'get_updates_for', 'get_weights', 'inbound_nodes', 'input', 'input_mask', 'input_names', 'input_shape', 'input_spec', 'inputs', 'layers', 'load_weights', 'losses', 'metrics', 'metrics_names', 'name', 'name_scope', 'non_trainable_variables', 'non_trainable_weights', 'optimizer', 'outbound_nodes', 'output', 'output_mask', 'output_names', 'output_shape', 'outputs', 'predict', 'predict_generator', 'predict_on_batch', 'reset_metrics', 'reset_states', 'run_eagerly', 'sample_weights', 'save', 'save_weights', 'set_weights', 'state_updates', 'stateful', 'submodules', 'summary', 'supports_masking', 'test_on_batch', 'to_json', 'to_yaml', 'train_on_batch', 'trainable', 'trainable_variables', 'trainable_weights', 'updates', 'variables', 'weights', 'with_name_scope'. Which one is the function to load / fit the model into memory?

  2. What do WER 1.0 and CER 1.0 from the evaluation script mean? Is that 100% error or 1% error?

  3. How do I transcribe with that model, or do I have to use another repo? If the only output were the *.h5 model file, I don't think I would have many problems with it; I would try to read it with another speech recognition repo based on h5py. But the training also produces other files, namely 'alphabet.bin', 'decoder.bin', and 'feature_extractor.bin', and I'm not familiar with those files, even though I guess alphabet.bin is similar to a KenLM alphabet. It will still take time to figure out, I think. Maybe you can give an example of how to transcribe using those files.

  4. Not too much to expect, but I have already tried to build speech recognition for my own language, with many failures, including with DeepSpeech 2. Most of the failures were caused by lack of data (I only have a 3-hour dataset from Mozilla Common Voice), which causes an overfitting problem. To solve it, I plan to implement this method, inspired after watching this video on YouTube. That is what led me to your script, a Python script that makes it possible to implement that research.

     By the way, do you have any idea how to do that? I think it only needs an arrangement of the rnn_units number: give it a minimum value that is just enough to create overfitting in the training process, then when overfitting is detected, raise the number to increase the complexity continuously (but without reaching the limit of the GPU memory), until the condition described in the research happens. Then it would be a truly automatic speech recognition engine... (a rough sketch of what I mean is after this list).

     But yeah, if you can give an example code, that would be even more perfect...
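Here is roughly how I imagine the idea from point 4, reusing the components loaded from the .bin files as above. The overfitting check, the rnn_units schedule, the csv file names, and the epoch counts are all my own assumptions, and I am not sure pipeline.fit really returns the Keras History object, so please take this only as a sketch of the idea:

import tensorflow as tf
import automatic_speech_recognition as asr

# hypothetical csv files, in the same format as the training example
dataset = asr.dataset.Audio.from_csv('train.csv', batch_size=32)
dev_dataset = asr.dataset.Audio.from_csv('dev.csv', batch_size=32)

# alphabet, features_extractor, decoder: loaded from the .bin files as above
alphabet = asr.utils.load('save_model/alphabet.bin')
features_extractor = asr.utils.load('save_model/feature_extractor.bin')
decoder = asr.utils.load('save_model/decoder.bin')

rnn_units = 100                          # start small, just enough to overfit
while rnn_units <= 800:                  # stop before the GPU memory limit
    model = asr.model.get_deepspeech2(input_dim=160, output_dim=29,
                                      rnn_units=rnn_units)
    optimizer = tf.optimizers.Adam(learning_rate=1e-4)
    pipeline = asr.pipeline.CTCPipeline(alphabet, features_extractor, model,
                                        optimizer, decoder)
    history = pipeline.fit(dataset, dev_dataset, epochs=5)
    train_loss = history.history['loss'][-1]
    val_loss = history.history['val_loss'][-1]
    if val_loss - train_loss > 0.5:      # my crude "overfitting detected" check
        rnn_units *= 2                   # raise the capacity and train again
    else:
        break

pipeline.save('save_model')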

Sorry for all of my requests, I hope they inspire something... and sorry for my bad English; I don't know how many times I have edited this (after re-reading it)... You can reply to me here or at wahyubram82@gmail.com ... maybe to send some code (just hoping... he..he..).

Cheers... and thanks for what you have made...

rolczynski commented 4 years ago

hey @wahyubram82

I am super happy that you are here. Could you please provide more concrete questions? Please split them into mini problems - as small and abstract as possible.

Well, training a model with such a tiny dataset is a real challenge.