Closed: arpit601 closed this issue 7 years ago
Thanks for noticing, I hadn't updated the README yet.
Please take a look at the new instructions and install the updated code.
Thanks..I will try to run it now.
What WER or accuracy were you getting on LibriSpeech after running this code? And on how many hours of data did you train it?
I am getting this error while running the preprocess command: AttributeError: 'Namespace' object has no attribute 'run_train_dir'
The AttributeError was fixed in https://github.com/timediv/speechT/commit/3e77583aeb2913da5f64c1995083f4c3eac491b4
I was only able to train the whole thing on an Nvidia Titan X for slightly less than a day with the newly tuned hyperparameters. Of course, I used all the training data from OpenSLR.
Using KenLM decoding I get the following results:
LED: 13.97109375 LER: 0.14 WED: 6.28359375 WER: 0.33
LED = Average Letter Edit Distance
LER = Average Letter Error Rate
WED = Average Word Edit Distance
WER = Average Word Error Rate
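For clarity, a minimal sketch of how these metrics can be computed (illustrative only, not the repo's actual evaluation code; `levenshtein` here is a plain dynamic-programming edit distance):

```python
def levenshtein(a, b):
  """Classic dynamic-programming edit distance between two sequences."""
  prev = list(range(len(b) + 1))
  for i, x in enumerate(a, 1):
    cur = [i]
    for j, y in enumerate(b, 1):
      cur.append(min(prev[j] + 1,                 # deletion
                     cur[j - 1] + 1,              # insertion
                     prev[j - 1] + (x != y)))     # substitution
    prev = cur
  return prev[-1]

def error_rates(expected, decoded):
  led = levenshtein(expected, decoded)                  # letter edit distance
  wed = levenshtein(expected.split(), decoded.split())  # word edit distance
  return led, led / len(expected), wed, wed / len(expected.split())

print(levenshtein("kitten", "sitting"))  # -> 3
```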
A few examples:
expected: if a rock or a rivulet or a bit of earth harder than common severed the links of the clew they followed the true eye of the scout recovered them at a distance and seldom rendered the delay of a single moment necessary
decoded: if a rock or a rivolet or a bit of her is hearter than common severed the links of the clue they followed the true eye of the scout recovered them mittedistance and sold and rendered the delay of a single moment necessary
LED: 20 LER: 0.09 WED: 9 WER: 0.21
expected: hose man's excuse for wetting the walk
decoded: how s man's excuse for wetting the walk
LED: 4 LER: 0.11 WED: 2 WER: 0.29
expected: i suppose though it's too early for them then came the explosion
decoded: i suppose though it's too early for them then came the explosion
LED: 0 LER: 0.00 WED: 0 WER: 0.00
expected: one perceives without understanding it a hideous murmur sounding almost like human accents but more nearly resembling a howl than an articulate word
decoded: one perceives without understanding meit a hideous maramer sonding almoset like human act silches but more nearly resembling the howl then in or ticule lyt to red
LED: 31 LER: 0.21 WED: 14 WER: 0.61
I suspect there is room for improvement if we let it converge. Also, the hyperparameters for the AdamOptimizer are not quite numerically stable late in convergence. I was already playing around with epsilon and the learning rate for that optimizer, but it still needs some work. Once I get the GPU resources, I will document my progress. Contributions are always welcome.
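For illustration, the kind of tweak being discussed, assuming a TF 1.x graph (the loss and all values below are placeholders, not the repo's actual settings):

```python
import tensorflow as tf

# Placeholder loss standing in for the model's CTC loss.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

# A larger epsilon keeps Adam's update denominator away from zero late in
# training, trading a bit of adaptivity for numerical stability.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-4)
train_op = optimizer.minimize(loss)
```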
While training, I am getting this error now:
outputs, channels = self._convolution(self.inputs, 48, 2, self.input_size, 250)
File "/speechT/speecht/speech_model.py", line 161, in _convolution
tf.summary.image(layer.name + 'filters', kernel_transposed, max_outputs=3)
AttributeError: 'module' object has no attribute 'image'
I am getting this error: OSError: Failed to interpret file 'data/preprocessed-power/train/6272-70168-0034.npz' as a pickle
AttributeError: 'module' object has no attribute 'image'
You need to upgrade tensorflow to the most recent version.
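If in doubt, a quick check of what your installation exposes (older TensorFlow releases did not have `tf.summary.image`; they used `tf.image_summary` instead, which is exactly what triggers this AttributeError):

```python
import tensorflow as tf

print(tf.__version__)
# Should print True on a recent install.
print(hasattr(tf, 'summary') and hasattr(tf.summary, 'image'))
```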
OSError: Failed to interpret file 'data/preprocessed-power/train/6272-70168-0034.npz' as a pickle
Seems like your preprocessing did not run through successfully and the written file is not a valid numpy pickled file. Maybe you aborted it partway through? Did you process all the data first? What does
file data/preprocessed-power/train/6272-70168-0034.npz
give you?
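One way to hunt down corrupt outputs is to try loading every preprocessed file; a small diagnostic sketch, assuming the files were written with numpy:

```python
import glob
import numpy as np

# Report every preprocessed file that numpy cannot read, e.g. one
# truncated by an aborted preprocessing run.
for path in sorted(glob.glob('data/preprocessed-power/train/*.npz')):
  try:
    np.load(path)
  except Exception as e:
    print('corrupt: {} ({})'.format(path, e))
```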
I ran it till the end. I believe I have to run it again.
Make sure you have ffmpeg or another audioread backend installed.
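A quick way to verify that an audioread backend can actually decode your audio (the path below is a placeholder; without ffmpeg or another backend this raises audioread.NoBackendError):

```python
import audioread

# Placeholder path; point this at one of your own audio files.
with audioread.audio_open('data/LibriSpeech/sample.flac') as f:
  print(f.channels, f.samplerate, f.duration)
```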
Thanks. I have been able to run the code, but each training step is taking a lot of time.
Also, the loss value is increasing with each step, whereas it should drop.
What GPU are you using? Also make sure that you installed tensorflow-gpu.
And yes, the network has a lot of parameters, so training naturally takes time. This is the step-time on an Nvidia Titan X GPU:
Begin training
global step 19000 learning rate 0.0001 step-time 17.09 average loss 102.73 perplexity 411991288657141969762058933486912539042250752.00
Model saved
global step 20000 learning rate 0.0001 step-time 7.26 average loss 99.11 perplexity 11070795618573811456890803338274866817138688.00
Model saved
global step 21000 learning rate 0.0001 step-time 6.41 average loss 94.72 perplexity 136338261525992710564485781951704345870336.00
Model saved
global step 22000 learning rate 0.0001 step-time 4.64 average loss 103.05 perplexity 568660212049723857679414506188605833806348288.00
Model saved
global step 23000 learning rate 0.0001 step-time 5.00 average loss 99.39 perplexity 14649199450847560814167458244333624763613184.00
Model saved
global step 24000 learning rate 0.0001 step-time 4.30 average loss 82.31 perplexity 556899845207540292451724653185990656.00
Model saved
global step 25000 learning rate 0.0001 step-time 3.95 average loss 92.39 perplexity 13319465534209216868209685898412026757120.00
Model saved
global step 26000 learning rate 0.0001 step-time 3.73 average loss 87.30 perplexity 82422052954435138385449263977769992192.00
Model saved
global step 27000 learning rate 0.0001 step-time 3.69 average loss 76.83 perplexity 2324585453743597916789873982832640.00
Model saved
global step 28000 learning rate 0.0001 step-time 3.62 average loss 72.66 perplexity 36092129951576674294364291203072.00
Model saved
I am not using a GPU as of now. Do I have to use a GPU?
I highly recommend it; even compared to 48 CPU cores, running on a high-tier GPU may be 20 times faster.
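To verify that TensorFlow actually sees a GPU (this assumes a TF 1.x install with the tensorflow-gpu package plus CUDA/cuDNN):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# With tensorflow-gpu set up correctly, the list should contain a
# /device:GPU:0 entry alongside the CPU.
print([d.name for d in device_lib.list_local_devices()])
print(tf.test.is_gpu_available())
```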
Q1. Which files do I have to change if I have a dataset of my own in .wav format?
Q2. What do you mean by development data?
I am getting this error:
File "speecht-cli", line 74
def _add_language_model_argument(self, parser: argparse.ArgumentParser):
^
SyntaxError: invalid syntax
Concerning the error: use Python 3.
Q1: corpus.py and preprocessing.py
Q2: the validation set, currently only used by the language model parameter search
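The SyntaxError comes from the parameter annotation (`parser: argparse.ArgumentParser`), which only Python 3 understands; a minimal illustration (the flag name here is made up):

```python
import argparse

# Parameter annotations are Python 3 syntax; Python 2 raises
# SyntaxError at the colon after `parser`.
def add_language_model_argument(parser: argparse.ArgumentParser):
    parser.add_argument('--language-model', type=str)
```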
Thanks.
I have been able to modify the code to accommodate my dataset. Preprocessing was successful, but when I try to train it shows this CTC error:
InvalidArgumentError (see above for traceback): Saw a non-null label (index >= num_classes - 1) following a null label, batch: 9 num_classes: 29 labels: 7,4,24,27,6,20,24,18,27,19,7,0,19,27,5,4,1,27,5,14,20,17,27,0,13,3,27,0,27,16,20,0,17,19,4,17,27,2,0,11,11,27,22,8,19,7,27,19,22,4,13,19,24,27,18,4,21,4,13,18,27,8,26,12,27,13,8,13,4,27,4,8,6,7,19,24,27,1,8,3,27,0,19,27,19,4,13,27,22,8,19,7,27,0,27,18,4,11,11,4,17,27,9,20,18,19,27,1,4,27,0,19,27,19,4,13,27,1,4,7,8,13,3
[[Node: training/CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](transpose, DeserializeManySparse, DeserializeManySparse:1, training/floordiv)]]
@timediv I have been able to solve the above issue by cleaning the transcripts of the audio files.
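For anyone hitting the same CTC error: with num_classes of 29, every label must stay in range 0..27 (28 is reserved for the CTC blank), so any out-of-alphabet character in a transcript breaks the loss. A cleaning step along these lines helps (a sketch assuming an a-z / apostrophe / space alphabet, which matches the label ids above but may not be speechT's exact mapping):

```python
import re

# Assumed alphabet: a-z -> 0..25, apostrophe -> 26, space -> 27,
# CTC blank -> 28. Everything else is dropped.
OUTSIDE_ALPHABET = re.compile(r"[^a-z' ]")

def clean_transcript(text):
  """Lowercase and strip every character outside the 28-symbol alphabet."""
  text = OUTSIDE_ALPHABET.sub('', text.lower())
  return re.sub(r' +', ' ', text).strip()  # collapse repeated spaces

print(clean_transcript("Hey, Guys!"))  # -> "hey guys"
```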
After running the preprocess step, I tried running training.py and nothing happens. There is nothing inside training.py to start the training.