gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Number of epochs needed on 3000 hours of data #96

Closed: cogmeta closed this issue 4 years ago

cogmeta commented 4 years ago

Probably this question is best suited for the Kaldi forum, but I wanted @gooofy's opinion. I have been training new models on 3000 hours of data: all the data the Zamia models export, plus an additional 2000 hours of proprietary data (which is relatively clean). We trained the tdnn_250 model for 8 epochs, but unfortunately the accuracy was disappointing: %WER 22.13 [ 557395 / 2518522, 69722 ins, 123201 del, 364472 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_9_0.0. The model performs as well as, if not better than, the last released model on live transcription, so the poor accuracy result is baffling.

@gooofy Do you think something went wrong, or do we just need to train for more epochs?
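For reference, Kaldi's `%WER` line is simply (insertions + deletions + substitutions) divided by the number of reference words; the figure quoted above can be checked with basic arithmetic (a minimal sketch, not tied to any Kaldi tooling):

```python
# Sanity-check a Kaldi %WER line: WER = (ins + del + sub) / ref_words.
ins, dels, subs = 69722, 123201, 364472  # counts from the decode above
ref_words = 2518522                      # total words in the reference

errors = ins + dels + subs
wer = 100.0 * errors / ref_words
print(f"{errors} errors -> %WER {wer:.2f}")  # -> 557395 errors -> %WER 22.13
```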

cogmeta commented 4 years ago

%WER 24.16 [ 608452 / 2518522, 93422 ins, 109587 del, 405443 sub ]
%WER 23.93 [ 602679 / 2518522, 46333 ins, 197628 del, 358718 sub ]
%WER 23.63 [ 595183 / 2518522, 40322 ins, 207093 del, 347768 sub ]
%WER 23.47 [ 591142 / 2518522, 37202 ins, 212258 del, 341682 sub ]
%WER 23.30 [ 586840 / 2518522, 97753 ins, 100098 del, 388989 sub ]
%WER 23.22 [ 584865 / 2518522, 50843 ins, 174562 del, 359460 sub ]
%WER 23.18 [ 583848 / 2518522, 88503 ins, 108558 del, 386787 sub ]
%WER 23.14 [ 582885 / 2518522, 80105 ins, 117637 del, 385143 sub ]
%WER 22.94 [ 577803 / 2518522, 45572 ins, 180918 del, 351313 sub ]
%WER 22.83 [ 574978 / 2518522, 43309 ins, 183909 del, 347760 sub ]
%WER 22.68 [ 571166 / 2518522, 56320 ins, 154026 del, 360820 sub ]
%WER 22.59 [ 568898 / 2518522, 87144 ins, 105520 del, 376234 sub ]
%WER 22.55 [ 567855 / 2518522, 68586 ins, 128154 del, 371115 sub ]
%WER 22.51 [ 566796 / 2518522, 77238 ins, 116264 del, 373294 sub ]
%WER 22.45 [ 565336 / 2518522, 51996 ins, 158619 del, 354721 sub ]
%WER 22.44 [ 565148 / 2518522, 50524 ins, 160324 del, 354300 sub ]
%WER 22.34 [ 562658 / 2518522, 58939 ins, 141981 del, 361738 sub ]
%WER 22.31 [ 561801 / 2518522, 62608 ins, 136694 del, 362499 sub ]
%WER 22.22 [ 559639 / 2518522, 77728 ins, 113057 del, 368854 sub ]
%WER 22.22 [ 559583 / 2518522, 67685 ins, 126892 del, 365006 sub ]
%WER 22.19 [ 558970 / 2518522, 59289 ins, 140656 del, 359025 sub ]
%WER 22.13 [ 557395 / 2518522, 69722 ins, 123201 del, 364472 sub ]
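These lines look like a sweep over decoding parameters (cf. the `wer_9_0.0` filename earlier, i.e. LM weight and word insertion penalty). Picking the best result can be done with a small parser in the spirit of Kaldi's `utils/best_wer.sh`; this sketch only assumes the `%WER X [ N / D, ...]` shape shown above:

```python
import re

# Match the leading "%WER 22.13 [ 557395 / 2518522, ..." part of each line.
WER_RE = re.compile(r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+)")

def best_wer(lines):
    """Return (wer, errors, ref_words) for the lowest-WER line."""
    parsed = []
    for line in lines:
        m = WER_RE.search(line)
        if m:
            parsed.append((float(m.group(1)), int(m.group(2)), int(m.group(3))))
    return min(parsed)  # tuples sort by WER first

results = [
    "%WER 24.16 [ 608452 / 2518522, 93422 ins, 109587 del, 405443 sub ]",
    "%WER 22.13 [ 557395 / 2518522, 69722 ins, 123201 del, 364472 sub ]",
    "%WER 22.22 [ 559639 / 2518522, 77728 ins, 113057 del, 368854 sub ]",
]
print(best_wer(results))  # -> (22.13, 557395, 2518522)
```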

gooofy commented 4 years ago

From my experience, 8 epochs should be more than enough (my tdnn_f model was trained for 6 epochs).

I would check two things at this point:

cogmeta commented 4 years ago

I did adapt the language model. I think the quality of some of the data added later is bad. For example, common_voice v2: the WER on it is around 30%, which is horrible.
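One way to quantify which utterances (and hence which of the added corpora) are dragging accuracy down is to compute WER per utterance from Kaldi's scoring output. A rough sketch, assuming the `#csid <correct> <sub> <ins> <del>` summary lines that `wer_per_utt_details.pl` writes under `scoring_kaldi/wer_details/per_utt` (check your Kaldi version's exact format before relying on it):

```python
def worst_utterances(per_utt_lines, top_n=5):
    """Rank utterances by per-utterance WER from Kaldi '#csid' lines."""
    ranked = []
    for line in per_utt_lines:
        fields = line.split()
        # Expected shape: <utt-id> #csid <correct> <sub> <ins> <del>
        if len(fields) == 6 and fields[1] == "#csid":
            utt = fields[0]
            cor, sub, ins, dele = map(int, fields[2:])
            ref_len = cor + sub + dele  # reference word count
            if ref_len == 0:
                continue
            wer = 100.0 * (sub + ins + dele) / ref_len
            ranked.append((wer, utt))
    ranked.sort(reverse=True)
    return ranked[:top_n]

# Hypothetical utterance IDs, for illustration only.
sample = [
    "cv2-utt-001 #csid 3 5 2 1",     # 8 errors / 9 ref words
    "clean-utt-002 #csid 18 1 0 1",  # 2 errors / 20 ref words
]
for wer, utt in worst_utterances(sample):
    print(f"{wer:5.1f}  {utt}")
```

Utterances (or whole corpora) that cluster at the top of this ranking would be the ones worth excluding or re-checking.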

joazoa commented 4 years ago

Filter out the worst with the Kaldi or w2l quality script, then try again. Check how much background noise there is, whether all files start and end with silence, whether the volume is OK, and whether you have clipping going on.
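The simpler of those checks (peak level / clipping, and near-silence at the edges) can be scripted without any ASR tooling at all. A minimal sketch for 16-bit mono WAV files, using only the standard library; the thresholds are arbitrary placeholders and would need tuning for a real corpus:

```python
import array
import wave

def quick_audio_check(path, edge_ms=100):
    """Crude quality probe for a 16-bit mono WAV: peak level and edge energy."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    peak = max(abs(s) for s in samples) / 32768.0
    n_edge = max(1, int(rate * edge_ms / 1000))

    def rms(chunk):
        return (sum(s * s for s in chunk) / len(chunk)) ** 0.5 / 32768.0

    return {
        "clipping_suspected": peak > 0.999,  # samples pinned at full scale
        "leading_silence": rms(samples[:n_edge]) < 0.01,
        "trailing_silence": rms(samples[-n_edge:]) < 0.01,
        "peak": peak,
    }
```

Files flagged with `clipping_suspected`, or missing the expected leading/trailing silence, would be candidates for closer inspection or exclusion.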

cogmeta commented 4 years ago

I checked dozens of them. It is just really bad accents.

I have my own script to remove the bad ones, but which script are you referring to? Can you please send me the link?

cogmeta commented 4 years ago

The tdnn_f model finished just now; the WER is 16%, but it is really good on live transcription.