gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

DNN misalignment #70

Open dpny518 opened 5 years ago

dpny518 commented 5 years ago

I have been trying to do forced alignment with steps/nnet3/align.sh, but the alignment is really off.

Are there any comments on this? I ran the same steps with the other models in Kaldi and they are all fine.

gooofy commented 5 years ago

Please have a look at

https://github.com/gooofy/py-kaldi-asr/issues/28

Could this be similar to the issue you are experiencing?

dpny518 commented 5 years ago

Thank you. How would I pass the frame shift argument, and to which binary?

steps/nnet3/align.sh
linear-to-nbest 
lattice-align-words
nbest-to-ctm 
lattice-to-phone-lattice
nbest-to-ctm

I passed it to nbest-to-ctm, and all it did was scale the times by a factor of 3. That was not the issue: the times are simply off. For example, here are some measurements.
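To illustrate why changing the frame shift only rescales the output, here is a minimal sketch of the frame-to-seconds arithmetic. The function name and the example frame numbers are hypothetical. The 10 ms base frame shift and the subsampling factor of 3 are assumptions about a typical setup, not values read from either model.

```python
def frames_to_seconds(start_frame, num_frames, frame_shift=0.01, subsampling_factor=1):
    """Convert a (start, length) pair in frames to (start, duration) in seconds.

    frame_shift is the feature frame shift in seconds (assumed 10 ms here).
    subsampling_factor is the assumed number of input frames per output frame.
    """
    step = frame_shift * subsampling_factor
    return start_frame * step, num_frames * step


# Hypothetical word spanning 5 output frames, starting at output frame 10.
plain = frames_to_seconds(10, 5)
scaled = frames_to_seconds(10, 5, subsampling_factor=3)

print("without subsampling:", plain)
print("with factor 3:      ", scaled)
```

Every start time and duration is multiplied by the same constant, so the relative placement of the word boundaries does not change. A wrong frame shift can explain times that are uniformly stretched or compressed, but not boundaries that fall in the wrong places.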

True Human Measurement

Word1 .434:.912 Word2 .921:1.199 Word3 1.202:1.553 Word4 1.520:1.837 Word5 1.837:2.26

The pretrained LibriSpeech nnet3 DNN model from Kaldi's site gives this:
        intervals [1]:
            xmin = 0.0
            xmax = 0.690
            text = ""
        intervals [2]:
            xmin = 0.690
            xmax = 0.750
            text = "Word1"
        intervals [3]:
            xmin = 0.750
            xmax = 1.280
            text = "Word2"
        intervals [4]:
            xmin = 1.280
            xmax = 1.610
            text = "Word3"
        intervals [5]:
            xmin = 1.610
            xmax = 1.920
            text = "Word4"
        intervals [6]:
            xmin = 1.920
            xmax = 1.980
            text = "Word5"
        intervals [7]:
            xmin = 1.980
            xmax = 2.622188
While the Zamia model gives this:
        intervals [1]:
            xmin = 0.000
            xmax = 0.020
            text = "Word1"
        intervals [2]:
            xmin = 0.020
            xmax = 1.510
            text = ""
        intervals [3]:
            xmin = 1.510
            xmax = 1.550
            text = "Word2"
        intervals [4]:
            xmin = 1.550
            xmax = 1.580
            text = "Word3"
        intervals [5]:
            xmin = 1.580
            xmax = 1.600
            text = "Word4"
        intervals [6]:
            xmin = 1.600
            xmax = 1.620
            text = "Word5"
        intervals [7]:
            xmin = 1.620
            xmax = 2.622188
            text = ""
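To put a number on how far each alignment is from the manual measurement, here is a rough sketch that computes the mean absolute boundary error per model. The word times are copied from the figures above. The variable and function names are illustrative only and are not part of Kaldi or zamia-speech.

```python
# (start, end) in seconds for each word, copied from the measurements above.
reference = {
    "Word1": (0.434, 0.912), "Word2": (0.921, 1.199), "Word3": (1.202, 1.553),
    "Word4": (1.520, 1.837), "Word5": (1.837, 2.26),
}
librispeech = {
    "Word1": (0.690, 0.750), "Word2": (0.750, 1.280), "Word3": (1.280, 1.610),
    "Word4": (1.610, 1.920), "Word5": (1.920, 1.980),
}
zamia = {
    "Word1": (0.000, 0.020), "Word2": (1.510, 1.550), "Word3": (1.550, 1.580),
    "Word4": (1.580, 1.600), "Word5": (1.600, 1.620),
}


def mean_boundary_error(hypothesis, truth):
    """Mean absolute difference over all word start and end times, in seconds."""
    errors = []
    for word, (true_start, true_end) in truth.items():
        hyp_start, hyp_end = hypothesis[word]
        errors.append(abs(hyp_start - true_start))
        errors.append(abs(hyp_end - true_end))
    return sum(errors) / len(errors)


def longest_word(alignment):
    """Duration in seconds of the longest word in an alignment."""
    return max(end - start for start, end in alignment.values())


for name, alignment in (("librispeech", librispeech), ("zamia", zamia)):
    print(name,
          "mean boundary error: %.3f s" % mean_boundary_error(alignment, reference),
          "longest word: %.3f s" % longest_word(alignment))
```

On these numbers the mean boundary error comes out at about 0.13 s for the LibriSpeech model and about 0.38 s for the Zamia model. The second figure also shows that no word in the Zamia alignment is longer than 0.04 s, so the words are collapsed into a few frames rather than shifted by a constant factor.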