srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Difference in the length between labeled targets and feature samples #116

Closed by razor1179 7 years ago

razor1179 commented 7 years ago

Hi, I noticed that there is a large difference between the number of targets (labels) for a specific utterance in 'labels.tr.gz' and the number of feature samples used for training, i.e. the fbank feature frames for the same utterance. I initially assumed that each target value corresponded to a certain number of fbank frames, but I cannot find any relationship between the number of labels and the number of feature frames for a specific utterance. Hence I was wondering how the alignment between a phoneme (a label between 1 and 46) and the features extracted for that phoneme is done.

I would also like to know how the weights are arranged in the .nnet file: is it input weights first, then forget cell weights, then output weights, or is there some other order? The same question applies to the biases.

Regards, Deepak

fmetze commented 7 years ago

The training criterion for CTC is defined over the sequence of output symbols, not over individual frames, so there is no hard alignment between output symbols and feature frames. See the original CTC papers for details.
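
To make the point about the missing hard alignment concrete, here is a minimal Python sketch of the CTC collapsing rule (merge repeated symbols, then drop blanks). It is not Eesen code: the blank index 0 and the helper names `collapse` and `count_alignments` are assumptions for illustration only. It shows that many different frame-level paths map to the same short label sequence, which is why the number of labels need not relate to the number of frames in any fixed way.

```python
from itertools import product

BLANK = 0  # assumed blank index for this sketch


def collapse(path, blank=BLANK):
    """Map a frame-level path to a label sequence:
    merge consecutive repeats, then drop blanks."""
    labels = []
    prev = None
    for sym in path:
        # Emit a symbol only when it starts a new run and is not the blank.
        if sym != prev and sym != blank:
            labels.append(sym)
        prev = sym
    return labels


def count_alignments(target, num_frames, blank=BLANK):
    """Brute-force count of frame-level paths of length num_frames
    that collapse to target. Only practical for tiny examples."""
    target = list(target)
    # Paths containing any other symbol can never collapse to target,
    # so it is enough to enumerate over the target symbols plus blank.
    symbols = sorted(set(target) | {blank})
    return sum(
        1
        for path in product(symbols, repeat=num_frames)
        if collapse(path, blank) == target
    )


# Two different 6-frame paths, same 2-label target.
print(collapse([0, 1, 1, 0, 2, 2]))  # [1, 2]
print(collapse([1, 1, 1, 1, 2, 0]))  # [1, 2]

# Number of valid 4-frame paths for the 2-label target [1, 2].
print(count_alignments([1, 2], 4))  # 15
```

CTC training sums the probability over all such paths instead of picking one, so a given label is never tied to a particular set of frames. The only constraint visible in this sketch is that the number of frames must be at least the number of labels, plus one extra frame for each pair of identical adjacent labels, since those need a blank between them.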