Difference in length between labeled targets and feature samples

Hi, I noticed a large difference between the number of targets (labels) for a given utterance in 'labels.tr.gz' and the number of feature samples used for training, i.e. the fbank feature frames for the same utterance. I initially assumed that each target value corresponded to a fixed number of fbank frames, but I cannot find any relationship between the number of labels and the number of feature frames for a given utterance. So I was wondering how a phoneme (a label between 1 and 46) is aligned with the features extracted for that phoneme.

I would also like to know how the weights are arranged in the .nnet file: are the input weights first, then the forget, cell, and output weights, or is there some other order? The same question applies to the biases.

Regards, Deepak

srvk / eesen