BenjaminTrapani opened 6 years ago
The edit distance error looks correct as of 3dc66d0304bad2406ce9019279452f3ed77e6efd and after switching from Ubuntu to Windows 10. This was tested on a version of CNTK modified slightly from the commit above to work with CUDA 9 and cuDNN 7 (just a small update to the RNN CUDA API call).
It looks like the one-hot-encoded text labels should be set to 2 at phone boundaries: https://github.com/Microsoft/CNTK/blob/master/Source/SequenceTrainingLib/gammacalculation.h#L333
After encoding all labels as 2 instead of 1, the loss increases to around 10^14 and the optimizer fails to train the model. Does encoding the text labels as phone boundaries make sense given the current CTC implementation?
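To make the encoding question concrete, here is a minimal plain-Python sketch (not the CNTK API) of building a label tensor in the `[classes x sequence x batch]` layout used below, writing a configurable value (1.0 or 2.0) at each label's class index. The function name, the `boundary_value` parameter, and the idea that every label position gets the boundary value are assumptions drawn from the discussion above, not the verified CNTK convention:

```python
import numpy as np

def one_hot_labels(label_seqs, num_classes=257, boundary_value=2.0):
    """Build a [num_classes x seq_len x batch] label tensor.

    boundary_value is written at each label's class index instead of 1.0,
    mirroring the encoding discussed above (an assumption, not the
    verified CNTK convention).
    """
    seq_len = len(label_seqs[0])
    batch = len(label_seqs)
    out = np.zeros((num_classes, seq_len, batch), dtype=np.float32)
    for b, seq in enumerate(label_seqs):
        for t, cls in enumerate(seq):
            out[cls, t, b] = boundary_value
    return out

# One sequence of length 4; class 256 is the blank label.
labels = one_hot_labels([[0, 5, 256, 7]], num_classes=257)
print(labels.shape)     # (257, 4, 1)
print(labels[5, 1, 0])  # 2.0
```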
The following code is used to obtain the CTC loss and edit distance using the C++ API:
I have verified that `labelInput` is formatted correctly and matches the training data. It is of shape [257 x 8 x 4], where 257 is the number of classes (256 + 1 for the blank label), 8 is the sequence length, and 4 is the number of sequences in the batch. Exactly one value on the first axis is 1 (its index indicates the class) and the rest are 0. `blankTokenID=256`. `modelFn` is a linear projection from an `OptimizedRNNStack` without activation, and has the same shape as `labelInput`. The values for loss, edit distance and decoded values for the 1000th training batch are below:

The above decoding is performed manually on the CPU by argmaxing over each row in the resulting sequence. The metrics are obtained from the `ProgressWriter` `OnWriteTrainingSummary` function, which is invoked every 1000 batches. It seems like the edit distance should be non-zero given the decoded values above (the decoded labels do not match the expected training labels). Am I formatting the inputs to `EditDistanceError` incorrectly? Additionally, the model converges to yielding only the blank character, although that is likely an issue with the architecture itself rather than the implementation.
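For reference, the manual decoding and the expected edit distance can be reproduced outside CNTK with a short Python sketch: argmax per frame, then the standard CTC collapse of repeated labels and removal of blanks (`blank=256` follows the setup above), plus a plain Levenshtein distance. This is an independent sanity check, not the CNTK implementation; if the collapsed decode differs from the reference labels, the distance here should indeed be non-zero:

```python
import numpy as np

BLANK = 256  # blankTokenID from the setup above

def greedy_ctc_decode(logits, blank=BLANK):
    """Argmax each frame, collapse repeats, drop blanks (standard CTC greedy decode).

    logits: array of shape [num_classes x seq_len] for a single sequence.
    """
    path = np.argmax(logits, axis=0)
    decoded, prev = [], None
    for p in path:
        if p != prev and p != blank:
            decoded.append(int(p))
        prev = p
    return decoded

def edit_distance(a, b):
    """Plain Levenshtein distance between two label sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

# A model that emits only blanks, as observed above, decodes to the
# empty sequence, so the distance to any non-empty reference is non-zero.
ref = [3, 7, 7, 9]
blank_logits = np.zeros((257, 8), dtype=np.float32)
blank_logits[BLANK, :] = 1.0
print(greedy_ctc_decode(blank_logits))                    # []
print(edit_distance(greedy_ctc_decode(blank_logits), ref))  # 4
```

Note that CNTK's `EditDistanceError` has its own options for collapsing repeats and ignoring tokens, so the value it reports may differ from this plain Levenshtein distance depending on how those are configured.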