Issue opened by alexis-michaud (Oct. 2018)
This relates to #214, in that the word boundary in the training corpus is marked by a space.
"it's important that if users want to explictly predict spaces (in character prediction), then that is accounted for. Probably best with a flag to segment_into_chars() or something similar, which would generate special tokens that represent spaces, such as underscores, for training and decoding. These then would get removed as a postprocessing step."
Since 2018, the model for Na has included tone-group boundaries, but as of now (Oct. 2018) it still disregards word boundaries. A look at story-fold cross-validation materials suggests that longer words have somewhat different acoustic properties, so adding word boundaries to the training data could benefit phoneme and tone recognition.
A first step (suggested by @oadams) could be to produce separate error rates for short words versus longer words, using the word segmentation in the reference transcription as a guide.
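One way to carry out this first step might be: align the hypothesis to the reference at the symbol level, attribute each error to the reference word it falls in, and report an error rate per word-length bucket. A minimal sketch, assuming a non-empty reference; the function names, the `short_max` threshold, and the attribution of insertions to the nearest reference word are all illustrative choices, not an existing implementation.

```python
def levenshtein_ops(ref, hyp):
    """Standard Levenshtein alignment with backtrace.

    Returns a list of (op, ref_index) pairs, op in
    {"match", "sub", "del", "ins"}; insertions are attributed
    to the nearest reference symbol (an arbitrary choice here).
    """
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            ops.append(("match" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", max(i - 1, 0)))
            j -= 1
    return list(reversed(ops))

def error_rates_by_word_length(ref_words, hyp_symbols, short_max=2):
    """Per-bucket error rates: 'short' words (<= short_max symbols)
    versus 'long' words, using the reference word segmentation as
    a guide. ref_words is a list of words, each a list of symbols."""
    ref = [p for w in ref_words for p in w]
    word_of = []  # map each reference-symbol index to its word index
    for wi, w in enumerate(ref_words):
        word_of.extend([wi] * len(w))
    counts = {"short": [0, 0], "long": [0, 0]}  # [errors, ref symbols]
    for op, ri in levenshtein_ops(ref, hyp_symbols):
        bucket = "short" if len(ref_words[word_of[ri]]) <= short_max else "long"
        if op != "match":
            counts[bucket][0] += 1
        if op != "ins":
            counts[bucket][1] += 1
    return {b: e / max(r, 1) for b, (e, r) in counts.items()}
```

With reference words `[["a","b"], ["c","d","e"]]` and hypothesis `["a","b","c","x","e"]`, the single substitution falls in the three-symbol word, giving a short-word error rate of 0 and a long-word error rate of 1/3.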
(Suggested label for this Issue: Yongning Na)