persephone-tools / persephone

A tool for automatic phoneme transcription

[Yongning Na] detection of tone-group boundaries: top-down from tone sequence, or by acoustic model #5

Closed alexis-michaud closed 6 years ago

alexis-michaud commented 6 years ago

The tone group is a really essential unit in Na, because the tone rules only apply within the tone group.

Let's first dream of what would be possible if the tone group boundaries could be identified automatically (but I'll return to the low-hanging fruit at the end of this Issue).

Since the tone group boundaries are indicated in the transcription, would there be a way to add them to the model, so that the acoustic model would try to identify them from the audio? Even if accuracy is low at first, this would be a qualitative leap. Pauses almost always correspond to a tone-group boundary: this holds for the oral and nasal filled pauses, now encoded in a cleaner way than before as əəə and mmm respectively, and silent pauses are also good evidence for tone-group boundaries.
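
To make this concrete, here is a minimal sketch (purely illustrative: this is not persephone's actual preprocessing code, and the function name and symbol handling are my own assumptions) of what keeping the boundary in the training labels could mean: the | simply becomes one more symbol in the label inventory, alongside the filled pauses.

```python
import re

# Sketch only, NOT persephone's actual preprocessing: the idea is just
# that "|" becomes one more label in the inventory, like the filled pauses.
BOUNDARY = "|"
FILLED_PAUSES = {"mmm", "əəə"}

def to_labels(transcription):
    """Flatten a transcription into the label sequence the acoustic model
    would be trained on, keeping '|' as a label of its own."""
    labels = []
    for chunk in transcription.replace(",", " ").split():
        if chunk == BOUNDARY or chunk in FILLED_PAUSES:
            labels.append(chunk)
            continue
        # Split off tone letters; hyphens mark morpheme joins, not labels.
        # (A real phoneme inventory would be needed to split e.g. "zo" into
        # "z" + "o" while keeping "ʈʂʰ" together; this sketch does not.)
        labels.extend(re.findall(r"[˩˧˥]|[^˩˧˥\-]+", chunk))
    return labels

print(to_labels("ə˧ʝi˧-ʂɯ˥ʝi˩, | zo˩no˥"))
# ['ə', '˧', 'ʝi', '˧', 'ʂɯ', '˥', 'ʝi', '˩', '|', 'zo', '˩', 'no', '˥']
```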

This would allow for automatic correction and decrease the tone error rate (TER). For instance, take this example (Benevolence.1). My transcription:
ə˧ʝi˧-ʂɯ˥ʝi˩, | zo˩no˥, | nɑ˩ ʈʂʰɯ˥-dʑo˩, | zo˩no˥, | le˧-ʐwɤ˩

output of mam/persephone in May 2017 (lightly edited): ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ z o ˩ n o ˧ n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ z o ˩ n o ˧ l e ˧ ʐ w ɤ ˩

Let's imagine we get the following output from persephone, with tone-group boundaries added: ə ˧ ʝ i ˧ ʂɯ ˥ ʝ i ˩ | z o ˩ n o ˧ | n ɑ ˩ ʈʂʰ ɯ ˥ dʑ o ˩ | z o ˩ n o ˧ | l e ˧ ʐ w ɤ ˩

Transcription as /z o ˩ n o ˧/ is phonetically good: the tone of /n o ˧/ is non-low, and from a phonetic point of view, that's that. But knowing that there is a tone-group boundary coming after, it can be rewritten as High (˥), on the basis of Rule 6. And that's a gain on the TER scale. Out of the 13 tones, 11 were identified correctly; with this correction, tonal identification would reach 100% (TER: 0%). Victory!
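
Here is a minimal sketch of that correction step, under two assumptions of mine: that Rule 6 can be operationalised as "a group-final ˧ following ˩ is rewritten as ˥" (which is how I read the /zo˩no˧/ example above, not a quotation of the rule from the book), and that TER is computed as a simple per-tone mismatch rate rather than a proper edit distance.

```python
# Sketch of the post-correction idea, assuming (as in the example above) that
# Rule 6 licenses rewriting a mid tone ˧ as high ˥ when it is the last tone
# before a tone-group boundary and follows a low tone ˩.

def apply_rule6(tones_with_boundaries):
    out = list(tones_with_boundaries)
    for i, t in enumerate(out):
        last_in_group = i + 1 == len(out) or out[i + 1] == "|"
        after_low = i > 0 and out[i - 1] == "˩"
        if t == "˧" and last_in_group and after_low:
            out[i] = "˥"
    return out

def tone_error_rate(hyp, ref):
    """Crude per-position mismatch rate; a real TER would use edit distance."""
    return sum(h != r for h, r in zip(hyp, ref)) / len(ref)

# Tones of the reference transcription and of the imagined boundary-augmented output:
ref = ["˧","˧","˥","˩","|","˩","˥","|","˩","˥","˩","|","˩","˥","|","˧","˩"]
hyp = ["˧","˧","˥","˩","|","˩","˧","|","˩","˥","˩","|","˩","˧","|","˧","˩"]
strip = lambda seq: [t for t in seq if t != "|"]

print(tone_error_rate(strip(hyp), strip(ref)))               # 2/13 before correction
print(tone_error_rate(strip(apply_rule6(hyp)), strip(ref)))  # 0/13 after correction
```

On the example above, this takes the tone errors from 2 out of 13 down to 0.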

A back-and-forth process between tone-group boundary identification and tonal-string identification could be imagined.

Reading the "narrative" paper, Martine Adda-Decker seemed optimistic about the possibility of identifying tone-group boundaries from the audio (somewhat similar to intonational phrasing in English or French) with reasonably good accuracy. (Of course I have no idea how that could be added to persephone.)

Another possibility would be to conduct top-down detection of tone-group boundaries, followed by a 'sanity check'. Top-down detection of tone-group boundaries could be done on the basis of the tonal string. Thus (theoretical example), suppose we get this string of syllables from persephone:

æ˧ æ˩ æ˩˥ æ˧ æ˥ æ˧ æ˧ æ˧˥

- Since contours (all of which, in Na, are rising) only occur at the right edge of a tone group, boundaries can be added after contours: æ˧ æ˩ æ˩˥ | æ˧ æ˥ æ˧ æ˧ æ˧˥ |
- Next, /æ˧ æ˩ æ˩˥/ can be parsed into either /æ˧ | æ˩ æ˩˥/ or /æ˧ æ˩ | æ˩˥/, because M.L.LH is not well-formed (a 'trough-shaped' sequence). It would be beautiful if the choice between the 2 possibilities ( σ | σ σ or σ σ | σ ) could be made on the basis of the acoustics, "asking" the "detector" which of the 2 is more plausible statistically.
- And finally, æ˧ æ˥ æ˧ æ˧ æ˧˥ | needs to be parsed into /æ˧ æ˥ | æ˧ æ˧ æ˧˥ |/, because M.H.M is not well-formed (inside a tone group, H can only be followed by L: Rule 4).

(A sketch of the two rule-based steps follows below.)
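Here is a sketch of those two steps (boundaries after contours, then splits where Rule 4 is violated), run on the theoretical example. Only the constraints mentioned in this Issue are implemented, and the trough-shaped case is deliberately left to the acoustic check.

```python
# Sketch of the top-down idea: insert boundaries after contours, then use
# well-formedness constraints to propose further splits. The constraint set
# here is only the subset discussed in this Issue, not the full grammar.

CONTOURS = {"˩˥", "˧˥"}   # rising contours, the only contours in Na

def boundaries_after_contours(tones):
    """Contours only occur at the right edge of a tone group."""
    out = []
    for t in tones:
        out.append(t)
        if t in CONTOURS:
            out.append("|")
    return out

def split_on_rule4(tones):
    """Within a group, H (˥) may only be followed by L (˩); otherwise a
    boundary must intervene, so insert one."""
    out = []
    for i, t in enumerate(tones):
        out.append(t)
        nxt = tones[i + 1] if i + 1 < len(tones) else None
        if t == "˥" and nxt is not None and nxt not in ("˩", "|"):
            out.append("|")
    return out

syllable_tones = ["˧", "˩", "˩˥", "˧", "˥", "˧", "˧", "˧˥"]
step1 = boundaries_after_contours(syllable_tones)
print(step1)                # ['˧', '˩', '˩˥', '|', '˧', '˥', '˧', '˧', '˧˥', '|']
print(split_on_rule4(step1))
# ['˧', '˩', '˩˥', '|', '˧', '˥', '|', '˧', '˧', '˧˥', '|']
# The remaining ambiguity (M.L.LH -> M | L.LH or M.L | LH) would be left
# to an acoustic "detector", as suggested above.
```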

If this theoretical example makes sense to you, I can try to come up with real examples where "top-down" tone-group boundary detection would lead to hypotheses about corrections that need to be made to the tonal string (with the prospect of lowering TER a great deal).

Now back to the low-hanging fruit, supposing we don't have information on tone-group boundaries. An automated search would probably confirm that H.H sequences never occur (an occasional loanword or the like could be the exception). That is because H can only be followed by L inside a tone group, so H.H is not valid inside a tone group and must be parsed as H | ..., and the next tone group can only begin with L or M. So a detected H.H is probably to be analyzed as H | M. But I can't think of many other such generalizations. Provisional conclusion: there is not much low-hanging fruit if tone-group boundaries are not included in the model. (Relevant reading from the book: Chapter 7, pp. 321-328.)
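
A small sketch of that H.H heuristic, for illustration only (the occasional loanword exceptions would have to be filtered out separately, e.g. against a lexicon):

```python
# With no boundary information, re-analyse a recognised H.H sequence as H | M:
# the second H is rewritten as M, since a tone group can only begin with L or M
# and the model judged the syllable non-low.

def fix_h_h(tones):
    out = list(tones)
    for i in range(1, len(out)):
        if out[i - 1] == "˥" and out[i] == "˥":
            out[i] = "˧"   # re-analyse the second H as group-initial M
    return out

print(fix_h_h(["˧", "˥", "˥", "˩"]))   # ['˧', '˥', '˧', '˩']
```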

alexis-michaud commented 6 years ago

Done: the acoustic models created in 2018 now include tone-group boundaries.