persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

Moving window rather than 'hard' chunking into 10-second chunks? #195

Open alexis-michaud opened 6 years ago

alexis-michaud commented 6 years ago

The source recordings are split into 10-second chunks, right? This makes it harder to identify phonemes at the edges of these 10-second chunks: not only is phonemic context lacking (at the 'left' for the first phoneme, at the 'right' for the last), but a phone astraddle the 10-second-chunk boundary gets cut rather brutally 🔪, creating a maimed stub that is harder to identify.

What about using a moving window to smooth the edges? This could improve recognition of sounds found at boundaries: at 0 s, 10 s, 20 s, etc. Instead of 'hard' chunking at those fixed points, Persephone would deal with overlapping chunks: 0–10 s, 5–15 s, 10–20 s, and so on, with each added chunk straddling a boundary of the original chunking.

In case of mismatch between successive windows, for instance if the transcription for 0–10 s ends on a /f/ whereas the one for 5–15 s has /s/ at the same position inside the string, the 'mid-file' transcription (found in the chunk where the target phone keeps its integrity and sits snugly in the middle of a pristine context) would be favoured over the 'maimed-stub' transcription (found in the chunk where the target phone is at, or close to, an edge): the /s/ would be retained, not the /f/.

Of course it is likely to get more complex than this, since probabilities for successive phonemes are not independent of one another. But intuitively it seems clear that there is room for improvement by addressing the transition from one audio chunk to the next. Options for a 'sensitive' choice of boundaries could include detection of long pauses, in-breaths...
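
To make the idea concrete, here is a minimal Python sketch (an illustration only, not the actual Persephone API): a chunker that yields overlapping 10-second windows with a 5-second hop, plus a helper that discards hypotheses sitting too close to an interior window edge, so that each stretch of speech is taken from the window where it is 'mid-file'. The (phoneme, time) pairs are assumed to come from some hypothetical decoder.

```python
# A minimal sketch (not the actual Persephone API): each window's transcription
# is assumed to come back as (phoneme, time_in_seconds) pairs from some
# hypothetical decoder; only the chunking and edge-discarding logic is shown.

WINDOW = 10.0  # seconds per chunk
HOP = 5.0      # seconds between chunk starts, i.e. 5 s of overlap

def overlapping_windows(duration):
    """Yield (start, end) times of overlapping chunks covering the recording."""
    start = 0.0
    while start < duration:
        yield start, min(start + WINDOW, duration)
        start += HOP

def prefer_mid_window(hypotheses, win_start, win_end, duration, margin=1.0):
    """Drop hypotheses within `margin` seconds of an interior window edge:
    those 'maimed stubs' are left to the neighbouring window, where the same
    phone sits in the middle of an intact context."""
    kept = []
    for phoneme, time in hypotheses:
        near_left = win_start > 0.0 and time < win_start + margin
        near_right = win_end < duration and time > win_end - margin
        if not (near_left or near_right):
            kept.append((phoneme, time))
    return kept
```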

shuttle1987 commented 6 years ago

I think this is a really good point you have raised here. Right now I have a bunch of other tasks to do on the UI, but I'll look into this more when I'm done with those. I suspect there's some way of dealing with this by splitting the audio at sections that have a sufficient duration of silence, but there's a bit of work to be done there to get it to work cleanly in all the edge cases.
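
As a rough illustration of splitting at silences, here is a hedged sketch using librosa (an assumed extra dependency, not something Persephone currently does); the 30 dB threshold and the minimum pause length are guesses that would need tuning per corpus.

```python
# Sketch of silence-based splitting using librosa (an assumed dependency).
# top_db and min_pause are guesses that would need per-corpus tuning.
import librosa

def split_at_silences(wav_path, top_db=30, min_pause=0.3):
    """Return (start, end) times, in seconds, of non-silent stretches."""
    samples, rate = librosa.load(wav_path, sr=None)
    intervals = librosa.effects.split(samples, top_db=top_db)  # sample indices
    segments = []
    for start, end in intervals:
        start_s, end_s = start / rate, end / rate
        # Merge with the previous stretch if the pause between them is very short.
        if segments and start_s - segments[-1][1] < min_pause:
            segments[-1] = (segments[-1][0], end_s)
        else:
            segments.append((start_s, end_s))
    return segments
```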

alexis-michaud commented 6 years ago

Yes, 10 seconds is a long time for speech. It should be possible to find and exploit landmarks such as breathing (breath groups).

Maybe add a 'signal processing' label to this issue, as well as to #4, #7, #39 and #111? That way, when a signal processing expert joins the team, it will be possible to list the relevant issues in one fell swoop.

oadams commented 6 years ago

This is a good idea and should be implemented. It would be straightforward to stitch the overlapping windows together using some edit-distance matching to create one contiguous transcription.

Another idea (similar to the breathing idea mentioned above) would be to break on pauses and silences. This doesn't completely resolve the issue, though, since there might still conceivably be segments longer than 10 s with no clearly discernible silence.
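
One way to deal with that caveat would be to combine the two approaches: split at silences first, then fall back to overlapping windows only for stretches that still exceed 10 seconds. A sketch, reusing the hypothetical split_at_silences and overlapping_windows helpers from the earlier sketches:

```python
# Combining the two ideas: prefer pause-delimited segments, and fall back to
# overlapping windows only for stretches still longer than 10 seconds
# (reuses the hypothetical split_at_silences and overlapping_windows above).

MAX_SEGMENT = 10.0  # seconds

def chunk_recording(wav_path):
    """Yield (start, end) chunks: pause-delimited where possible, windowed otherwise."""
    for seg_start, seg_end in split_at_silences(wav_path):
        if seg_end - seg_start <= MAX_SEGMENT:
            yield seg_start, seg_end
        else:
            # No usable pause inside this stretch: fall back to moving windows.
            for win_start, win_end in overlapping_windows(seg_end - seg_start):
                yield seg_start + win_start, seg_start + win_end
```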

alexis-michaud commented 6 years ago

> It would be straightforward to stitch the overlapping windows together using some edit-distance matching to create one contiguous transcription.

Is it really straightforward to stitch when there is overlap between successive audio windows? Edit-distance matching will allow for stitching, but any mismatches in the overlapping portion of the 2 chunks will need to be resolved. For instance if transcription of chunk 0 ends with ... ʈʰ ɯ ˧ z e ˧ m and transcription of chunk 1 begins with ɯ ˧ s e ˧ m ɑ ˧ ɳ ɯ ˩... (assuming an overlap of a few phonemes between the 2 audio windows, and a difference between z and s in the transcriptions), what will be returned to the user: ... ʈʰ ɯ ˧ z e ˧ m ɑ ˧ ɳ ɯ ˩... or ... ʈʰ ɯ ˧ s e ˧ m ɑ ˧ ɳ ɯ ˩...?

How likely is it that the same sounds will be transcribed differently by the software when they are part of different chunks of audio (in this hypothetical example: z vs. s)? That is an empirical question to look at. My guess would be that such cases will occur often (most frequently for tone).
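
One rough way to look at that empirical question: transcribe the overlapping windows, align the two hypotheses for each overlap region, and count the positions where they disagree. A sketch using Python's difflib, with the paired overlap transcriptions as hypothetical input:

```python
# Quantify how often two overlapping windows disagree about the same stretch
# of speech. `pairs` is assumed to be a list of (suffix_of_chunk_A,
# prefix_of_chunk_B) phoneme lists, one pair per overlap region
# (hypothetical input, for illustration only).
import difflib

def disagreement_rate(pairs):
    """Fraction of aligned positions on which two overlapping windows disagree."""
    differing, total = 0, 0
    for a, b in pairs:
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            length = max(i2 - i1, j2 - j1)
            total += length
            if op != 'equal':
                differing += length
    return differing / total if total else 0.0
```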

One could dream of a 'heatmapping' or '3D' display for manual verification, which would have the 1st candidate foregrounded but with the 2nd best candidate visible 'between the lines', as it were. This might also make sense when Persephone is combined with other software such as a phonological well-formedness checker, or a tool that identifies words in lattices of phonemes. It could be good to explore alternative options when the 1st output does not match any word or is not phonologically well-formed.

alexis-michaud commented 6 years ago

To detect sentences (call it "sentence-ish units", like the <S> units in the Pangloss/CoCoON format), identifying breath groups could work very well, because the duration is about right (lots of variation of course, but average values on the order of 3 to 5 seconds have been reported for English). A big advantage in view of the intended user group is that breath groups make good linguistic sense (usually matching junctures in linguistic phrasing). For linguists, it would be cool to have the in-breath locations marked in the automatic transcription (output of Persephone).

Speech processing people would know how hard it is to identify breath groups in the signal, either as such (identifying spectral cues to the in-breath; in ideal cases, a signal from a head-worn microphone in good conditions) or indirectly, as silent pauses or through other cues such as f0 declination within the breath group (when the signal-to-noise ratio is not good enough to allow acoustic detection of the in-breath).

oadams commented 6 years ago

> Is it really straightforward to stitch when there is overlap between successive audio windows? Edit-distance matching will allow for stitching, but any mismatches in the overlapping portion of the 2 chunks will need to be resolved. For instance if transcription of chunk 0 ends with ... ʈʰ ɯ ˧ z e ˧ m and transcription of chunk 1 begins with ɯ ˧ s e ˧ m ɑ ˧ ɳ ɯ ˩... (assuming an overlap of a few phonemes between the 2 audio windows, and a difference between z and s in the transcriptions), what will be returned to the user: ... ʈʰ ɯ ˧ z e ˧ m ɑ ˧ ɳ ɯ ˩... or ... ʈʰ ɯ ˧ s e ˧ m ɑ ˧ ɳ ɯ ˩...?

So there are two parts to this problem. The first is, given two segments A and B, to ensure that the part unique to B immediately follows A (or that the part unique to A immediately precedes B). With any reasonable overlap between the strings and any reasonably low phoneme error rate, this can be done with high confidence using fuzzy string matching. The more A and B overlap time-wise, the less likely a mistake is (exponentially so).
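
A hedged sketch of this first part using Python's standard difflib, with A and B as phoneme lists (an illustration, not the Persephone decoder's actual output format):

```python
# Locate the longest shared block between the end of A and the start of B,
# then append only the material of B that follows it. Assumes the shared
# material is long enough, and the error rate low enough, for fuzzy matching.
import difflib

def stitch(a, b):
    """Return A followed by the part of B not already covered by the overlap."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.size == 0:
        return list(a) + list(b)  # no overlap found: plain concatenation
    return list(a) + list(b[match.b + match.size:])
```

With the /z/ vs. /s/ example above (treating each phoneme and tone symbol as one token), the longest shared block is e ˧ m, so this naive stitcher simply keeps chunk A's /z/; deciding between /z/ and /s/ is the second part.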

The second problem is how to resolve differences. The straightforward approach here is to take the hypothesis with more confidence. The most correct way to do this would involve summing over all the paths in the CTC trellis that correspond to suffixes of A and those corresponding to prefixes of B, and taking the most likely output. The easiest way (which makes the most sense in our context, given that we're doing greedy 1-best decoding) would be to just take all the CTC output probabilities in our 1-best path that correspond to that phoneme in that part of the sequence and sum over them. Then we compare and take the more likely one. This probably doesn't make much sense to the reader, but it's partly a note to myself for the future.
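
And a sketch of the 'easiest way' just described, under the simplifying assumption (for illustration only) that the decoder can report, for each output phoneme, the CTC output probabilities of the frames collapsed into it on the greedy 1-best path:

```python
# Resolve a conflict between two overlapping windows by comparing how
# confident each window was about its own phoneme. Assumes per-frame CTC
# probabilities for each output phoneme are available (an assumption about
# the decoder, made purely for illustration).

def hypothesis_score(frame_probs):
    """Heuristic confidence for one phoneme hypothesis: the sum of the CTC
    output probabilities assigned to it along the 1-best path (summing
    log-probabilities instead would be a natural variant)."""
    return sum(frame_probs)

def resolve_conflict(phoneme_a, frame_probs_a, phoneme_b, frame_probs_b):
    """Keep whichever of two conflicting phonemes its own window was more
    confident about, e.g. /z/ from chunk A vs. /s/ from chunk B."""
    if hypothesis_score(frame_probs_a) >= hypothesis_score(frame_probs_b):
        return phoneme_a
    return phoneme_b
```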

> How likely is it that the same sounds will be transcribed differently by the software when they are part of different chunks of audio (in this hypothetical example: z vs. s)? That is an empirical question to look at. My guess would be that such cases will occur often (most frequently for tone).

It's an interesting empirical question which would actually yield insight into how much the LSTM is relying on long-range information in order to make its decisions. I agree that it would happen most often for tone, but I'm not sure whether it will occur that often.

> One could dream of a 'heatmapping' or '3D' display for manual verification, which would have the 1st candidate foregrounded but with the 2nd best candidate visible 'between the lines', as it were. This might also make sense when Persephone is combined with other software such as a phonological well-formedness checker, or a tool that identifies words in lattices of phonemes. It could be good to explore alternative options when the 1st output does not match any word or is not phonologically well-formed.

Presenting alternative hypotheses via a beautifully displayed lattice or similar is something that's been at the back of my mind for a long time now. This would be great to incorporate into a web front-end, especially in the context of an iterative training pipeline where a linguist's corrections are fed back into model training.

> To detect sentences (call it "sentence-ish units", like the <S> units in the Pangloss/CoCoON format), identifying breath groups could work very well, because the duration is about right (lots of variation of course, but average values on the order of 3 to 5 seconds have been reported for English).

Good to know this figure!

shuttle1987 commented 6 years ago

> Presenting alternative hypotheses via a beautifully displayed lattice or similar is something that's been at the back of my mind for a long time now. This would be great to incorporate into a web front-end, especially in the context of an iterative training pipeline where a linguist's corrections are fed back into model training.

I'd be keen to talk about this UI when we get to the point of a more fully featured front-end implementation.