persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

Making use of audio chunks of more than 10 seconds #230

Open alexis-michaud opened 4 years ago

alexis-michaud commented 4 years ago

Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.

For example, the document “Romanmangan, the fairy from the other world” has a duration of 1,890 seconds and is divided into 212 sentences. Seventy sentences, amounting to more than half of the total duration of this substantial story, exceed the 10-second limit and are therefore not used in training.

A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish silence from non-silence, and then cutting the long waveform at the silent parts into smaller pieces. This way, all the data could still be used for training.
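To make the suggestion concrete, here is a minimal sketch of energy-based silence detection followed by a greedy choice of cut points. It is not Persephone code: the function names, the 20 ms frame size, the RMS threshold and the 200 ms minimum silence length are all assumptions chosen for illustration, and a real VAD would be more robust than a fixed energy threshold.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame (a trailing partial frame is dropped)."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def find_silences(samples, sample_rate, frame_ms=20, threshold=0.01, min_silence_ms=200):
    """Return (start_sec, end_sec) spans where frame RMS stays below `threshold`.

    All default values are illustrative assumptions, not tuned settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    min_frames = max(1, min_silence_ms // frame_ms)
    silences, run_start = [], None
    energies = frame_rms(samples, frame_len)
    # The infinite sentinel closes a silent run that reaches the end of the signal.
    for i, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                silences.append((run_start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            run_start = None
    return silences

def choose_cuts(duration, silences, max_len=10.0):
    """Greedily cut at silence midpoints so each chunk stays within `max_len` where possible.

    If a stretch longer than `max_len` contains no silence, that chunk stays over-long.
    """
    midpoints = [(start + end) / 2 for start, end in silences]
    cuts, chunk_start, candidate = [], 0.0, None
    for point in midpoints + [duration]:
        # Extending the chunk to `point` would exceed the limit: cut at the last midpoint seen.
        if point - chunk_start > max_len and candidate is not None:
            cuts.append(candidate)
            chunk_start = candidate
        candidate = point
    return cuts
```

Cutting at the midpoint of a pause, rather than deleting the pause, keeps the signal intact on both sides of the cut.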

oadams commented 4 years ago

Yeah, detecting voices and breaking on silence is definitely a good angle to take. However, for training data it doesn't fully solve the problem, because we still need to know which parts of the transcription correspond to each chunk. One useful approach would be to do forced alignment as a first step, then chunk based on silence, then feed the chunks into training.
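As a rough sketch of the bookkeeping this implies: assuming a forced aligner has already produced a list of `(token, start_sec, end_sec)` tuples (a hypothetical format, not the output of any particular tool), and assuming cut points have been chosen at silences, each token can be assigned to the chunk that contains its midpoint. The function name and data layout below are illustrative only.

```python
from bisect import bisect_right

def split_transcript(aligned, cuts, duration):
    """Assign aligned tokens to the audio chunks delimited by `cuts`.

    aligned:  list of (token, start_sec, end_sec), assumed to come from a forced aligner
    cuts:     cut times in seconds, assumed to fall inside silences
    duration: total duration of the original audio in seconds
    """
    bounds = [0.0] + sorted(cuts) + [duration]
    chunks = [{"start": bounds[i], "end": bounds[i + 1], "tokens": []}
              for i in range(len(bounds) - 1)]
    for token, start, end in aligned:
        mid = (start + end) / 2
        # Index of the chunk whose [start, end) interval contains the token midpoint.
        idx = bisect_right(bounds, mid) - 1
        idx = min(max(idx, 0), len(chunks) - 1)  # clamp tokens that sit on the outer edges
        chunks[idx]["tokens"].append(token)
    return chunks

# Hypothetical alignment of a 14-second sentence, cut at 5 s and 10 s.
example = split_transcript(
    [("a", 0.5, 1.0), ("b", 2.0, 3.0), ("c", 6.2, 7.0), ("d", 11.0, 12.0)],
    cuts=[5.0, 10.0],
    duration=14.0,
)
```

Each resulting chunk then carries its own audio span and its own slice of the transcription, which is what training needs.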

alexis-michaud commented 4 years ago

👍 Yes, this is a more ambitious and promising approach than what I had in mind. It's the way to go.

I had in mind cases where, once silences are removed, the chunk gets down to under 10 seconds and can be used in the training set without splitting the transcription. Then VAD is enough to cram the chunk into the training set. But implementing the more ambitious solution is better, as it is more general (addressing all cases). Removing silences is not 'clean': it tampers with the original signal and removes useful cues (pauses are part of the structure, and removing them can create acoustic 'monsters').
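For reference, the simpler case described above amounts to a small check, sketched here under the assumption that silent spans are available as `(start_sec, end_sec)` pairs and that the limit is inclusive. The function names are hypothetical, and the sketch only tests whether a chunk would fit; it does not remove anything from the audio.

```python
def speech_duration(duration, silences):
    """Duration in seconds that would remain if all silent spans were removed."""
    silent = sum(end - start for start, end in silences)
    return duration - silent

def fits_after_silence_removal(duration, silences, max_len=10.0):
    """True if the chunk would fall within `max_len` once silences are excluded."""
    return speech_duration(duration, silences) <= max_len
```

A 12-second chunk with 2.5 seconds of pauses would pass this check, whereas a 15-second chunk with a single 1-second pause would not and would need the alignment-based splitting.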