Open Liontooth opened 8 years ago
Dear David,
Incorporating existing timestamps can definitely help! The gentle/multipass.py
file may give some hints on how to assemble a single alignment from many time-limited chunks of a long file.
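As a rough illustration of the assembly step (not the code in gentle/multipass.py), here is a minimal sketch that shifts chunk-relative word timings onto the timeline of the full file. The function name `merge_chunk_alignments` and the chunk layout are hypothetical; the word dictionaries are assumed to carry `start` and `end` in seconds relative to the start of their chunk.

```python
def merge_chunk_alignments(chunks):
    """Combine per-chunk word timings into one list on the full-file timeline.

    chunks: iterable of (offset_seconds, words), where each word is a dict
    whose optional "start"/"end" keys are relative to the chunk start.
    This layout is an assumption made for the sketch.
    """
    merged = []
    # Process chunks in temporal order so the merged list is chronological.
    for offset, words in sorted(chunks, key=lambda chunk: chunk[0]):
        for word in words:
            shifted = dict(word)  # copy, so the input is left untouched
            for key in ("start", "end"):
                if key in shifted:  # unaligned words may have no timing
                    shifted[key] = round(shifted[key] + offset, 3)
            merged.append(shifted)
    return merged


chunks = [
    (30.0, [{"word": "room", "start": 1.0, "end": 1.5}]),
    (0.0, [{"word": "sitting", "start": 0.5, "end": 0.9}]),
]
merged = merge_chunk_alignments(chunks)
```

Overlapping chunks would also need de-duplication at the seams, which this sketch leaves out.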
The Python API is currently in some flux, but I hope that providing programmatic access to Gentle will make it easier to do things like this.
Another suggestion: the "conservative" and "disfluencies" options may further improve results with imperfect transcripts.
See also #78, #81, and #117. I'm imagining that Gentle's alignment search could be restricted to a temporal window defined by existing timestamps, to encourage Gentle to add precision. The source of the input timestamps may be manual annotation (Aegisub, ELAN) or recordings (srt files, closed captions, teletext). These typically provide line-based timings; Gentle could perhaps accept srt input along these lines (it's easy enough to convert other formats to srt):
`1 00:00:20,000 --> 00:00:24,400 In connection with a dramatic decrease in crime in many neighborhoods,
2 00:00:24,600 --> 00:00:28,800 the government is slimming down police departments in a synergy-related headcount restructuring ...`
along with a flag for slack in seconds that would be added to the size of the window on either side.
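To make the idea concrete, here is a minimal sketch of how line-based srt timings plus a slack value could be turned into search windows. Nothing here exists in Gentle; `srt_windows` and its `slack` parameter are hypothetical names, and the parser only handles plain srt cues (index line, timing line, text lines, blank-line separator).

```python
import re

# Matches an srt timing line such as "00:00:20,000 --> 00:00:24,400".
TIMING = re.compile(
    r"(\d+):(\d\d):(\d\d)[,.](\d{3})\s*-->\s*(\d+):(\d\d):(\d\d)[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def srt_windows(srt_text, slack=0.0):
    """Return (window_start, window_end, text) per cue, widened by slack seconds."""
    windows = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                start = max(0.0, to_seconds(*groups[:4]) - slack)  # clamp at file start
                end = to_seconds(*groups[4:]) + slack
                windows.append((start, end, " ".join(lines[i + 1:])))
                break
    return windows


example = """1
00:00:20,000 --> 00:00:24,400
In connection with a dramatic decrease in crime in many neighborhoods,

2
00:00:24,600 --> 00:00:28,800
the government is slimming down police departments
"""
windows = srt_windows(example, slack=2.0)
```

With two seconds of slack, the first cue would be searched between roughly 18.0 and 26.4 seconds, so adjacent windows overlap slightly.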
If Gentle doesn't have anyone to take this on, let me suggest you apply to GSoC 2017; this would be a great task for a summer of code for a skilled student, and Red Hen Lab would be happy to work with you.
A quick test to see how Gentle handles gaps in transcripts reveals the following.
Intact examples/data/lucier.txt showing end of the first sentence and start of the third:
Second sentence removed:
The alignment gradually recovers from the 40-second gap until the tenth word is perfect. Nine badly or imperfectly aligned words is the cost of the transition.
Intact lucier.txt showing end of the first and start of the fourth sentence:
Second and third sentences removed:
Perfect recovery.
So this is impressive and reassuring, yet there is room for improvement. With long gaps and a poor transcript, mistakes will be common.
We have transcripts with timestamps, but they are inexact -- late by 5 to 10 seconds. Could you point us to a way we can feed this information to Gentle to help it handle gaps more robustly?
For any given word, there will be a temporal range of, say, twenty seconds; the search for a match should be limited to this range. The input file should be similar to the current align.csv output --
-- that is to say, each word is given a range within which the search should be performed. Maybe some of the logic for this is already present in the second pass?
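A minimal sketch of the range computation I have in mind, assuming a transcript timestamp that runs late by 5 to 10 seconds and a twenty-second search range. The names `search_range` and `within_range` and the defaults (`lag=7.5`, `half_width=10.0`) are hypothetical illustrations, not anything Gentle provides.

```python
def search_range(stamp, lag=7.5, half_width=10.0):
    """Return (earliest, latest) seconds in which to search for a word.

    stamp: the inexact transcript timestamp, assumed to be late.
    lag: assumed typical lateness (midpoint of the 5 to 10 second range).
    half_width: half of the search range, so 10.0 gives twenty seconds.
    """
    center = stamp - lag  # best guess at when the word was actually spoken
    return (max(0.0, center - half_width), center + half_width)


def within_range(candidate_time, time_range):
    """True if a candidate match time falls inside the allowed range."""
    earliest, latest = time_range
    return earliest <= candidate_time <= latest


word_range = search_range(60.0)
candidates = [12.3, 48.0, 95.1]  # made-up match times for one word
plausible = [t for t in candidates if within_range(t, word_range)]
```

Here only the candidate at 48.0 seconds survives, since the range for a stamp of 60.0 is 42.5 to 62.5 seconds.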
Cheers, David