lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License
Using existing inexact timestamps #103

Open Liontooth opened 8 years ago

Liontooth commented 8 years ago

A quick test to see how Gentle handles gaps in transcripts reveals the following.

Intact examples/data/lucier.txt showing end of the first sentence and start of the third:

now,now,13.68,14.14
...
What,what,56.42,56.660000000000004
you,you,56.660000000000004,56.800000000000004
will,will,56.800000000000004,56.970000000000006
hear,hear,56.970000000000006,57.32000000000001
then,then,57.32000000000001,57.86
the,the,59.63,60.36
natural,natural,60.7,61.660000000000004
resonant,resonant,62.17,63.38
frequencies,frequencies,63.38,64.10000000000001

Second sentence removed:

now,now,13.68,14.12
What,what,15.33,15.42
you,you,17.26,17.330000000000002
will,will,22.19,22.28
hear,hear,27.26,27.37
then,then,27.42,27.82
are,are,62.19,62.519999999999996
the,the,62.55,62.61
natural,natural,62.67,62.86
resonant,resonant,62.86,63.38
frequencies,frequencies,63.38,64.10000000000001

The alignment gradually recovers from the 40-second gap until the tenth word is perfect. Nine badly or imperfectly aligned words is the cost of the transition.

Intact lucier.txt showing end of the first and start of the fourth sentence:

now,now,13.68,14.12
...
I,i,73.06,73.27
regard,regard,73.27,74.02
this,this,75.93,76.21000000000001
activity,activity,76.21,77.1

Second and third sentences removed:

now,now,13.68,14.14
I,i,73.06,73.27
regard,regard,73.27,74.02
this,this,75.93,76.21000000000001
activity,activity,76.21,77.1

Perfect recovery.

So this is impressive and reassuring, yet there is room for improvement. With long gaps and a poor transcript, mistakes will be common.

We have transcripts with timestamps, but they are inexact -- late by 5 to 10 seconds. Could you point us to a way we can feed this information to gentle to help it handle gaps more robustly?

For any given word, there will be a temporal range of, say, twenty seconds; the search for a match should be limited to this range. The input file should be similar to the current align.csv output --

what,46,65
you,46,66
will,47,67

-- that is to say, each word is given a range within which the search should be performed. Maybe some of the logic for this is already present in the second pass?
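A minimal sketch of how such a window file could be read and used to check an alignment, assuming the three-column `word,start,end` layout proposed above with times in seconds. The function names `parse_windows` and `within_window` are hypothetical and are not part of Gentle.

```python
import csv
import io


def parse_windows(text):
    """Parse 'word,start,end' rows into (word, start, end) tuples (seconds)."""
    windows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        word, start, end = row
        windows.append((word, float(start), float(end)))
    return windows


def within_window(window, aligned_start, aligned_end):
    """True if an aligned word lies entirely inside its allowed search window."""
    _, lo, hi = window
    return lo <= aligned_start and aligned_end <= hi


sample = "what,46,65\nyou,46,66\nwill,47,67\n"
windows = parse_windows(sample)

# "What" in the intact transcript was aligned at 56.42-56.66: inside the window.
print(within_window(windows[0], 56.42, 56.66))  # True
# With the second sentence removed it landed at 15.33-15.42: outside the window.
print(within_window(windows[0], 15.33, 15.42))  # False
```

A check like this only flags alignments that fall outside the window after the fact; restricting the search itself would have to happen inside the aligner.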

Cheers, David

strob commented 8 years ago

Dear David,

Incorporating existing timestamps can definitely help! The gentle/multipass.py file may give some hints on how to assemble a single alignment from many time-limited chunks of a long file.
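As a rough illustration of the general idea (not the code in gentle/multipass.py), the sketch below merges word timings from independently aligned chunks into one timeline by adding each chunk's start offset. The `stitch_chunks` function, the chunk boundaries, and the data layout are all assumptions made for the example.

```python
def stitch_chunks(chunks):
    """Merge per-chunk word timings into a single timeline.

    chunks: list of (chunk_start_seconds, words), where words is a list of
    (word, start, end) tuples with times relative to the start of the chunk.
    """
    merged = []
    # Process chunks in temporal order so the output is sorted by time.
    for chunk_start, words in sorted(chunks, key=lambda c: c[0]):
        for word, start, end in words:
            merged.append(
                (word, round(chunk_start + start, 2), round(chunk_start + end, 2))
            )
    return merged


# Hypothetical chunks cut at 10 s and 70 s of the recording.
chunks = [
    (70.0, [("I", 3.06, 3.27), ("regard", 3.27, 4.02)]),
    (10.0, [("now", 3.68, 4.14)]),
]
for word, start, end in stitch_chunks(chunks):
    print(word, start, end)
```

In practice the chunks would be cut from the audio using the existing inexact timestamps plus some padding, and each chunk aligned only against the transcript text expected in that range.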

The Python API is currently in some flux, but I hope that providing programmatic access to Gentle will make it easier to do things like this.

Another suggestion: the "conservative" and "disfluencies" options may further improve results with imperfect transcripts.

Liontooth commented 7 years ago

See also #78, #81, and #117. I'm imagining that Gentle's alignment search could be restricted to a temporal window defined by existing timestamps, to encourage Gentle to add precision. The source of the input timestamps may be manual annotation (Aegisub, ELAN) or recordings (srt files, closed captions, teletext). These typically provide line-based timings; Gentle could perhaps accept srt input along these lines (it's easy enough to convert other formats to srt):

`1
00:00:20,000 --> 00:00:24,400
In connection with a dramatic decrease in crime in many neighborhoods,

2
00:00:24,600 --> 00:00:28,800
the government is slimming down police departments in a synergy-related headcount restructuring ...`

along with a flag for slack in seconds that would be added to the size of the window on either side.
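A minimal sketch of how srt cues could be turned into search windows widened by a slack value, assuming standard `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing lines. The `srt_windows` function and its `slack` parameter are hypothetical; Gentle does not currently expose such an option.

```python
import re

# Matches an srt timing line such as "00:00:20,000 --> 00:00:24,400".
CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert srt time fields (strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def srt_windows(srt_text, slack=0.0):
    """Return (start, end, text) per cue, widened by `slack` seconds on each side."""
    windows = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match:
                fields = match.groups()
                start = max(0.0, to_seconds(*fields[:4]) - slack)  # clamp at 0
                end = to_seconds(*fields[4:]) + slack
                text = " ".join(part.strip() for part in lines[i + 1:])
                windows.append((start, end, text))
                break
    return windows


sample = """1
00:00:20,000 --> 00:00:24,400
In connection with a dramatic decrease in crime in many neighborhoods,

2
00:00:24,600 --> 00:00:28,800
the government is slimming down police departments in a synergy-related headcount restructuring ...
"""

for start, end, text in srt_windows(sample, slack=5.0):
    print(start, end, text)
```

With captions that run 5 to 10 seconds late, an asymmetric slack (more before the cue than after it) might fit better than the symmetric one used here.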

If Gentle doesn't have anyone to take this on, let me suggest you apply to GSoC 2017; this would be a great summer-of-code task for a skilled student, and Red Hen Lab would be happy to work with you.