lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

Not understanding the returned results, is there a more thorough documentation? #292

Open xiabingquan opened 3 years ago

xiabingquan commented 3 years ago

The JSON files returned by Gentle contain many keys such as "start", "startOffset", "end", "endOffset", and others. Could anyone tell me the exact meaning of these keys, and how to compute the timing labels of words from them?

lilgandhi1199 commented 3 years ago

Start and End = time in seconds. Start and End Offsets are literally the character positions in the transcript you supplied where the word's first letter and its last letter appear.
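To make that concrete, here is a minimal sketch of turning the JSON into word-level timing labels. The `SAMPLE` data is hand-written for illustration and is not real Gentle output, and `word_labels` is a hypothetical helper name. It assumes the offsets form a half-open slice of the transcript and that words which were not aligned have no `start`/`end`, so they are skipped.

```python
import json

# Hypothetical, hand-written sample in the shape Gentle returns (not real output).
SAMPLE = json.loads("""
{
  "transcript": "Hello big world",
  "words": [
    {"word": "Hello", "alignedWord": "hello", "case": "success",
     "start": 0.25, "end": 0.75, "startOffset": 0, "endOffset": 5},
    {"word": "big", "case": "not-found-in-audio",
     "startOffset": 6, "endOffset": 9},
    {"word": "world", "alignedWord": "world", "case": "success",
     "start": 1.0, "end": 1.5, "startOffset": 10, "endOffset": 15}
  ]
}
""")


def word_labels(result):
    """Return (text, start_sec, end_sec) for every successfully aligned word."""
    transcript = result["transcript"]
    labels = []
    for w in result["words"]:
        if w.get("case") != "success":
            continue  # assumed: unaligned words have no start/end times
        # Assumption: [startOffset, endOffset) is a half-open character slice.
        text = transcript[w["startOffset"]:w["endOffset"]]
        labels.append((text, w["start"], w["end"]))
    return labels


for text, start, end in word_labels(SAMPLE):
    print(f"{start:.2f}\t{end:.2f}\t{text}")
```

The tab-separated start/end/text layout is just one convenient label format; the tuples can be written out in whatever form your tooling expects.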

natelawrence commented 4 months ago

| Name | Type | Relationship | Purpose |
| --- | --- | --- | --- |
| transcript | string | Top-level variable | Contains the full transcript plain-text exactly as you pasted it in (or as it was generated by Gentle's automatic speech recognition) |
| words | array | Top-level variable | Contains an array of word objects with timing and phoneme data for each word in the transcript |
| word | string | Child of words[] | Current word, as written/capitalized in the transcript |
| alignedWord | string | Child of words[] | Current word (all-lowercase) as stored in Gentle's pronunciation dictionary (`<unk>` means "unknown", i.e. the current word is not in Gentle's dictionary and gets rendered as OOV ("Out Of Vocabulary") in its phoneme readout space in the output HTML page) |
| case | string | Child of words[] | Indicates if the current word was successfully aligned (`success`) or not found in the audio (`not-found-in-audio`) |
| start | number | Child of words[] | The start time of the current word in seconds |
| end | number | Child of words[] | The end time of the current word in seconds |
| startOffset | number | Child of words[] | The character offset in the transcript string where the current word begins |
| endOffset | number | Child of words[] | The character offset in the transcript string where the current word ends |
| phones | array | Child of words[] | Contains an array of phoneme objects for the current word |
| phone | string | Child of phones[] | The ARPAbet phoneme label, which includes the phoneme name and a suffix indicating its position in the current word (`_B` for beginning, `_I` for inside, `_E` for end, `_S` for single-phoneme words) |
| duration | number | Child of phones[] | The duration of the phoneme in seconds |
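
Since each phone only carries a `duration`, per-phoneme start and end times have to be derived. A rough sketch, assuming the phones of a word are laid end to end starting at the word's `start`: keep a running total of the durations. The `SAMPLE_WORD` values are made up for illustration and `phone_intervals` is a hypothetical helper, not part of Gentle.

```python
# Hypothetical word object in the shape described above (values are made up).
SAMPLE_WORD = {
    "word": "hi",
    "case": "success",
    "start": 0.5,
    "end": 1.0,
    "phones": [
        {"phone": "hh_B", "duration": 0.25},
        {"phone": "ay_E", "duration": 0.25},
    ],
}


def phone_intervals(word):
    """Return (label, start_sec, end_sec) for each phone of one aligned word."""
    intervals = []
    t = word["start"]  # assumed: the first phone begins where the word begins
    for p in word.get("phones", []):
        # Drop the position suffix (_B, _I, _E, _S) to get the bare phoneme name.
        label = p["phone"].rsplit("_", 1)[0]
        intervals.append((label, t, t + p["duration"]))
        t += p["duration"]
    return intervals


for label, start, end in phone_intervals(SAMPLE_WORD):
    print(f"{label}\t{start:.2f}\t{end:.2f}")
```

Only call this on words whose `case` is `success`, because the sketch reads `word["start"]` directly and will raise a `KeyError` if the word has no timing.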