lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

Matching the timing information in the CSV to a complete transcript word list... #78

Open natelawrence opened 8 years ago

natelawrence commented 8 years ago

I apologize for asking such a simple question, but I would like to take the timing information from the words which Gentle has matched (as represented in the CSV output file) and align those with a list of every single word in the transcript.

That is to say, I wish to have a similar CSV file where every single word in the transcript is seen in the order that it occurred. (This is not complicated. I can derive this by searching for spaces, hyphens which are directly adjacent to alpha characters, and periods/full stops which are directly followed by an alpha character and replacing them with said character (space, hyphen, or period) plus a line break in a word processor.)
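The splitting rule described in the parenthetical above can be sketched in Python instead of a word processor. This is only an illustration of that rule as stated (break after whitespace, after a hyphen between letters, and after a period directly followed by a letter); the function name `split_transcript` is hypothetical and is not part of Gentle.

```python
import re

# Break points, following the rule described above:
#   - any run of whitespace (the whitespace itself is dropped),
#   - right after a hyphen that sits between two letters,
#   - right after a period that is directly followed by a letter.
# The hyphen or period stays attached to the preceding word.
_BREAK = re.compile(r"\s+|(?<=[A-Za-z]-)(?=[A-Za-z])|(?<=\.)(?=[A-Za-z])")

def split_transcript(text):
    """Return the transcript as a list of word tokens, in order."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [tok for tok in _BREAK.split(text) if tok]

if __name__ == "__main__":
    for token in split_transcript("fast forward to August NINETEEN-EIGHTY-THREE"):
        print(token)
```

Whether Gentle itself tokenizes hyphenated words this way is not established here, so the output may need adjusting to line up with its word list.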

But I wish to have all of Gentle's timing information next to the words which it has matched. This will directly open up a little bit of basic search and replacing and allow me to paste the entire transcript into a subtitle and caption editor such as Aegisub in order to be able to use its GUI to manually correct any errant timing from Gentle and also create timing information for words which could not be automatically aligned.
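One possible way to put the timing next to the full word list is sketched below. It assumes the aligned words have already been read into `(word, start, end)` tuples in transcript order; that tuple layout and the name `merge_timings` are assumptions for illustration, not Gentle's actual CSV schema or API. The matching uses Python's standard `difflib.SequenceMatcher`, and words with no match keep empty timing so they can be filled in by hand later.

```python
import difflib

def _norm(word):
    # Compare words without surrounding punctuation or case differences.
    return word.strip(".,;:!?\"'()[]-").lower()

def merge_timings(transcript_words, aligned_rows):
    """Pair every transcript word with timing where an aligned row matches.

    transcript_words: list of str, every word in transcript order.
    aligned_rows: list of (word, start, end) tuples (assumed layout).
    Returns a list of (word, start, end); start/end are None if unmatched.
    """
    a = [_norm(w) for w in transcript_words]
    b = [_norm(row[0]) for row in aligned_rows]
    merged = [(w, None, None) for w in transcript_words]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = aligned_rows[block.b + k]
            merged[block.a + k] = (transcript_words[block.a + k], start, end)
    return merged

if __name__ == "__main__":
    words = ["as", "a", "bit", "about", "the", "history"]
    rows = [("as", 0.1, 0.2), ("a", 0.2, 0.3), ("about", 0.5, 0.8),
            ("the", 0.8, 0.9), ("history", 0.9, 1.4)]
    for row in merge_timings(words, rows):
        print(row)
```

The rows that come back with `None` timing are the ones that would need manual timing in a subtitle editor.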

The resulting subtitle can be pasted back into Excel (or the spreadsheet of your choice) and presumably mapped back to Gentle's HTML output file.

If you could point me to the simplest way of arriving at this end, I would be very grateful, as this will remove some roadblocks that I've been facing for the better part of a month.

strob commented 8 years ago

The behavior you're describing sounds like a bug: the CSV should include all words in the transcript. Can you verify that it's still a problem in the latest version by sending me a gentle-demo.lowerquality.com link where the CSV is missing some words from the transcript?

natelawrence commented 8 years ago

I was under the impression that the CSV and JSON only listed matched/aligned words.

Here are two of my most recently processed alignments: 95df14c8 (Verbatim for publishing) 87c8c262 (Verbatim with some homophone substitutions for increased alignment)

Both were processed approximately 48 hours ago.

Examples of words missing from the full transcripts: In 95df14c8, the fourth and fifth words in the transcript were not matched and do not appear in the CSV. In 87c8c262, the phrase "their earlier" from the sentence containing "standard practice" is not found in the CSV.

After re-examining the CSV for those alignments, it seems that in cases where there are consecutive unmatched words, sometimes only the first word(s) is/are included in the CSV.

natelawrence commented 8 years ago

Here's a third iteration of this transcript's alignment (just processed this morning): 3f101cae

Missing word examples: (The words in bold were omitted from Gentle's CSV.)

"as a bit about the history of Nintendo"
"most o' their earlier arcade games"
"career took off when he was called in"
"been interested in getting into the home console market"
"you'll start seeing a little demo of the game playing"
"like the fact that [the] little barrels don't explode"
"when you hit 'em with the, uh, hammer)"
"this is a pretty auspicious start"
"The main question here is,"
"I certainly remember seeing it quite a bit in the arcades"
"a bit more of a little story being told"
"and send him crashing into the ocean"
"fast forward to August NINETEEN-EIGHTY-THREE, for the first original titles"
"for your new console; what are you gonna' do next?"
"leading to a lot [of] frustration in the United States and Europe"
"Other [SUITES] have a series of dots"
"Royal MA JONG, from KNEE CHI BOOT SUE"
"since the advent of computer games, a certain, sort of"
"the shadows of Donkey Kong (also notable for being the first game"
"Obviously the colors are a bit more brighter"
"ported to the system, which raises the obvious question, "What were they gonna' do next?"."
"Okay, let's set the stage, here."
"It's a game that's supposed to teach"
"word in Japanese and you have to translate it into English"
"Let's [see] what actually happens."
"There's a few different GAME PLAY variations, here."
"Other than the one we just saw"
"it might be a bit difficult to guess"
"We enter the all-important month"
"Let's take a look at some, now."
"The CULL-LEE CO-VISION was the most graphically advanced"
"in no way affiliated with any sort of, like, Major League Baseball team"
"we have one that you probably hated,"
"a number, up at the top of the screen"
"not a game that you'd really enjoy playing now."
"pace of new FAMINE COM games pick up until pretty late in NINETEEN-EIGHTY-FOUR."
"Alright! So let's get January NINETEEN-EIGHTY-FOUR rolling"
"This was, in fact, the only game released"
"There's a bit of perspective in the court."
"he's this little short guy"
"go on to put him in all their golf games"
"[in] the days of the TWENTY-SIX-HUNDRED"
"I'd say it's a lot better than some of the"
"Alright, so you've done some basic board games,"
"they had been [in] arcades much longer"
"and if you shoot back up,"
"there's, like, a little hole that you can shoot"
"The other game to be released in February of NINETEEN-EIGHTY-FOUR is Wild Gunman."
"First of all, this is not exactly a port"
"I guess, uh, in the United States, at that time,"
"(well, Nintendo had been developing light gun toys"
"rather with the next game we're going to see."
"As ya' may have noticed there"
"I may [as] well admit,"
"back in the day, you probably remember"
"I'm sure a lot o' people never bought another ZAP PER game,"
"he would get upset if you did."
"Okay, and we do actually have one more light gun game coming up, in just a few minutes."
"and then press it a third time,"
"using putting, uh, but you can hit it"
"I guess it is a little weird to see"
"most of the subsequent games"
"By today's standards, the game is a little on the simple side,"
"and what targets not to hit)."
"Hogan's Alley is, once again, a light gun game."
"The object here is to, well, basically shoot the bad guys."
"This is sort of like a training simulator, here, for cops or something like that."
"These little, uh, cardboard figures slide out and you want to shoot"
"you'll notice there's a, uh, a miss counter."
"ten misses and then the game ends."
"What exactly does the name 'Hogan's Alley' mean?"
"Well, a Hogan's Alley is sort of"
"you tend to think of all those"
"and this is the first game that we've seen"
"and get 'em to land"
"It was something that was kind of, like,"
"normally put on a list of all time favorite games"
"a few of these games, such as"
"In the mean time, keep an eye out"

natelawrence commented 8 years ago

@strob if there's any other information I can provide to help you track down the cause of this bug so that you can fix it, please let me know.