lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

Option for caching result of transcription pass #115

Open JanX2 opened 7 years ago

JanX2 commented 7 years ago

I currently use Gentle to align existing transcripts to audio. Specifically, I want to align LibriVox recordings to Project Gutenberg source.

The main issue I am facing is that there are occasionally differences between the two versions. Sometimes an entire paragraph or other segment of the text is missing from the recording. My workaround is to edit the transcript to match the recording and rerun Gentle, which takes a lot of CPU and wall-clock time.

It would make these kinds of scenarios a great deal easier if the result of the audio transcription pass could be cached, since the audio doesn't change between runs.

One way of achieving this would be to add the raw Kaldi audio transcript to the ZIP/output in serialized form. This way, it can optionally be supplied together with the audio by the user.

Edit: I just realized that the results are already cached in ~/.gentle/webdata. What about hashing the file name and checking that against the cache? On a hit, hash the audio data and check whether there is an entry for the current version of the language model or Gentle, and use it if available.
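To make the two-stage lookup concrete, here is a minimal Python sketch of the idea. It is not Gentle's actual code or cache layout: the directory structure, the function names (`lookup`, `store`), and the `MODEL_VERSION` tag are all hypothetical, and the cached transcript is treated as an opaque JSON string.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Hypothetical placeholder for a language-model/Gentle version tag.
# Assumed to be safe to use inside a file name.
MODEL_VERSION = "example-model-v1"


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so long recordings are never loaded into memory at once.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _entry_path(cache_dir: Path, audio_path: Path, model_version: str) -> Path:
    # Assumed layout: <cache>/<hash of file name>/<hash of audio>.<version>.json
    name_dir = cache_dir / sha256_bytes(audio_path.name.encode("utf-8"))
    return name_dir / f"{sha256_file(audio_path)}.{model_version}.json"


def lookup(cache_dir: Path, audio_path: Path,
           model_version: str = MODEL_VERSION) -> Optional[str]:
    # Stage 1: cheap check on the file name only.
    name_dir = cache_dir / sha256_bytes(audio_path.name.encode("utf-8"))
    if not name_dir.is_dir():
        return None
    # Stage 2: hash the audio data and look for a matching model version.
    entry = _entry_path(cache_dir, audio_path, model_version)
    return entry.read_text(encoding="utf-8") if entry.is_file() else None


def store(cache_dir: Path, audio_path: Path, transcript_json: str,
          model_version: str = MODEL_VERSION) -> Path:
    entry = _entry_path(cache_dir, audio_path, model_version)
    entry.parent.mkdir(parents=True, exist_ok=True)
    entry.write_text(transcript_json, encoding="utf-8")
    return entry
```

The file-name hash only serves to skip hashing the audio when there is certainly no entry; a renamed copy of the same audio would miss the cache under this layout.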

strob commented 7 years ago

Hi! Thanks for writing. Sounds like an interesting project you're working on.

There's been some interest (#81, #99) in an API that exposes partial alignment (i.e., not only caching the audio file, but also re-running only certain time regions). The code in gentle/multipass.py shows what the basic approach would look like.

In my experience, the upload/encoding of the audio file takes negligible time compared to the alignment, so I don't think you'll get a big speed boost unless/until we implement a partial alignment API. I would gladly support changes to Gentle's API so that it can be used for a "transcription correction" interface.

natelawrence commented 3 years ago

Apologies for resurrecting this thread, but the redundant storage of the transcoded audio when iterating on a transcript is a vexing problem when working on long audio files.


I would very much like to see the concept of a library of media files/transcription projects added to Gentle, so that the transcoded media is stored once and a history of alignments can be associated with each piece of media.
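As a rough illustration of what such a library could look like on disk, here is a small Python sketch. It is purely hypothetical and not part of Gentle: the `MediaLibrary` class, its method names, and the directory layout are assumptions made for the example. Media is keyed by a hash of its content, so adding the same file again does not store a second copy.

```python
import hashlib
import json
import shutil
from pathlib import Path
from typing import List


class MediaLibrary:
    """Hypothetical store: one copy of each media file plus its alignment history."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def add_media(self, media_path: Path) -> str:
        # The content hash is the media ID, so identical files map to one entry.
        h = hashlib.sha256()
        with media_path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        media_id = h.hexdigest()
        item_dir = self.root / media_id
        (item_dir / "alignments").mkdir(parents=True, exist_ok=True)
        target = item_dir / "media"
        if not target.exists():
            shutil.copyfile(media_path, target)
        return media_id

    def add_alignment(self, media_id: str, transcript: str, alignment: dict) -> Path:
        align_dir = self.root / media_id / "alignments"
        if not align_dir.is_dir():
            raise KeyError(f"unknown media id: {media_id}")
        # Number alignments sequentially so the history keeps its order.
        index = len(list(align_dir.glob("*.json"))) + 1
        out = align_dir / f"{index:04d}.json"
        out.write_text(
            json.dumps({"transcript": transcript, "alignment": alignment}),
            encoding="utf-8",
        )
        return out

    def history(self, media_id: str) -> List[dict]:
        align_dir = self.root / media_id / "alignments"
        return [
            json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(align_dir.glob("*.json"))
        ]
```

Each edited transcript then adds only a small JSON file next to the single stored copy of the audio.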


Diverging even further from the topic of this thread, another addition would be the concept of chapters/playlists/series, so that each chapter in a book, episode in a podcast, song on an album, etc. could be ordered appropriately and linked together; when one piece finishes playing, the next would begin automatically.