lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

Increasing accuracy on Local Build #88

Open saxenauts opened 7 years ago

saxenauts commented 7 years ago

It is understandable that on my local machine (4 GB RAM) the alignment timestamps show some jitter. There's an offset of 0.02 - 0.04 seconds between the gentle server and my local build. Compare the CSV generated on the gentle server https://www.dropbox.com/s/03j3mpiiuaij5nq/align_gentle.csv?dl=0 with the CSV generated on my local build https://www.dropbox.com/s/h1o66p79lgnl8ru/align_local.csv?dl=0.

An example with 20 - 40 ms offset.

Gentle : because 37.74 37.88
Local : because 37.72 37.88

Another example, with an offset of about 1.6 seconds

Gentle : they're 22.26 22.46
Local : they're 20.66 20.88
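To quantify the gap instead of eyeballing individual rows, the two CSVs can be compared word by word. The following is a minimal sketch, assuming each row starts with the word and ends with the start and end times in seconds (as in the examples above), that there is no header row, and that both files contain the same words in the same order. `load_alignment` and `offsets` are hypothetical helper names, not part of Gentle.

```python
import csv
import io


def load_alignment(csv_text):
    """Parse alignment rows into (word, start, end) tuples, times in seconds.

    Assumption: the first field is the word and the last two fields are the
    start and end times. Rows missing either timestamp are skipped.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3 or not row[-2] or not row[-1]:
            continue
        rows.append((row[0], float(row[-2]), float(row[-1])))
    return rows


def offsets(reference, local):
    """Pair rows by position and return (word, start_delta, end_delta).

    Deltas are reference minus local, rounded to the millisecond.
    """
    result = []
    for (w_ref, s_ref, e_ref), (w_loc, s_loc, e_loc) in zip(reference, local):
        if w_ref != w_loc:
            raise ValueError(f"word mismatch: {w_ref!r} vs {w_loc!r}")
        result.append((w_ref, round(s_ref - s_loc, 3), round(e_ref - e_loc, 3)))
    return result


# The two rows quoted above, written as CSV text.
server = load_alignment("because,37.74,37.88\nthey're,22.26,22.46\n")
local = load_alignment("because,37.72,37.88\nthey're,20.66,20.88\n")
for word, d_start, d_end in offsets(server, local):
    print(word, d_start, d_end)
```

Sorting the result by absolute start delta would surface the words where the two runs disagree the most, which are probably the ones worth inspecting first.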

I am sorry for the naive requests that follow; I have just started exploring Kaldi as a tool and have no prior experience with ASR systems.

What can I do to increase the accuracy on my local build?

I need these timestamps for a research project. Specifically, I need to segment the audio on the basis of word boundaries. Gentle was the best available tool from a developer's perspective, as I am not even a beginner in ASR and similar tools.
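For the segmentation step itself, a word's start and end times map directly onto frame positions in the audio. A minimal sketch using only Python's standard `wave` module, assuming uncompressed PCM WAV input and times in seconds; `cut_word` and `make_silence` are hypothetical helpers for illustration, not part of Gentle.

```python
import io
import wave


def cut_word(wav_bytes, start, end):
    """Return a new WAV (as bytes) holding only the audio from start to end."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(round(start * rate))  # first frame of the word
        last = int(round(end * rate))     # frame just past the word
        src.setpos(first)
        frames = src.readframes(last - first)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)             # frame count is corrected on close
        dst.writeframes(frames)
    return out.getvalue()


def make_silence(seconds, rate=8000):
    """Build a mono 16-bit WAV of silence, used here only as test input."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return out.getvalue()


# Cut a quarter-second "word" out of one second of audio.
clip = cut_word(make_silence(1.0), 0.25, 0.5)
```

Because the cut points come straight from the alignment, an offset of 20 ms in a timestamp moves the boundary by the same 20 ms of audio, so the alignment accuracy bounds the segmentation accuracy.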

I believe that if I rent an Amazon instance, this will not be a problem, but they are quite expensive. Also, can anyone point me to another language model that might work better for English?

Meanwhile, I will dive into the code to understand it better.

Thanks

strob commented 7 years ago

Are you running the code locally from source, or from a DMG release? Mac or Linux?

On Fri, Jul 29, 2016 at 10:29 AM Utkarsh Saxena notifications@github.com wrote:

It is understandable that on my local machine (4 GB RAM) the alignment timestamps show some jitter. There's an offset of 0.02 - 0.04 seconds between the gentle server and my local build. Compare the CSV generated on the gentle server https://www.dropbox.com/s/03j3mpiiuaij5nq/align_gentle.csv?dl=0 with the CSV generated on my local build https://www.dropbox.com/s/h1o66p79lgnl8ru/align_local.csv?dl=0.

An example with 20 - 40 ms offset.

Gentle : because 37.74 37.88
Local : because 37.72 37.88

Another example, with an offset of about 1.6 seconds

Gentle : they're 22.26 22.46
Local : they're 20.66 20.88

I am sorry for the naive requests that follow; I have just started exploring Kaldi as a tool and have no prior experience with ASR systems.

I need these timestamps for a research project. Specifically, I need to segment the audio on the basis of word boundaries. Gentle was the best available tool from a developer's perspective, as I am not even a beginner in ASR and similar tools.

I believe that if I rent an Amazon instance, this will not be a problem, but they are quite expensive. Also, can anyone point me to another language model that might work better for English?

Meanwhile, I will dive into the code to understand it better.

Thanks


saxenauts commented 7 years ago

Linux, and yes, running locally from source.

saxenauts commented 7 years ago

@strob : ping!

I tried to understand the gentle codebase, and a lot of things make sense now. But the quality of the alignments done locally still doesn't match the alignments done on the gentle server. I still don't know whether building Kaldi with CUDA enabled will help. Can you guide me on this? My primary objective is to create a pronunciation database for words. Is there anything else that can be done? Perhaps a 4-gram or a 5-gram model?

strob commented 7 years ago

There's no reason I can think of that alignment would be more accurate on the server. Please make sure you're using the latest version of Gentle and have compiled all other dependencies as instructed.

On Sun, Aug 14, 2016, 5:30 PM Utkarsh Saxena notifications@github.com wrote:

@strob https://github.com/strob : ping!

I tried to understand the gentle codebase, and a lot of things make sense now. But the quality of the alignments done locally still doesn't match the alignments done on the gentle server. I still don't know whether building Kaldi with CUDA enabled will help. Can you guide me on this? My primary objective is to create a pronunciation database for words. Is there anything else that can be done? Perhaps a 4-gram or a 5-gram model?
