kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Multi-database English LVCSR recipe #699

Closed vijayaditya closed 8 years ago

vijayaditya commented 8 years ago

We would like to design a recipe which combines the following tasks for acoustic model (AM) and language model (LM) training:

  1. Fisher+SWBD (2100 hours)
  2. TED-LIUM (120 hr or 200 hr)
  3. LibriSpeech (1000 hr)
  4. AMI (100 hr × 8 distant microphones + 100 hr close-talk microphone = 900 hr)
  5. WSJ (80 hr)

This is an advanced task that requires experience with the data preparation stage, lexicon creation, and LM training.

The task involves evaluating the system on the Hub5 2000 eval set, the RT03 eval set, the LibriSpeech test set, the AMI (SDM/MDM/IHM) eval sets, and the TED-LIUM test set.

The AM will be built using the chain (lattice-free MMI) objective function. The LM will be built using the RNN-LM toolkit.

sikoried commented 8 years ago

@guoguo12 and I will look into this in the coming weeks :-)

vijayaditya commented 8 years ago

@guoguo12 @sikoried Could you please provide timely updates on your progress? This would help us schedule the other projects that rely on this recipe.

sikoried commented 8 years ago

We've started working on it, but Allen has finals the upcoming week. Is end of summer still ok, or should we expedite a bit?

vijayaditya commented 8 years ago

Nothing urgent. Our previous plan to get the recipe in shape by mid-August still works for us. I just wanted to make sure you have everything you need from our end.

guoguo12 commented 8 years ago

Yep, we're all set. We'll provide updates here as the project proceeds.

My working branch is guoguo12:multi-recipe. There's not much to see there right now. If you'd like, I can make a "WIP" pull request.

vijayaditya commented 8 years ago

@guoguo12 Thanks. A WIP PR is always desirable.

guoguo12 commented 8 years ago

@vijayaditya: When using existing data prep scripts, e.g. egs/fisher_swbd/s5/local/swbd1_data_prep.sh, should we 1) symlink/reference the script from our recipe or 2) make a copy of the script in our recipe's directory?

@danpovey: Would like your input on this as well.

danpovey commented 8 years ago

Regarding copying data-prep scripts vs. linking them... I'm not sure. Maybe copy them; linking things within local/ isn't really the normal pattern.

Dan


guoguo12 commented 8 years ago

Okay, thanks. The downside is that the copies may need to be manually synced if the originals are changed. That said, @sikoried thought of another solution (Option #3): Allow the user to specify (by commit hash) what version of each script to use, then automatically pull those versions from GitHub using sparse checkout.
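For reference, a rough sketch of what that could look like with a modern git (the git sparse-checkout command postdates this thread; the path and commit hash are placeholders, not a worked-out proposal):

    # hypothetical: fetch a single upstream script pinned to a known commit
    git clone --no-checkout https://github.com/kaldi-asr/kaldi.git kaldi-pinned
    cd kaldi-pinned
    git sparse-checkout set egs/fisher_swbd/s5/local
    git checkout <commit-hash>    # the revision the recipe was developed against
    cp egs/fisher_swbd/s5/local/swbd1_data_prep.sh ../local/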

danpovey commented 8 years ago

I think that's too complicated. Bear in mind that if the upstream scripts get changed, they may be changed in ways that are incompatible with the recipe you develop. So it may be safer to force the manual syncing. Dan

On Fri, May 20, 2016 at 8:06 PM, Allen Guo notifications@github.com wrote:

Okay, thanks. The downside is that the copies may need to be manually synced if the originals are changed. That said, @sikoried https://github.com/sikoried thought of another solution (Option #3 https://github.com/kaldi-asr/kaldi/pull/3): Allow the user to specify (by commit hash) what version of each script to use, then automatically pull those versions from GitHub using sparse checkout.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/kaldi-asr/kaldi/issues/699#issuecomment-220745342

guoguo12 commented 8 years ago

Okay, we'll stick with copying. Thanks!

vijayaditya commented 8 years ago

It is always good to mention in a comment at the top of the script where it was copied from (along with the commit id) and what was changed compared to the original script.

We are trying to do this in the new nnet3 scripts in local/. It would be good to follow this for data prep scripts too.
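For example, the header of a copied script could look something like this (a hypothetical illustration; the path and commit id are placeholders):

    #!/usr/bin/env bash
    # Copied from egs/fisher_swbd/s5/local/swbd1_data_prep.sh (commit 0123abc).
    # Changes: output is written under data/multi_a/ instead of data/, and the
    # corpus location is taken from a command-line argument.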

Vijay

guoguo12 commented 8 years ago

Status update: I've finished copying and integrating the data prep scripts for the various corpora, so the data directories should be ready to be combined now. I've also normalized the transcripts and created a lexicon by combining the lexicons from fisher_swbd and tedlium. Next, I'll look into generating pronunciations for LibriSpeech words that are missing from the lexicon using Sequitur G2P and adding those to the lexicon as well.
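For the merge itself, a minimal sketch, assuming both source lexicons are already in the usual "word pron" format with a consistent phone set (the paths are hypothetical):

    # concatenate the per-corpus lexicons, lowercase the word field, and keep
    # unique (word, pronunciation) pairs
    cat data/local/dict_fisher_swbd/lexicon.txt \
        data/local/dict_tedlium/lexicon.txt | \
      awk '{ $1 = tolower($1); print }' | \
      sort -u > data/local/dict_multi/lexicon.txt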

danpovey commented 8 years ago

I think it might be a better idea to use CMUDict as the upstream lexicon for the combined data, maybe using the Switchboard (MSU) and Cantab lexicons, suitably mapped, for some unseen words. Of course you can try both ways.

Recently we tried a combined setup with Librispeech and Switchboard, and found the pronunciations obtained by g2p after training on the Switchboard lexicon were worse than the CMUDict-derived pronunciations.

This will require some mapping. I believe the MSU pronunciations are not quite the same as what you get from removing stress from CMUDict. Samuel mentioned that one of the phones is spelled differently. It looks to me like the Cantab lexicon is an extension of CMUDict (after stress removal).
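The stress-removal and case-normalization step can be sketched roughly as follows (assuming the usual "WORD PH1 PH2 ..." layout; comment lines and alternate-pronunciation markers like WORD(2) would still need separate handling):

    # drop cmudict comment lines, strip the numeric stress markers from the
    # phones (AY1 -> AY), and lowercase both words and phones
    grep -v '^;;;' cmudict.dict | \
      awk '{ printf("%s", tolower($1));
             for (i = 2; i <= NF; i++) { p = $i; gsub(/[0-9]/, "", p); printf(" %s", tolower(p)) }
             printf("\n") }' > cmudict_nostress_lc.txt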


guoguo12 commented 8 years ago

So basically, 1) train a G2P model using CMUDict (after stress removal), 2) synthesize pronunciations for all words across all databases that are not in CMUDict, and 3) combine into a single lexicon?

danpovey commented 8 years ago

Something like that, but it might possibly be better to use the prons in the cantab dictionary and in the MSU dictionary before you go to g2p, so only use g2p as a last resort. Before you do that, though, you'd need to work out how to map MSU pronunciations to a CMUDict-like format.
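A rough outline of that priority order, with g2p only filling whatever remains (file names are placeholders; the MSU-to-CMUDict phone mapping is assumed to be done separately; and Sequitur training is shown as a single pass for brevity, though it is normally ramped up over several passes):

    # words in the combined training vocabulary not covered by the
    # stress-removed CMUDict
    awk 'NR==FNR { seen[$1] = 1; next } !($1 in seen)' \
      cmudict_nostress_lc.txt all_train_words.txt > oov_after_cmudict.txt

    # ... look the remaining words up in the Cantab and (mapped) MSU lexicons,
    # removing whatever they cover, leaving oov_final.txt ...

    # train a Sequitur model on the stress-removed CMUDict and apply it to
    # whatever is still uncovered
    g2p.py --train cmudict_nostress_lc.txt --devel 5% --write-model g2p.model
    g2p.py --model g2p.model --apply oov_final.txt > oov_final.lexicon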


guoguo12 commented 8 years ago

Okay, I'll give it a shot. Thanks for the feedback!

drTonyRobinson commented 8 years ago

On 26/05/16 02:40, Daniel Povey wrote:

It looks to me like the Cantab lexicon is an extension of CMUDict (after stress removal).

Just for clarity, this is correct.

Our main objective was to add a decent LM to TED-LIUM. That needed pronunciations to run, and IIRC Kaldi wasn't using g2p at the time.

Tony


vince62s commented 8 years ago

Hi. For users who don't have 1) the computing resources or 2) access to some of the corpora, I would suggest offering the choice of which sources to include in a preliminary step, unless that is outside the goal of this project.

guoguo12 commented 8 years ago

@vince62s: Yep, @sikoried and I are taking that into consideration. Our current plan is to make the training steps as generic as possible. Our proposed data directory structure is:

data/
  multi_a/
    train_s1/  # data directory for stage 1 of training
    train_s2/  # data directory for stage 2 of training
    train_s3/  # data directory for stage 3 of training, etc.
  ...

Each train_s* directory will be configurable in terms of what corpora to include, what ratios of those corpora to use, and perhaps even what model parameters to use. We will provide an example build script that generates multi_a using the "suggested" approach (e.g. WSJ for stages 1-3, then WSJ+Fisher+SWBD for stage 4, etc.). But the key is that, by modifying the build script, you'll be able to build your own multi_b directory that has the same structure as multi_a but different combinations of corpora at each step. And then we'll make the training steps independent of what corpora are being used: they will simply use multi_b instead of multi_a, as long as you specify to do so in run.sh.

This approach will make it easy to omit corpora if you don't have them or add corpora if you have additional data. It will also make it easy to evaluate how different combinations of data affect the results, e.g. "Does using 50% WSJ and 50% SWBD for the initial monophone model work better than using just WSJ?"

We think this solution strikes a good balance between hard-coding too much and over-automating too much. Suggestions are appreciated!
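As a concrete illustration of the build-script idea, a fragment might look like this (a sketch only; the per-corpus data directory names are assumptions; utils/copy_data_dir.sh and utils/combine_data.sh are the standard Kaldi helpers for copying and merging data directories):

    # hypothetical fragment of the multi_a build script: stage 1 uses WSJ only,
    # stage 4 adds Fisher+SWBD
    utils/copy_data_dir.sh data/wsj/train_si284 data/multi_a/train_s1
    utils/combine_data.sh data/multi_a/train_s4 \
      data/wsj/train_si284 data/fisher_swbd/train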

vijayaditya commented 8 years ago

Writing the data prep scripts to handle multiple corpora is good, but keep in mind that things like model size (number of layers, size of layers, number of leaves, ...), number of epochs of training, and many other hyper-parameters are usually adjusted based on data size. So it would be a bit tough to make these configurable while keeping the scripts simple to read. The scripts we keep in local/ are preferably hard-coded, as they are provided as example scripts to be copied and modified, not as callable scripts.

So I would recommend the following: write all your data prep scripts so that they can handle any combination of databases, and provide multiple run_*.sh scripts, each of which operates on a particular database combination and has all the hyper-parameters tuned for that particular data size.
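For instance (purely hypothetical numbers, just to show the intent of one tuned run script per combination):

    # run_multi_all.sh -- fisher_swbd + tedlium + librispeech + ami + wsj
    num_leaves=11500
    num_epochs=4

    # run_swbd_tedlium.sh -- a much smaller combination, tuned separately
    num_leaves=7000
    num_epochs=6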

--Vijay


jtrmal commented 8 years ago

Guys, I don't think tuning it for every possible combination is a good use of our and Allen and Korbinian's time. I'd say let's just make the scripts reasonably general and set up the infrastructure for the original corpora as outlined by Vijay. I don't actually think it will be easy to make it work for all of them right away. We can then distribute the models through OpenSLR if there is enough interest (so people can use the models even without having bought the corpora) -- but let's not try to cater to everyone's particular selection of corpora; that's not our mission. y.


sikoried commented 8 years ago

Yeah, we didn't intend to optimize parameters for more than one combination, but rather to architect the scripts so that it's fairly easy for someone else to make modifications. Maybe to give a negative example: right now in other recipes, it's fairly scattered and hard-coded which partitions and params (states, Gaussians, jobs, and the partitions themselves) are used for the bootstrap. If you want to change something, you need to go carefully through a long script and not miss a thing. With the abstraction of partitions for stages and a more tidy, organized bootstrap training script, it'll be easy for someone else to change the list of included datasets and make the corresponding param changes.

Korbinian.


jtrmal commented 8 years ago

Yeah, Ok, but do not over-engineer it. Y

danpovey commented 8 years ago

Agreed re: not over-engineering it. Scripts with millions of options don't generally make it easier to modify things; they make it harder. Dan


sikoried commented 8 years ago

Sure guys, no worries ;-) It's all way simpler than you may be picturing!

danpovey commented 8 years ago

I think uppercase vs. lowercase doesn't matter; just choose one. AY vs. AY1/AY2 is a question of whether you retain the numeric stress markers from CMUDict. You may want to experiment with both options, except that if the recipe will require the MSU dictionary to help cover the Switchboard data, then you may have to omit stress, because the MSU dictionary doesn't have it. What I am more concerned about is that there may be other incompatibilities specific to some phones; e.g., I was told by @xiaohui-zhang that the MSU dictionary spells a particular phone differently, IIRC. After doing the initial conversion for case and stress removal, just look for words that have different prons in the two dictionaries; you may find it.
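One quick way to surface such differences, assuming both lexicons have already been normalized to lowercase, stress-free "word phone phone ..." lines with one pronunciation per word and consistent whitespace (the file names are placeholders):

    # print words that appear in both dictionaries but with different
    # pronunciations
    awk 'NR==FNR { pron[$1] = $0; next }
         ($1 in pron) && pron[$1] != $0 { print $1 }' msu_mapped.txt cmudict_nostress_lc.txt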

On Thu, Jun 2, 2016 at 8:04 AM, vince62s notifications@github.com wrote:

@guoguo12 not trying to be a pain :), but looking forward, so asking another question re: lexicon and LM. I saw Dan's comment on mapping various lexicons [as a matter of fact, even LibriSpeech and TED-LIUM, which I think are both derived from CMUdict, do not have the same phonemes, e.g. AY vs. AY1/AY2], some of them being uppercase and others lowercase (same for the LMs). Is there one specific choice on all of this?

guoguo12 commented 8 years ago

I've been using lowercase phones without stress markers. I didn't use the MSU dictionary. It seems that fewer than 2% of the words (tokens, not unique words) used in the Switchboard transcripts are absent from CMUdict. Many of these are partial words with hyphens, like "th-" or "tha-".
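That kind of figure can be reproduced along these lines (file names are placeholders; the transcripts are assumed to be in the usual Kaldi "utt-id word word ..." text format):

    # count Switchboard transcript tokens that are missing from the lexicon
    cut -d' ' -f2- data/swbd/train/text | tr ' ' '\n' | sed '/^$/d' > tokens.txt
    awk 'NR==FNR { lex[$1] = 1; next } !($1 in lex)' \
      cmudict_nostress_lc.txt tokens.txt | wc -l
    wc -l < tokens.txt    # divide the two counts to get the token OOV rate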

vijayaditya commented 8 years ago

addressed in #771

vince62s commented 8 years ago

@vijayaditya Has anyone actually run the nnet3 and/or chain recipes on this multi-database setup?

vijayaditya commented 8 years ago

Not yet.

Vijay


migueljette commented 7 years ago

I'm curious, one year later: has anybody run this through the nnet3 or chain models, @guoguo12 or @sikoried?

guoguo12 commented 7 years ago

Sorry, I haven't.

migueljette commented 7 years ago

Ok, no worries. Thanks for the quick reply!

galv commented 7 years ago

Hey Allen, long time no talk.

I'd personally be curious if anyone tries nnet3 or chain on this recipe. I get the feeling there's quite a bit of untapped potential here.


-- Daniel Galvez

danpovey commented 7 years ago

Some of us at Hopkins are working on this recipe, and I think we're going to run the chain recipe soon. I'll try to check in the relevant changes soon.


ananthnagaraj commented 6 years ago

Hi, just wanted to check whether nnet3 or chain has been run on this recipe.

xiaohui-zhang commented 6 years ago

I'm running experiments and will hopefully commit the recipe soon.


viju2008 commented 5 years ago

I want to combine LibriSpeech, TED-LIUM, and Common Voice. Can modifying the script help me achieve this?

In fact, it would be good if the creators of this recipe could provide it.

JiayiFu commented 4 years ago

Sorry for commenting on this out-of-date issue. I just want to check whether there are any updated results for the chain recipe. @viju2008 @xiaohui-zhang, did you run any experiments?