Open akbar20gh opened 6 years ago
It depends. You can start from any of the existing data preparation scripts. Look at the local/prepare_data.sh scripts in the individual recipes and see which one you can most easily adapt to your own data. Without more information, it's impossible to predict which one will be the easiest to adapt.
Thanks. If I use the data and lang prepared for Kaldi, will it work? What are the differences between data preparation in Kaldi and in Eesen?
And my second question is: after data preparation, which egs is useful for running the code (training and decoding)?
No ideas about data preparation? The data preparation in every egs recipe is local, not general.
You're right, there is no general format that data sets must conform to; for each data format, there are local, non-general preparation steps. Switchboard, WSJ, Tedlium are all different. For your data set, some adaptation of data preparation from existing examples is required, unless it happens to be (or can be processed into being) in the format of one of the examples. However: a generalized example would make a nice addition to Eesen!
Just a few things that may differ between your data and the examples include:
As Florian mentioned earlier, without more information (such as above) about your data, we cannot give much additional help. These steps in Tedlium, for example, do data prep:
# Use the same data preparation script from Kaldi
local/tedlium_prepare_data.sh --data-dir db/TEDLIUM_release2 || exit 1
# Construct the phoneme-based lexicon
local/tedlium_prepare_phn_dict.sh || exit 1;
# Compile the lexicon and token FSTs
utils/ctc_compile_dict_token.sh data/local/dict_phn data/local/lang_phn_tmp data/lang_phn || exit 1;
# Compose the decoding graph
local/tedlium_decode_graph.sh data/lang_phn || exit 1;
In this example, the CMUDict phone set and phonetic pronunciation dictionary are used, as well as the CMUSphinx language model. All of these are included in the TEDLIUM data download.
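Since the Tedlium recipe reuses Kaldi's data preparation script, a custom data set presumably needs the same Kaldi-style files. Below is a minimal sketch of a sanity check, assuming the usual Kaldi layout (wav.scp, text and utt2spk, each keyed by utterance ID in the first field). The function name check_data_dir and the toy directory are hypothetical and are not part of Eesen or Kaldi.

```shell
#!/bin/sh
set -eu
export LC_ALL=C

# Hypothetical toy data directory, for illustration only.
data=$(mktemp -d)
printf 'utt1 /path/to/utt1.wav\nutt2 /path/to/utt2.wav\n' > "$data/wav.scp"
printf 'utt1 HELLO WORLD\nutt2 GOOD MORNING\n' > "$data/text"
printf 'utt1 spk1\nutt2 spk1\n' > "$data/utt2spk"

# check_data_dir DIR: verify that the assumed Kaldi-style files exist,
# are non-empty, and list the same utterance IDs in their first field.
check_data_dir() {
  cdd_dir=$1
  for cdd_f in wav.scp text utt2spk; do
    [ -s "$cdd_dir/$cdd_f" ] || { echo "missing or empty: $cdd_f"; return 1; }
  done
  cdd_ref=$(cut -d' ' -f1 "$cdd_dir/wav.scp" | sort)
  for cdd_f in text utt2spk; do
    cdd_ids=$(cut -d' ' -f1 "$cdd_dir/$cdd_f" | sort)
    if [ "$cdd_ref" != "$cdd_ids" ]; then
      echo "utterance IDs differ between wav.scp and $cdd_f"
      return 1
    fi
  done
  echo "ok"
}

result=$(check_data_dir "$data")
echo "$result"
```

If this check passes on your own directory, the remaining per-corpus work is mostly the dictionary and language model steps shown in the Tedlium commands.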
Thanks, all. First, I created the directory data/local/dict_char:
data/local/dict_char$ tree
.
├── lexicon.txt
├── units_nosil.txt
└── units.txt
My lexicon.txt has the format: word (space) characters.
/data/local/dict_char$ less lexicon.txt
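For reference, the three files described above could be generated from a plain word list roughly as follows. This is only a sketch under assumptions: words.txt (one word per line) is a hypothetical input, the text is ASCII, and units.txt is assumed to pair each unit with a 1-based integer index. Please check the character-dictionary script of an existing recipe for the exact format and for any special units (noise, space) it adds.

```shell
#!/bin/sh
set -eu
# Byte-wise sorting and splitting; non-ASCII text would need a
# UTF-8-aware tool instead of this awk loop.
export LC_ALL=C

# Hypothetical working directory and word list, for illustration only.
dir=$(mktemp -d)
printf '%s\n' cat act tact > "$dir/words.txt"

# lexicon.txt: each word followed by its characters, separated by spaces.
awk '{
  out = $1
  n = length($1)
  for (i = 1; i <= n; i++) out = out " " substr($1, i, 1)
  print out
}' "$dir/words.txt" | sort -u > "$dir/lexicon.txt"

# units_nosil.txt: the unique characters that occur in the lexicon.
cut -d' ' -f2- "$dir/lexicon.txt" | tr ' ' '\n' | sort -u > "$dir/units_nosil.txt"

# units.txt: each unit with a 1-based index (assumed format).
awk '{print $1 " " NR}' "$dir/units_nosil.txt" > "$dir/units.txt"

cat "$dir/lexicon.txt"
```

The resulting directory could then be passed as the first argument to utils/ctc_compile_dict_token.sh in place of data/local/dict_phn, assuming the script accepts a character dictionary in the same layout.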
Hi, I want to run Eesen on my own data set, using an LSTM-CTC network. What data preparation is needed? Which egs can I use, and which would be helpful?