srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

data preparation #177

Open akbar20gh opened 6 years ago

akbar20gh commented 6 years ago

Hi, I want to run Eesen on my own dataset with an LSTM-CTC network. What data preparation is needed? Which egs can I use as a helpful starting point?

fmetze commented 6 years ago

It depends. You can start from any of the existing data preparation scripts. Look at the local/prepare_data.sh scripts in the individual recipes and see which one you can most easily adapt to your own data - without more information, it's impossible to predict which one will be the easiest to adapt.

akbar20gh commented 6 years ago

Thanks. If I use the data and lang directories prepared for Kaldi, will it work? What are the differences between data preparation in Kaldi and Eesen?

And my second question: after data preparation, which egs is useful for running the code (train and decode)?

akbar20gh commented 6 years ago

No ideas about data preparation? All the data preparation scripts in the egs are local, not general.

riebling commented 6 years ago

You're right, there is no general format that data sets must conform to; for each data format, there are local, non-general preparation steps. Switchboard, WSJ, Tedlium are all different. For your data set, some adaptation of data preparation from existing examples is required, unless it happens to be (or can be processed into being) in the format of one of the examples. However: a generalized example would make a nice addition to Eesen!
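To make "some adaptation of data preparation" concrete: Eesen reuses Kaldi-style data directories, so whatever preparation you write ultimately needs to produce the standard Kaldi files. A minimal sketch of such a directory (the utterance and speaker IDs, and the audio path, are made-up examples):

```shell
# Sketch of a Kaldi/Eesen-style data directory (data/train).
# File names follow Kaldi conventions; IDs and paths are hypothetical.
mkdir -p data/train

# wav.scp: utterance-id -> audio (a path, or a command producing wav on stdout)
cat > data/train/wav.scp <<'EOF'
spk1_utt1 /path/to/audio/spk1_utt1.wav
EOF

# text: utterance-id -> transcript
cat > data/train/text <<'EOF'
spk1_utt1 hello world
EOF

# utt2spk: utterance-id -> speaker-id
cat > data/train/utt2spk <<'EOF'
spk1_utt1 spk1
EOF

# spk2utt is the inverse mapping, normally built with a Kaldi utility:
# utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
```

The local/prepare_data.sh script in each recipe is essentially the corpus-specific code that turns the raw download into this layout.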

Just a few things that may differ between your data and the examples include:

As Florian mentioned earlier, without more information (such as above) about your data, we cannot give much additional help. These steps in Tedlium, for example, do data prep:

  # Use the same data preparation script from Kaldi
  local/tedlium_prepare_data.sh --data-dir db/TEDLIUM_release2 || exit 1

  # Construct the phoneme-based lexicon
  local/tedlium_prepare_phn_dict.sh || exit 1;

  # Compile the lexicon and token FSTs
  utils/ctc_compile_dict_token.sh data/local/dict_phn data/local/lang_phn_tmp data/lang_phn || exit 1;

  # Compose the decoding graph
  local/tedlium_decode_graph.sh data/lang_phn || exit 1;

In this example, the CMUdict phone set and phonetic pronunciation dictionary are used, as well as the CMUSphinx language model. All are included in the TEDLIUM data download.
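For reference, here is a sketch of the inputs that the ctc_compile_dict_token.sh step consumes: a lexicon mapping words to phones, and a units symbol table pairing each CTC output unit with an integer id. The words and ids below are hypothetical examples, not taken from the TEDLIUM recipe:

```shell
mkdir -p data/local/dict_phn

# lexicon.txt: word -> space-separated phones (CMUdict-style phone set)
cat > data/local/dict_phn/lexicon.txt <<'EOF'
HELLO HH AH L OW
WORLD W ER L D
EOF

# units.txt: one CTC unit per line, paired with an integer id
cat > data/local/dict_phn/units.txt <<'EOF'
AH 1
D 2
ER 3
HH 4
L 5
OW 6
W 7
EOF
```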

akbar20gh commented 6 years ago

Thanks all. Here is what I did. First I created the directory data/local/dict_char:

    data/local/dict_char$ tree .
    ├── lexicon.txt
    ├── units_nosil.txt
    └── units.txt

My lexicon.txt maps each word to its space-separated characters:

    data/local/dict_char$ less lexicon.txt
    ]/ ] /
    ]/, ] / ,
    ]/. ] / .
    ]a,ab ] a , a b

units.txt lists the units with their numbers:

    data/local/dict_char$ less units.txt
    0 1
    2 3
    ' 4
    , 5
    . 6
    / 7

and units_nosil.txt contains the units that are not silence.

Then I gave lexicon.txt to utils/sym2int.pl to make lexicon_numbers.txt:

    utils/sym2int.pl -f 2- data/local/dict_char/units.txt \
        < data/local/dict_char/lexicon.txt > data/local/dict_char/lexicon_numbers.txt

Then I made lang_char by giving data/local/dict_char to utils/ctc_compile_dict_token.sh:

    utils/ctc_compile_dict_token.sh --dict-type "char" --space-char "" \
        data/local/dict_char data/local/lang_char_tmp data/lang_char

Next I created the directory data/local/nist_lm, put a language model in ARPA format ("lm.arpa.gz") in it, and adapted local/decode_graph.sh to run on data/lang_char and make TLG.fst. I think that is all of the data preparation - maybe!
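The character-level dict files described above can also be generated mechanically from a word list. A rough sketch, assuming a hypothetical words.txt (one word per line, in practice derived from your transcripts) and ignoring the silence and space-character handling that units_nosil.txt and the --space-char option cover:

```shell
# Sketch: derive character-level dict files from a word list.
# words.txt is a hypothetical input, not part of the Eesen recipes.
mkdir -p data/local/dict_char

cat > words.txt <<'EOF'
hello
world
EOF

# lexicon.txt: each word followed by its space-separated characters
while read -r w; do
  spelled=$(printf '%s' "$w" | sed 's/./& /g; s/ $//')
  printf '%s %s\n' "$w" "$spelled"
done < words.txt > data/local/dict_char/lexicon.txt

# units.txt: the unique characters, numbered from 1
cut -d' ' -f2- data/local/dict_char/lexicon.txt | tr ' ' '\n' |
  sort -u | awk '{ print $1, NR }' > data/local/dict_char/units.txt
```

From there, utils/sym2int.pl and utils/ctc_compile_dict_token.sh can be run exactly as in the commands quoted earlier.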