srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

tensorflow example? #138

Open jinserk opened 7 years ago

jinserk commented 7 years ago

Hi! I've found that eesen has a tensorflow branch, and I guess that it uses tensorflow's RNN and CTC routine. So I want to test it, but have no idea what the "data_dir" has to contain. I want to use tedlium corpus for the test. Could you give me some ideas for that? Thank you!

riebling commented 7 years ago

Would that be the data_dir referred to in the data preparation instructions? (for example local/tedlium_prepare_data.sh)

```shell
#            2014 Brno University of Technology (Author: Karel Vesely)
# Apache 2.0

# To be run from one directory above this script.

data_dir=db/TEDLIUM_release1
;; LABEL "female" "Female" "Female Talkers"
;;'
    # Process the STMs
cat ${data_dir}/$set/stm/*.stm | sort -k1,1 -k2,2 -k4,4n | \
```

If so, this refers to the audio and transcription data used for both training and testing. The data gets downloaded by the previous script local/tedlium_download_data.sh

jinserk commented 7 years ago

Thank you for the quick response. But when I execute main.py in tf/ctc-train, it requires files like labels.tr or labels.cv, and the path you mentioned contains only a link to the original corpus. The default data_dir in main.py is /data/ASR5/fmetze/eesen/asr_egs/swbd/v1/tmp.LHhAHROFia/T22/, and I have no idea what that tmp directory should contain for running main.py.

riebling commented 7 years ago

Ok thought I'd try :) Not familiar with the TF code but I'm sure those who are can help

jinserk commented 7 years ago

Thanks @riebling ! I'm trying to copy labels.tr.gz and labels.cv.gz from exp/train_phn_l5_c320 (which were produced by the original training process) and gunzip them. I'm not sure it's the right way, but I will try. :)
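For anyone following along, here is a small sketch for sanity-checking such label files after gunzipping. It assumes each line is an utterance ID followed by integer CTC target IDs; that format is my guess, not something documented in this thread, so verify it against your actual files.

```python
import gzip

def read_labels(path):
    """Parse a labels file into {utt_id: [token ids]}.

    Assumption: each line looks like "<utt_id> <id> <id> ...". The real
    format used by tf/ctc-train may differ; check your own labels.tr.
    """
    opener = gzip.open if path.endswith(".gz") else open
    labels = {}
    with opener(path, "rt") as f:
        for line in f:
            parts = line.split()
            if parts:
                labels[parts[0]] = [int(t) for t in parts[1:]]
    return labels
```

If the parse fails with a ValueError, the file likely uses a different layout than assumed here.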

fmetze commented 7 years ago

Jinserk,

we'd love nothing more than for you to test the TF branch as well. Yes, it uses TF's LSTM and CTC implementations. In theory, everything is there but most of it is not cleaned up and not in its final place yet. There is a train_ctc_tf.sh in wsj/steps which you can use instead of the regular call to train_ctc_parallel.sh in the standard training script. This script calls "python -m main", so you need to set your PYTHONPATH to point to ~/eesen/tf/ctc-train (where main.py is located).
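Concretely, the setup described above might look like the following; the paths are illustrative assumptions about a typical checkout, not prescriptive:

```shell
# Sketch, assuming eesen is checked out in $HOME/eesen; adjust to your layout.
export PYTHONPATH="$HOME/eesen/tf/ctc-train:$PYTHONPATH"

# In the wsj recipe, swap the regular training call for the TF one:
#   steps/train_ctc_parallel.sh ...  ->  steps/train_ctc_tf.sh ...
# train_ctc_tf.sh then invokes "python -m main", which Python resolves
# via the PYTHONPATH set above.
```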

Which corpus are you working on?

ZhyiXu commented 6 years ago

I have the same problem as you did. I try to run the tf/rnn_lm code but I cannot find the data files below:

train_fil='./data/turkish_train_text'
dev_fil='./data/turkish_dev_text'
lex_fil='./data/lexicon_char_system.txt'
units_fil='./data/units_char_system.txt'

There are no such data files anywhere in the project.
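Those files are corpus-specific inputs you have to create yourself. As an illustration for one of them, here is a sketch of generating a character units file from a training text. The "&lt;char&gt; &lt;id&gt;" format is an assumption about what units_char_system.txt contains, so check it against the tf/rnn_lm code before relying on it.

```python
def build_char_units(train_text_path, units_path):
    """Write one "<char> <id>" line per distinct character in the text.

    Assumption: units_char_system.txt maps each modeling unit to an
    integer ID; the exact format expected by tf/rnn_lm may differ.
    """
    chars = set()
    with open(train_text_path, encoding="utf-8") as f:
        for line in f:
            chars.update(line.strip().replace(" ", ""))
    with open(units_path, "w", encoding="utf-8") as out:
        for idx, ch in enumerate(sorted(chars), start=1):
            out.write(f"{ch} {idx}\n")
```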

fmetze commented 6 years ago

Sorry all - we have not released a full recipe for this yet. We will probably have one on the Babel corpus very soon, and will be able to release it. From there, it should be easy to port it to other recipes. The main missing piece is the training of the neural network language model (which you need before you can test it).