srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Implementing RNN training on TIMIT database #128

Open razor1179 opened 7 years ago

razor1179 commented 7 years ago

Following the guidelines provided in another issue (https://github.com/srvk/eesen/issues/59), I generated the TLG.fst. The tokens.txt for T.fst has the following contents:

<eps> 0
<blk> 1
aa 2
ae 3
ah 4
ao 5
aw 6
ax 7
ay 8
b 9
ch 10
cl 11
d 12
dh 13
dx 14
eh 15
el 16
en 17
epi 18
er 19
ey 20
f 21
g 22
hh 23
ih 24
ix 25
iy 26
jh 27
k 28
l 29
m 30
n 31
ng 32
ow 33
oy 34
p 35
r 36
s 37
sh 38
sil 39
t 40
th 41
uh 42
uw 43
v 44
vcl 45
w 46
y 47
z 48
zh 49
#0 50
#1 51
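
For reference, a tokens.txt like this is typically turned into T.fst along the lines of EESEN's utils/ctc_compile_dict_token.sh; the sketch below assumes $dir holds tokens.txt and that the helper script is on the path:

# Build the CTC token FST T.fst from tokens.txt (sketch, not the exact recipe call).
utils/ctc_token_fst.py $dir/tokens.txt | \
  fstcompile --isymbols=$dir/tokens.txt --osymbols=$dir/tokens.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstarcsort --sort_type=olabel > $dir/T.fst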

The L.fst is built from tokens.txt, words.txt and lexiconp_disambig.txt. The words.txt contains:

<eps> 0
aa 1
ae 2
ah 3
ao 4
aw 5
ax 6
ay 7
b 8
ch 9
cl 10
d 11
dh 12
dx 13
eh 14
el 15
en 16
epi 17
er 18
ey 19
f 20
g 21
hh 22
ih 23
ix 24
iy 25
jh 26
k 27
l 28
m 29
n 30
ng 31
ow 32
oy 33
p 34
r 35
s 36
sh 37
sil 38
t 39
th 40
uh 41
uw 42
v 43
vcl 44
w 45
y 46
z 47
zh 48
#0 49

and lexiconp_disambig.txt contains:

aa  1.0 aa
ae  1.0 ae
ah  1.0 ah
ao  1.0 ao
aw  1.0 aw
ax  1.0 ax
ay  1.0 ay
b   1.0 b
ch  1.0 ch
cl  1.0 cl
d   1.0 d
dh  1.0 dh
dx  1.0 dx
eh  1.0 eh
el  1.0 el
en  1.0 en
epi 1.0 epi
er  1.0 er
ey  1.0 ey
f   1.0 f
g   1.0 g
hh  1.0 hh
ih  1.0 ih
ix  1.0 ix
iy  1.0 iy
jh  1.0 jh
k   1.0 k
l   1.0 l
m   1.0 m
n   1.0 n
ng  1.0 ng
ow  1.0 ow
oy  1.0 oy
p   1.0 p
r   1.0 r
s   1.0 s
sh  1.0 sh
sil 1.0 sil
t   1.0 t
th  1.0 th
uh  1.0 uh
uw  1.0 uw
v   1.0 v
vcl 1.0 vcl
w   1.0 w
y   1.0 y
z   1.0 z
zh  1.0 zh
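
For reference, L.fst is typically compiled from these three files roughly as in EESEN's utils/ctc_compile_dict_token.sh; in the sketch below, $dir and the way the #0 IDs are looked up are assumptions:

# Compile the lexicon FST L.fst (sketch; the recipe's exact options may differ).
token_disambig=$(awk '$1=="#0"{print $2}' $dir/tokens.txt)   # 50 in the tables above
word_disambig=$(awk '$1=="#0"{print $2}' $dir/words.txt)     # 49 in the tables above
utils/make_lexicon_fst.pl --pron-probs $dir/lexiconp_disambig.txt | \
  fstcompile --isymbols=$dir/tokens.txt --osymbols=$dir/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstaddselfloops "echo $token_disambig |" "echo $word_disambig |" | \
  fstarcsort --sort_type=olabel > $dir/L.fst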

Finally, using lm_phone_bg.arpa.gz and words.txt, I build G.fst and compose the final TLG.fst.
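
A sketch of that last step in the style of the older EESEN/Kaldi graph scripts (the helper scripts, $dir, and the exact composition pipeline are assumptions here):

# Build G.fst from the ARPA LM (sketch).
gunzip -c lm_phone_bg.arpa.gz | arpa2fst - | fstprint | \
  utils/eps2disambig.pl | utils/s2eps.pl | \
  fstcompile --isymbols=$dir/words.txt --osymbols=$dir/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon > $dir/G.fst

# Compose T o (L o G) into the decoding graph TLG.fst (sketch).
fsttablecompose $dir/L.fst $dir/G.fst | fstdeterminizestar --use-log=true | \
  fstminimizeencoded | fstarcsort --sort_type=ilabel > $dir/LG.fst
fsttablecompose $dir/T.fst $dir/LG.fst > $dir/TLG.fst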

Now my question is: how many labels should I provide to the FST while decoding? There are 48 phonemes including silence (sil), and since the text file contains no blanks, I can only get priors for those 48 phonemes. Please let me know, as I get a PER of 102% when I feed 48 probabilities to the decoding graph.

fmetze commented 7 years ago

You will need to provide 48 phones during decoding, I believe, which means 49 symbols including blank. You should be able to generate the priors using the standard call in the training script:

# Compute the occurrence counts of labels in the label sequences. These counts will be used to
# derive prior probabilities of the labels.
gunzip -c $dir/labels.tr.gz | awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  analyze-counts --verbose=1 --binary=false ark:- $dir/label.counts \
  >& $dir/log/compute_label_counts.log || exit 1

If this generates a count of 1 for some phones (sil?), that is ok. If this produces 0 or 1 for blank, because each sentence only contains one symbol, then you should manually edit the blank count to be the same as the sum count for all the other symbols.
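
A sketch of that manual edit, assuming label.counts is a single bracketed line of counts and that the blank count belongs in the first position (both assumptions about the file layout):

# Prepend a blank count equal to the sum of all the other counts (sketch).
awk '{s=0; for(i=2;i<NF;i++) s+=$i;
      printf("[ %d", s);
      for(i=2;i<NF;i++) printf(" %s", $i);
      print " ]"}' $dir/label.counts > $dir/label.counts.fixed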


razor1179 commented 7 years ago

@fmetze, I have given an example of the TIMIT data used to obtain labels below. This is for one utterance:

faem0_sx312 sil dh ow z ae n s er z w el vcl b iy s cl t r ey f ao w er dx ix f y uw th iy ng cl k dh ix m th r uw cl k eh r f el iy f er s cl t sil

which translates to the labels:

faem0_sx312 37 11 31 46 1 29 35 17 46 44 14 43 7 24 35 9 38 34 18 19 3 44 17 12 23 19 45 41 39 24 30 9 26 11 23 28 39 34 41 9 26 13 34 19 14 24 19 17 35 9 38 37
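
A sketch of this phone-to-label mapping (EESEN recipes use a helper script for this step; the awk below is an illustrative stand-in, assuming units.txt maps each phone to its integer label and text holds one transcript per line with a leading utterance ID):

# Map each phone in the transcript to its integer label, keeping the utterance ID (sketch).
awk 'NR==FNR { id[$1]=$2; next }
     { printf("%s", $1);
       for (i=2; i<=NF; i++) printf(" %s", id[$i]);
       print "" }' units.txt text > labels.tr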

From the above example there is no mention of a blank, and sil is treated as a phoneme, hence it is replaced by label 37 above and everywhere else. Also, on creating the label.counts file I get 48 values: [ 146177.5 2292.5 2266.5 1865.5 728.5 3892.5 1934.5 2181.5 820.5 12518.5 2432.5 2376.5 1864.5 3277.5 951.5 630.5 908.5 4138.5 2271.5 2215.5 1191.5 1660.5 4248.5 7370.5 4626.5 1013.5 3794.5 4425.5 3566.5 6896.5 1220.5 1653.5 304.5 2588.5 4681.5 6176.5 1317.5 8283.5 3948.5 745.5 500.5 1952.5 1994.5 7219.5 2216.5 995.5 3682.5 149.5 ]

So you're suggesting I manually add the sum of all these values, which is 284170.0, as the first value of label.counts to make it 49 entries? Could you please let me know why this is required?

fmetze commented 7 years ago

Are the analyze-counts executables of Kaldi and Eesen different? All the count files that I saw from Eesen contain integer numbers, and the first number is very large, because it contains the sum of all the other counts. This one seems to be different?

razor1179 commented 7 years ago

Yes, I believe they are different. After sourcing the path in EESEN and running the command, I was able to get the label counts as [ 146177 2292 2266 1865 728 3892 1934 2181 820 12518 2432 2376 1864 3277 951 630 908 4138 2271 2215 1191 1660 4248 7370 4626 1013 3794 4425 3566 6896 1220 1653 304 2588 4681 6176 1317 8283 3948 745 500 1952 1994 7219 2216 995 3682 149 ], but that is still 48 counts, not 49.

razor1179 commented 7 years ago

My mistake: the lexicon_numbers.txt I created started with 0 instead of 1; I've fixed it now. Moving forward, do I need to feed only the 48 labeled probabilities to the TLG.fst? Also, when calculating the prior probabilities to divide the output of my RNN by, do I use the count for blank or leave it out? Please let me know.
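
For what it's worth, the usual CTC decoding setup normalizes all 49 counts, blank included, into priors and divides the network posteriors by them; EESEN's decoding binaries consume the counts file directly, so the awk below is only an illustrative sketch of that normalization, under the same file-layout assumption as above:

# Turn label.counts (one bracketed line, blank count first) into log-priors (sketch).
awk '{s=0; for(i=2;i<NF;i++) s+=$i;
      for(i=2;i<NF;i++) printf("%g ", log($i/s));
      print ""}' $dir/label.counts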

razor1179 commented 7 years ago

@fmetze, I noticed that for generating the L.fst, EESEN uses tokens.txt rather than units.txt for the --isymbols option. I am quite unfamiliar with FST generation, so is this by design, or can it be changed? Or is it a requirement because you use tokens to generate the FST T.fst? If so, can I just use units.txt to create the first FST?

fmetze commented 7 years ago

tokens.txt contains the epsilon "token" as well as the disambiguation symbols (#1, #2, ...), which you will need to insert into the FST to be able to make words unique. If you only have a units.txt, you can probably use utils/add_lex_disambig.pl to find out how many and which disambiguation symbols you need - but I haven't tried this.
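
A sketch of that route, following what EESEN's utils/ctc_compile_dict_token.sh does ($dir, the file names, and the use of --pron-probs are assumptions):

# Count the disambiguation symbols the lexicon needs, then assemble tokens.txt (sketch).
ndisambig=$(utils/add_lex_disambig.pl --pron-probs $dir/lexiconp.txt $dir/lexiconp_disambig.txt)
( echo '<eps>'; echo '<blk>';
  awk '{print $1}' $dir/units.txt;
  for n in $(seq 0 $ndisambig); do echo "#$n"; done ) | \
  awk '{print $1, NR-1}' > $dir/tokens.txt   # numbers symbols from 0, matching the table above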