kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org
Other
14.24k stars 5.32k forks source link

using StateId = int64 #3177

Closed shakingWaves closed 5 years ago

shakingWaves commented 5 years ago

@danpovey the normal case: StateId = int32, when change StateId frome int32 to int64, some errors will occur. such as function DiscriminativeSupervisionSplitter::PrepareLattice. if replace int32 with StateId if function DiscriminativeSupervisionSplitter::PrepareLattice, it will be ok. But, Similar problems exist in other functions.

galv commented 5 years ago

I believe OpenFST hard-codes StateId to be an int32 in its code base. Why are you trying to change that type?

danpovey commented 5 years ago

I wouldn't object to those things being changed where you find such errors, since the change would be harmless with OpenFST as it is.

On Thu, Mar 28, 2019 at 10:34 AM Daniel Galvez notifications@github.com wrote:

I believe OpenFST hard-codes StateId to be an int32 in its code base. Why are you trying to change that type?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaldi-asr/kaldi/issues/3177#issuecomment-477621390, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVu2fIML3BTspA_Lx-HIMfM53rsdbjks5vbNMLgaJpZM4cPdmn .

galv commented 5 years ago

I think it is good practice to use StateId instead of int32 if you really meant int32, as this makes the code more generic.

But OpenFST, if you look at fst/arc.h, seems to have using StateId = int; (int is 32-bits, by the way) for all of its arc types. It seems to me that you would need to use an entirely new arc type if you wanted to make the kinds of changes that you are proposing.

shakingWaves commented 5 years ago

@galv @danpovey I intend to use very huge G.fst to produce HCLG,But it will produce many WFST state, So, the state number will exceed the upper limit of int32. replacing int32 with int64 of StateId in arc.h will avoid such problem.

danpovey commented 5 years ago

That's generally not a good idea. It will be super slow to decode. Better to do lattice rescoring or biglm decoding (although biglm decoding is slow).

On Thu, Mar 28, 2019 at 10:17 PM shakingWaves notifications@github.com wrote:

@galv https://github.com/galv @danpovey https://github.com/danpovey I intend to use very huge G.fst to produce HCLG,But it will produce many WFST state, So, the state number will exceed the upper limit of int32. replacing int32 with int64 of StateId in arc.h will avoid such problem.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaldi-asr/kaldi/issues/3177#issuecomment-477839765, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVu61Wsv1T0uZdp6gOkMFF--I6HIfBks5vbXengaJpZM4cPdmn .

kkm000 commented 5 years ago

the state number will exceed the upper limit of int32.

A meager FST with just 4 billion states (the smallest warranting the extension) and a single arc per state would use up 128GB of memory only for the data it contains, if I drop it all into a big bag of integers and floats and shake it well; this is not counting memory for the data structures to make sense of this pile of bits. Think 200-300GB of RAM realistically. I'd wait till the year 2050 or so before attempting the beam search in such a graph, if I were you, provided the Moore's law holds.