kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Rewrite ARPA-FST compiler #422

Closed kkm000 closed 8 years ago

kkm000 commented 8 years ago

Discussion: https://groups.google.com/d/topic/kaldi-developers/m_xloj88tsk/discussion

danpovey commented 8 years ago

Just a note: Guoguo wrote some arpa-reading code in the code that creates the 'const-arpa' object. That code is maybe not perfect, but it's a lot better than the old arpa2fst code. Some of this could be re-used, and regardless, we should try to keep them in sync as much as possible.

kkm000 commented 8 years ago

I think I'll do two PRs for this, to make reviewing easier: one factoring out the ARPA file parser into a separate object from const-arpa, and another with the new compiler using the parser, and tests.

danpovey commented 8 years ago

OK, cool.


kkm000 commented 8 years ago

@danpovey: What do you think I should keep of kaldi-lm and kaldi-lmtable? There is a class LangModelFst that does not look complete to me. The intention was to make it read and write grammars in different formats, namely ARPA, some textual representation of a WFST, and something else defined in an enum but never mentioned in the code. Do you want to keep that, or do you care only about the code needed to compile an ARPA file to an FST, i.e. what arpa2fst does, and that is it?

Also, I understand IRST LM support is not needed any more, correct?

danpovey commented 8 years ago

No, that stuff is very old and there is no plan to continue it; we have other mechanisms now. Just implement arpa2fst. Dan


kkm000 commented 8 years ago

Another question. It seems trivially easy at this point to extend arpa2fst to optionally take words.txt, and thus eliminate the separate utils/find_arpa_oovs.pl pass (and the remove_oovs.pl call in the next pass), and also essentially do utils/eps2disambig.pl | utils/s2eps.pl during FST construction. Do you think it is a good idea to add options for just that? I mean, instead of

gunzip -c $lm | utils/find_arpa_oovs.pl $out_dir/words.txt \
  > $out_dir/oovs_${lm_base}.txt

gunzip -c $lm \
  | egrep -v '<s> <s>|</s> <s>|</s> </s>' \
  | arpa2fst - | fstprint \
  | utils/remove_oovs.pl $out_dir/oovs_${lm_base}.txt \
  | utils/eps2disambig.pl | utils/s2eps.pl \
  | fstcompile --isymbols=$out_dir/words.txt --osymbols=$out_dir/words.txt \
    --keep_isymbols=false --keep_osymbols=false

could do just

gunzip -c $lm \
  | arpa2fst --symbols=$out_dir/words.txt --disambig=#0 --s2eps=true 

With no options given, the old behavior is preserved. Would that be useful?

danpovey commented 8 years ago

Sure, makes sense I guess; just make sure it would work for arbitrary UTF-8-encoded Unicode.

Dan


kkm000 commented 8 years ago

Good point about UTF-8, thanks; I'll make sure to have tests for that too.

kkm000 commented 8 years ago

Question on performance vs. complexity of code. For most practical LM sizes (up to and including 4-gram models with up to 2 million words), it is possible to speed up compilation by a factor of about 2, as my tests show, by packing three 21-bit integer symbol indices into a single 64-bit word when using the symbol sequence as the index into an associative array of states.

Pro: 2x+ faster compilation of most models.

Con: nearly double the implementation (larger LMs must also be supported), and bit-twiddling code is less fun to read.

So the question is: should I bother with the optimized implementation at all, or just use the "slow" but virtually size-unlimited one in all cases? Compiling an ARPA LM is not the most frequently invoked operation in Kaldi, so my inclination is not to worry about this optimization, but I still wanted to mention the option.
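
To make the trade-off concrete, here is a minimal sketch of the packing scheme described above; `PackHistory` is a hypothetical helper, not the actual Kaldi code. Symbols are stored offset by one so that shorter histories (e.g. `{}` vs. `{0}`) pack to distinct keys.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Up to three symbol indices, each below 2^21 - 1, fit into one
// 64-bit word (3 * 21 = 63 bits).  Illustrative sketch only.
uint64_t PackHistory(const std::vector<uint32_t>& syms) {
  assert(syms.size() <= 3);
  uint64_t key = 0;
  for (uint32_t s : syms) {
    assert(s + 1 < (1u << 21));   // offset by 1; 0 marks "no symbol"
    key = (key << 21) | (s + 1);
  }
  return key;
}
```

The resulting `uint64_t` hashes and compares in a single machine word, which is where the speedup over a vector-valued key would come from.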

danpovey commented 8 years ago

Is this just used in computing a hash value? Can you show me some of the code?


kkm000 commented 8 years ago

Not only for hashing; it is actually a map: symbol sequence -> state. It is needed in both compilers (the main difference is the super-compact FST representation in the const-arpa model, if I understand correctly). In arpa2fst, the map uses actual concatenated words (https://github.com/kaldi-asr/kaldi/blob/master/src/lm/kaldi-lmtable.cc#L46). I think it must be very slow, but that is the part I am rewriting completely. In const-arpa-lm, a very similar map already exists (https://github.com/kaldi-asr/kaldi/blob/master/src/lm/const-arpa-lm.cc#L262).

danpovey commented 8 years ago

Hm. I'm not sure. I'd rather not have the code too complicated, but if it's a significant speedup it would be OK I guess. Dan


kkm000 commented 8 years ago

Ah, I think I understand your question now. The thing in question is the key of the unordered_map. The longest vector is (grammar order) - 1 symbols long, so for 4-grams it is 3; this can be packed into one 64-bit word, leaving 21 bits per symbol.

kkm000 commented 8 years ago

Ok, I'll do both.

kkm000 commented 8 years ago

I had a first stab at part 2 of the fix (https://github.com/kkm000/kaldi/blob/arpa-2/src/lm/arpa-lm-compiler.cc). I am not sure the added templated-code intricacy is worth the return I am getting on it. To give some numbers, the following is from compiling an 8M 3-gram model: bin/arpafst as the baseline, then the new code without the map optimization (vector key), and last with the optimization (packed key).

                     bin/arpafst (old)   New, vector   New, packed
Run time, s                   48              47            35
Memory, MB                  1200             960           650
Page faults, ×10³            290             230           150

The overall improvement is not that large. Although in an isolated test the map lookups are 2+ times faster, that is apparently not the whole story; the map is just one hot spot.

The code is more complex than it would be without the two paths. Since the decision has to be made after the ARPA header is read, the implementation class, templated on the key type, is instantiated at runtime (https://github.com/kkm000/kaldi/blob/arpa-2/src/lm/arpa-lm-compiler.cc#L251), and a lot of parameters are passed to it. The key types are also verbose to the verge of ugliness (https://github.com/kkm000/kaldi/blob/arpa-2/src/lm/arpa-lm-compiler.cc#L27) (or bring your own verge: the code looks plain ugly and hard to follow to me).

So, is this 25% CPU and 33% RAM improvement worth keeping the dual implementation? If we drop the fork on the key type and always use the "general" key, I'd kill a third of the code and all templated types in the file. @danpovey, @jtrmal, what do you think?

jtrmal commented 8 years ago

I think it should be considered because of the memory footprint. Haven't looked at the code yet. y


jtrmal commented 8 years ago

Looked at the code; I've seen much worse. :) It's not that much more complicated. To me, it seems OK. y.


danpovey commented 8 years ago

Also, if it's hidden in the .cc file then it's not too bad. Is it hidden in the .cc file?

Some parts need more documentation, e.g. what does Tails() do? And I don't like that you define typedef float Weight;, as this is confusable with Weight in OpenFst. I'd rather you just use float directly.

Dan


kkm000 commented 8 years ago

OK, let's keep it then. This is a draft, so I'll add more comments and try to make it more readable otherwise. So far, I have treated it as more of a throwaway exercise.

Re typedef Weight: using a float is OK by me, but I would keep StateId and Symbol distinct, as they are too easy to confuse with each other if both are just called int32. I initially thought about writing typedef fst::StdVectorFst::Arc::Weight Weight, but then this is a float, and everything would unravel if it were not (rather, not implicitly convertible to and from float, but that's a technicality). So it is essentially the same FST Weight, only spelled explicitly.

kkm000 commented 8 years ago

Oh, and the .h file is totally clean, all the complexity is in the .cc: https://github.com/kkm000/kaldi/blob/arpa-2/src/lm/arpa-lm-compiler.h

danpovey commented 8 years ago

But fst::StdVectorFst::Arc::Weight is not a float; it's a class containing a float. That's why it's a bit confusing to me.
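
To illustrate the distinction: OpenFst's TropicalWeight is a class that wraps a float and exposes it via Value(); it does not convert implicitly to or from float. The stand-in below is a minimal mock for illustration, not the real OpenFst type.

```cpp
#include <cassert>

// Mock of a TropicalWeight-like wrapper: holds a float, but a value of
// this type cannot be used where a plain float is expected.
class MockTropicalWeight {
 public:
  explicit MockTropicalWeight(float value) : value_(value) {}
  float Value() const { return value_; }

 private:
  float value_;
};
```

So `typedef float Weight;` and the OpenFst `Weight` are genuinely different types, which is the source of the confusion.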


kkm000 commented 8 years ago

D-oh! Right. Then it is confusing, indeed!