k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0
1.12k stars 213 forks source link

Create separate internal reading functions for K2 FSAs and OpenFST FSAs #207

Closed chenguoguo closed 3 years ago

chenguoguo commented 4 years ago

We will avoid sort. The bottom line, The K2 reading function should be simple and fast, while the OpenFST reading function may have to address corner cases but we will only address those when we see necessary.

danpovey commented 4 years ago

We can just require that the start state be state 0; if necessary we can top-sort the input.

On Sat, Oct 3, 2020 at 11:40 AM Guoguo Chen notifications@github.com wrote:

  • For K2 FSAs, we assume all source states are in non-descending order, and there's only 1 final state with no weight.
  • For OpenFST FSAs, it's possible there will be more than one final states with weight. We assume uses will get the FSTs from fstprint, which should be in order, except the start state.

We will avoid sort. The bottom line, The K2 reading function should be simple and fast, while the OpenFST reading function may have to address corner cases but we will only address those when we see necessary.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLO2KBUZZSDOX2N62DSTSI2MKHANCNFSM4SCMHUFQ .

csukuangfj commented 4 years ago

Can it be implemented in Python?

danpovey commented 4 years ago

I was thinking that if we ever have to process inputs from OpenFST we can just process them before we deal with them. If necessary we can add some conversion code in C++ or python. Any sorting should be done in C++.

On Sun, Oct 4, 2020 at 1:30 PM Fangjun Kuang notifications@github.com wrote:

Can it be implemented in Python?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703205456, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLO2XSOKUAD6O7MW7GYLSJACBDANCNFSM4SCMHUFQ .

chenguoguo commented 4 years ago

I agree. For OpenFST input I think it’s fair to ask users to run things like topsort if necessary, and then we handle the rest. This OpenFST reading/conversion won’t happen a lot so efficiency won’t be a big issue, but we can probably do that in C++.

On Sat, Oct 3, 2020 at 10:39 PM Daniel Povey notifications@github.com wrote:

I was thinking that if we ever have to process inputs from OpenFST we can just process them before we deal with them. If necessary we can add some conversion code in C++ or python. Any sorting should be done in C++.

On Sun, Oct 4, 2020 at 1:30 PM Fangjun Kuang notifications@github.com wrote:

Can it be implemented in Python?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703205456, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAZFLO2XSOKUAD6O7MW7GYLSJACBDANCNFSM4SCMHUFQ

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703206091, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ57E3NQSGIHDYEWD32DSTSJADAXANCNFSM4SCMHUFQ .

danpovey commented 4 years ago

Incidentally, I just realized that OpenFST top-sort doesn't work for us because for cyclic input it leaves the input unchanged. I just noticed that fstcompile seems to reorder the states, though, if the start-state isn't state 0 (??) But I have definitely seen it print things that had start-state != state 0.

printf "1 0 0 0\n 1 1 0 0\n 0 0.0" | ~/kaldi/tools/openfst/bin/fstcompile | ~/kaldi/tools/openfst/bin/fstprint

0 1 0 0

0 0 0 0

1

On Sun, Oct 4, 2020 at 3:04 PM Guoguo Chen notifications@github.com wrote:

I agree. For OpenFST input I think it’s fair to ask users to run things like topsort if necessary, and then we handle the rest. This OpenFST reading/conversion won’t happen a lot so efficiency won’t be a big issue, but we can probably do that in C++.

On Sat, Oct 3, 2020 at 10:39 PM Daniel Povey notifications@github.com wrote:

I was thinking that if we ever have to process inputs from OpenFST we can just process them before we deal with them. If necessary we can add some conversion code in C++ or python. Any sorting should be done in C++.

On Sun, Oct 4, 2020 at 1:30 PM Fangjun Kuang notifications@github.com wrote:

Can it be implemented in Python?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703205456, or unsubscribe <

https://github.com/notifications/unsubscribe-auth/AAZFLO2XSOKUAD6O7MW7GYLSJACBDANCNFSM4SCMHUFQ

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703206091, or unsubscribe < https://github.com/notifications/unsubscribe-auth/ACZ57E3NQSGIHDYEWD32DSTSJADAXANCNFSM4SCMHUFQ

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-703213177, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOYDRNRO55FG3IWVHF3SJANBVANCNFSM4SCMHUFQ .

csukuangfj commented 4 years ago

Can we provide some command line tools similar to lhotse that convert openfst format to k2 format offline so that we can keep the c++ code simple?

danpovey commented 4 years ago

Mm, yeah maybe. I'm OK with having the C++ code reorder the states, in principle (e.g. so start state is state 0) but I want the code for reading OpenFST input to be a separate function from the one for k2 input, so it doesn't leak bugs into, or slow down, the simple case.

On Tue, Oct 6, 2020 at 11:04 AM Fangjun Kuang notifications@github.com wrote:

Can we provide some command line tools similar to lhotse that convert openfst format to k2 format offline so that we can keep the c++ code simple?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-704000139, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOZBW5XZUNCFXBNHJX3SJKCLRANCNFSM4SCMHUFQ .

chenguoguo commented 4 years ago

@danpovey double checking, for WFST

0 1 1 1 0.1
0 2 2 2 0.2
1 2 3 3 0.3
2

We would still create a super final state, because there is no arc entering the original final state with label -1, right?

Regarding the start state, is it required by K2 that there will only be one start state with state 0? If yes we may have to add a super start state as well?

danpovey commented 4 years ago

OK, we can require that in the text form, the final-state must be listed. However I'm not sure what to do about the empty FSA, i.e. with no states. That is actually supported by the tools as-is, but perhaps we could settle on always having at least 2 states (and just supporting an empty FSA as one not having any arcs). (?). Thinking of the pros and cons of this.

There is only one start state, this is the same as in OpenFST.

danpovey commented 4 years ago

... one possibility is to say that for empty FSAs, we just have no arcs, and to make it possible to represent the empty FSA in a vector of FSAs (and write to disk) we actually store the FSA on disk as 2 tensors: one being the row_splits of FSAs -> arcs, equivalent to fsa_vec.RowSplits(2)[fsa_vec.RowSplits(1)], and the 2nd being the tensor of arcs. And the equivalent in text form (although I dont recommend using text form much) would be to represent the empty FSA as just a blank line, with regular FSAs terminated by a single blank line if we serialize multiple FSAs to a single text string. Again, I'm not sure how much we'll want to serialize vectors of FSAs to text strings, i.e. whether this is needed.

chenguoguo commented 4 years ago

I feel we won't be using the text format much. Scenarios that I can think of:

  1. Debugging
  2. Loading pre-built FSAs from other tools (mostly one time thing)

Regarding the empty FSA, I was wondering if it made sense to always require an empty k2 FSA (FSA with no arcs) to have 2 states (0 and 1) with no arc? I noticed that when we represent FsaVec as a tensor, we require the FSAs in that tensor to have states anyways. This way, we can perhaps treat empty FSAs as normal FSAs (with property empty), and represent them with just one tensor: the tensor of arcs. We will have to clean up the document/code (e.g., error flag of FsaVecFromArray1) though as the definition of empty FSAs is not very consistent in the current code.

Guoguo

On Tue, Oct 6, 2020 at 1:38 AM Daniel Povey notifications@github.com wrote:

... one possibility is to say that for empty FSAs, we just have no arcs, and to make it possible to represent the empty FSA in a vector of FSAs (and write to disk) we actually store the FSA on disk as 2 tensors: one being the row_splits of FSAs -> arcs, equivalent to fsa_vec.RowSplits(2)[fsa_vec.RowSplits(1)], and the 2nd being the tensor of arcs. And the equivalent in text form (although I dont recommend using text form much) would be to represent the empty FSA as just a blank line, with regular FSAs terminated by a single blank line if we serialize multiple FSAs to a single text string. Again, I'm not sure how much we'll want to serialize vectors of FSAs to text strings, i.e. whether this is needed.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-704120914, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ57E2GDREESQAUUTGQT7TSJLJQ7ANCNFSM4SCMHUFQ .

danpovey commented 4 years ago

of:

  1. Debugging
  2. Loading pre-built FSAs from other tools (mostly one time thing)

Regarding the empty FSA, I was wondering if it made sense to always require an empty k2 FSA (FSA with no arcs) to have 2 states (0 and 1) with no arc?

I'm leaning against this, simply because it's more natural in things like pruning / removing disconnected states, to end up with no states at all. My other PR is modifying the serialized representation to make it possible to represent such empty FSAs.

But for the text form, I think it makes sense to require to specify the final state where there is one, and otherwise we can let the empty string be a valid FSA in text form (with no states).

I noticed that when we represent FsaVec as a tensor, we require the FSAs in that tensor to have states anyways. This way, we can perhaps treat empty FSAs as normal FSAs (with property empty), and represent them with just one tensor: the tensor of arcs. We will have to clean up the document/code (e.g., error flag of FsaVecFromArray1) though as the definition of empty FSAs is not very consistent in the current code.

Guoguo

On Tue, Oct 6, 2020 at 1:38 AM Daniel Povey notifications@github.com wrote:

... one possibility is to say that for empty FSAs, we just have no arcs, and to make it possible to represent the empty FSA in a vector of FSAs (and write to disk) we actually store the FSA on disk as 2 tensors: one being the row_splits of FSAs -> arcs, equivalent to fsa_vec.RowSplits(2)[fsa_vec.RowSplits(1)], and the 2nd being the tensor of arcs. And the equivalent in text form (although I dont recommend using text form much) would be to represent the empty FSA as just a blank line, with regular FSAs terminated by a single blank line if we serialize multiple FSAs to a single text string. Again, I'm not sure how much we'll want to serialize vectors of FSAs to text strings, i.e. whether this is needed.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-704120914, or unsubscribe < https://github.com/notifications/unsubscribe-auth/ACZ57E2GDREESQAUUTGQT7TSJLJQ7ANCNFSM4SCMHUFQ

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/207#issuecomment-704623215, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLO2XBTNJM37OYADYOQLSJOY7PANCNFSM4SCMHUFQ .