Some methods don't work with unconventional lattices

GNroy commented 3 years ago

I'm experimenting with wordpiece-based models and trying to reduce GPU memory consumption when using k2.intersect_dense for denominator computing. For this I'm trying to apply an alternative CTC topology proposed in EESEN paper. It does reduce GPU memory consumption, but some methods, like shortest_path or get_tot_scores, fail on the resulting lattices. I attached a .zip with a notebook to reproduce the issue: alt_ctc_topo.zip My software setup:

Docker image nvcr.io/nvidia/pytorch:21.03-py3 (python 3.8, cuda 11.2, torch 1.9.0a0)
k2 from commit 6326266a1a242901449767a00a08b3e8cb4dc6e9

I would appreciate any help or advice. Thanks!

danpovey commented 3 years ago

Thanks! We are working on other ways to reduce memory consumption when using larger units. One aspect is to use intersect_dense_pruned, which supports having a single shared graph. Also we may create a den graph that only has seen bigrams. Can you show what the error is in this issue so I don't have to look in the archive? (But thanks for sending it.)

On Thu, May 27, 2021 at 7:44 PM Aleksandr Laptev @.***> wrote:

I'm experimenting with wordpiece-based models and trying to reduce GPU memory consumption when using k2.intersect_dense for denominator computing. For this I'm trying to apply an alternative CTC topology proposed in EESEN paper https://arxiv.org/abs/1507.08240. It does reduce GPU memory consumption, but some methods, like shortest_path or get_tot_scores, fail on the resulting lattices. I attached a .zip with a notebook to reproduce the issue: alt_ctc_topo.zip https://github.com/k2-fsa/k2/files/6553562/alt_ctc_topo.zip My software setup:

Docker image nvcr.io/nvidia/pytorch:21.03-py3 (python 3.8, cuda 11.2, torch 1.9.0a0)

k2 from commit 6326266 https://github.com/k2-fsa/k2/commit/6326266a1a242901449767a00a08b3e8cb4dc6e9

I would appreciate any help or advice. Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/746, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOZCINS23SZTFBZKVLDTPYWBHANCNFSM45UC2G2Q .

GNroy commented 3 years ago

Sure, it is

[F] /k2/k2/csrc/fsa_utils.cu:563:k2::GetStateBatches(k2::FsaVec&, bool)::<lambda(int32_t)> Check failed: dest_state > state_idx01 (1 vs. 1)

[ Stack-Trace: ] /opt/conda/lib/python3.8/site-packages/k2-0.3.4.dev20210518+cuda11.2.torch1.9.0a0-py3.8-linux-x86_64.egg/libk2_log.so(k2::internal::GetStackTraceabi:cxx11 )+0x58) [0x7fa77401e3c8]
/opt/conda/lib/python3.8/site-packages/k2-0.3.4.dev20210518+cuda11.2.torch1.9.0a0-py3.8-linux-x86_64.egg/libk2context.so(k2::internal::Logger::~Logger()+0x58 ) [0x7fa768ae7148]
/opt/conda/lib/python3.8/site-packages/k2-0.3.4.dev20210518+cuda11.2.torch1.9.0a0-py3.8-linux-x86_64.egg/libk2context.so(k2::GetStateBatches(k2::Ragged<k2::A rc>&, bool)+0x411) [0x7fa768bb3671]
/opt/conda/lib/python3.8/site-packages/k2-0.3.4.dev20210518+cuda11.2.torch1.9.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x635cf) [0x7fa7 69dbc5cf]
/opt/conda/lib/python3.8/site-packages/k2-0.3.4.dev20210518+cuda11.2.torch1.9.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x2b1b5) [0x7fa7 69d841b5]

csukuangfj commented 3 years ago

Can you show what the error is in this issue so I don't have to look in the archive? (But thanks for sending it.)

@danpovey You can view the notebook with colab by clicking

https://colab.research.google.com/drive/19Ao_XKRAWdw7xoEdC7eWMhzGlrpMf4d-?usp=sharing

danpovey commented 3 years ago

The comment for that failed assertion said:

        // if the following fails, it's either a code error or the input FSA had                                                                                                                                 
        // cycles.           
        K2_CHECK_GT(dest_state, state_idx01);

... so we should investigate whether the input to that operation had cycles in it (you could print out the properties_str, for instance: _k2.fsa_properties_as_str(fsa.properties)). In fact, the code should have been checking those properties before it even tried to compute the state batches, so it's kind of a bug if we got to that assertion regardless (you could make a PR if you can figure out the fix).

If it had cycles, the question is why, you can perhaps try to trace back and see where they originated. I suspect the issue is, the topo had epsilon self-loops and those matched the epsilon self-loops you had added to the shifted DenseFsaVec. The right thing here might be to call remove_epsilon_self_loops() after that composition. Conceptually this is part of composition, but because we separate things this way, the algorithm can't know whether the matched epsilon self-loops were "legitimate" or not (i.e. whether they arose from self-loops that were really in the original graphs, or self-loops that were added as a device to match epsilons on the other source). In that case you know that they arose from the added epsilon self-loops, because only the topo had "legitimate" epsilon self-loops.

danpovey commented 3 years ago

.. Incidentally, before we implement the fsa_replace operation, we could use a variant of the approach you described to create the den graph containing only seen bigrams.

What I recommend is to designate some 'large' symbol-id, or, say, -2, as blank, and create a version of your topo replacing blank with that value, but also with epsilons for connectivity. We can then compose that topo with the (sparse bigram den graph, with epsilon-self-loops added). The resulting den graph would be intermediate in size between yours, and the full bigram den graph. We could introduce some kind of min-count as needed to keep the den graph to a reasonable size. If you could make a PR for this it would be great.. I don't know whether you might have any constraints on these kinds of contributions.

danpovey commented 3 years ago

... obviously at the end we'd have to replace the -2 or other symbol with 0. That should be doable just by assigning to the .labels property of the Fsa, using PyTorch functions, without any intersection.

galv commented 3 years ago

I don't know whether you might have any constraints on these kinds of contributions.

Hi Dan, dropping in here to say I got legal approval for us at NVIDIA (@GNroy is part of NVIDIA in case you were not already aware) to write code for K2 and release it publicly. As you can tell, I've been less involved than I would ideally like to be (I haven't read your suggestions yet, just skimmed them), but figured I should answer this.

danpovey commented 3 years ago

Great!! Nice to hear it.

On Fri, May 28, 2021 at 12:51 PM Daniel Galvez @.***> wrote:

I don't know whether you might have any constraints on these kinds of contributions.

Hi Dan, dropping in here to say I got legal approval for us at NVIDIA ( @GNroy https://github.com/GNroy is part of NVIDIA in case you were not already aware) to write code for K2 and release it publicly. As you can tell, I've been less involved than I would ideally like to be (I haven't read your suggestions yet, just skimmed them), but figured I should answer this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/746#issuecomment-850128929, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOYUTO4SX6G4A435NGTTP4OM7ANCNFSM45UC2G2Q .

danpovey commented 3 years ago

Also, I notice you are using intersect() with treat_epsilons_specially = False. It may be better to call intersect_device() and set sorted_match_a = True, assuming a is arc-sorted. I know this is not a very ideal interface. [EDIT: I now see that if a is arc-sorted, intersect() does automatically add that flag.]

GNroy commented 3 years ago

@danpovey Thank you very much for your suggestions! I will try them and share the results here later.

GNroy commented 3 years ago

It's been a while, but I finally have some updates. I managed to make the EESEN topology work. The problem was with the resulting lattices not being top-sorted. That's why I added another topology that is as compact but produces top-sorted lattices. The current issue is that the resulting lattice varies depending on the device on which the intersections are performed. It is top-sorted in CPU and everything's fine but it's not in GPU. To reproduce the issue you can follow a colab notebook below:

https://colab.research.google.com/drive/1Xa1dUu6kEyufCMfOP7l2GgKTjhKrJdsb?usp=sharing

Thanks!

danpovey commented 3 years ago

Cool! I will respond in more detail to-morrow.

danpovey commented 3 years ago

I am working on the colab stuff and trying to understand what you did. I don't understand why you are doubling the number of symbols with the padding yet.

I think I realized that we can solve this issue much more simply. Suppose we have symbols 1, 2,3, plus epsilon/blank (0). Then we can use the following for the topology, which has way fewer arcs than the conventional one: linear, not quadratic. It is not deterministic, but that should be OK.

0  0  0  0   0.0
0  0  1   1   0.0
0  1   1  1    0.0
0   0  2  2  0.0
0  2  2  2  0.0
0  0  3  3  0.0
0  3  3  3  0.0
0  4  -1 -1  0.0
1  1  1  0   0.0
1  0  1  0  0.0
2  2  2  0  0.0
2  0  2  0  0.0
3  3  3  0  0.0
3  0  3  0  0.0
4

danpovey commented 3 years ago

.. ah, I realize now that this is not quite equivalent to what we currently have, because it allows consecutive symbols without blank in between that correspond to output labels (i.e. not repeats). I believe we found that this makes a small wer difference, but I don't recall whether this was for pure CTC systems or for LF-MMI.

GNroy commented 3 years ago

@danpovey Thanks for looking at the notebook and your suggested topology! I'll try it against other topologies.

I made intersect_dense_padded for two reasons:

I was getting different lattices when I used intersect_dense_eesen (the one with three intersections) on the CPU and GPU. CPU's lattices were top-sorted while GPU's were not. It's most probably a k2 bug/feature. But anyway, I needed an alternative approach for the intersection.
I tried to make the intersection with non-quadratic topos with just one call of k2.intersect_dense while maintaining the same properties (scores, best paths, ...) as those the intersect_dense_eesen delivers.

Actually, none of the topos (except the quadratic one) is correct in general. They all have this insertion issue. And it really does not affect WER much (thanks to the lexicon and LM) at decoding time. So I'm trying to find a topo with a good tradeoff between memory consumption, performance, and lattice properties (mostly scores and best paths).

k2-fsa / k2

Some methods don't work with unconventional lattices #746