pytorch / nestedtensor

[Prototype] Tools for the concurrent manipulation of variably sized Tensors.
BSD 3-Clause "New" or "Revised" License

Comment about nested tensor vs. ragged tensor #344

Open danpovey opened 3 years ago

danpovey commented 3 years ago

Guys, this is a very general comment and FYI...

To some extent you guys seem to be viewing NestedTensor as a generic ragged-tensor data structure, similar to TensorFlow's RaggedTensor. I understand that PyTorch historically came from computer-vision applications, that you are viewing it from that perspective, and that the nested tensor data structure fulfils a need there. But I want to encourage you not to view NestedTensor as filling the RaggedTensor "slot" in the design of PyTorch, and instead to keep open the possibility of using essentially the same design as TensorFlow for implementing ragged tensors.

Here's my perspective: I'm working on this project "k2" https://github.com/k2-fsa/k2 which implements differentiable algorithms on finite state automata and transducers. I independently came up with a design that's basically equivalent to TensorFlow's RaggedTensor, involving row_splits and row_ids and linear arrays of values (I originally had different names for them, but renamed them for compatibility with the TF notation). This turns out to be a very generic and useful design, especially when you are using it as a backbone for quite general algorithms.
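ance
To make the layout concrete, here is a minimal sketch of the row_splits/row_ids encoding written with plain PyTorch tensors. It is illustrative only and does not reflect k2's or TensorFlow's actual APIs:

```python
import torch

# A ragged "tensor" of 3 rows with lengths 2, 0 and 3:
#   [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
values = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])

# row_splits[i]:row_splits[i+1] is the slice of `values` belonging to row i.
row_splits = torch.tensor([0, 2, 2, 5])

# row_ids[j] is the row that values[j] belongs to (the "inverse" map).
row_ids = torch.tensor([0, 0, 2, 2, 2])

# The two encodings are interconvertible:
lengths = row_splits[1:] - row_splits[:-1]                     # [2, 0, 3]
recomputed_row_ids = torch.repeat_interleave(torch.arange(3), lengths)
assert torch.equal(row_ids, recomputed_row_ids)

# Row 2 as a dense tensor:
print(values[row_splits[2]:row_splits[3]])                     # tensor([3., 4., 5.])
```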

Although in k2 we implement differentiable FSA algorithms, the differentiability is a red herring when it comes to understanding the underlying design, because we provide differentiability from the "top down", by understanding the behavior of the whole algorithm, rather than making individual operations differentiable. Most nontrivial low-level algorithms or operations in k2 are actually operating on tensors of integers or of structs, and are not naturally differentiable. It wouldn't be very feasible or efficient to do what we are doing by using the PyTorch or TensorFlow approach of defining low-level operations and having the user string them together in Python. Instead we implement nontrivial algorithms like FSA intersection in C++/CUDA, issuing special purpose kernels that make use of the RaggedTensor data structure and primitive operations that it supports.
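
To illustrate what "top down" differentiability means in PyTorch terms, here is a minimal sketch using torch.autograd.Function: the whole algorithm is a single autograd node whose backward is derived by hand, rather than being composed from many differentiable primitives. The body of forward here is just a placeholder, not k2's actual code:

```python
import torch

class WholeAlgorithm(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores):
        # Imagine a nontrivial C++/CUDA routine here (e.g. FSA intersection)
        # that also stashes whatever bookkeeping the backward pass needs.
        result = scores.exp().sum()
        ctx.save_for_backward(scores)
        return result

    @staticmethod
    def backward(ctx, grad_out):
        # The gradient of the *whole* algorithm, written once, by hand.
        (scores,) = ctx.saved_tensors
        return grad_out * scores.exp()

scores = torch.randn(5, requires_grad=True)
WholeAlgorithm.apply(scores).backward()
print(scores.grad)
```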

The point I feel is relevant to what you guys are doing is: don't dismiss the row_ids/row_splits approach, because it's a data structure that is very well suited to GPU implementation and can naturally be built with tools like CUB. There are a lot of primitive operations you can implement on this kind of RaggedTensor, such as various kinds of indexing, concatenation, reduction and so on (see the sketch below), that map easily to GPU implementation and are very general-purpose.
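
As a concrete illustration, a couple of such primitives can be sketched directly in terms of standard segmented/gather operations. The helper names are made up for illustration and are not k2's API:

```python
import torch

def ragged_sum(values, row_ids, num_rows):
    # Per-row reduction: a single scatter-add (a segmented reduce on GPU).
    out = torch.zeros(num_rows, dtype=values.dtype, device=values.device)
    return out.index_add_(0, row_ids, values)

def ragged_index(values, row_splits, rows):
    # Select a subset of rows, producing a new (values, row_splits) pair.
    lengths = row_splits[1:] - row_splits[:-1]
    new_lengths = lengths[rows]
    new_row_splits = torch.cat([new_lengths.new_zeros(1), new_lengths.cumsum(0)])
    # Gather the element indices of the chosen rows.
    idx = torch.cat([torch.arange(int(row_splits[r]), int(row_splits[r + 1]))
                     for r in rows.tolist()])
    return values[idx], new_row_splits

values = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
row_splits = torch.tensor([0, 2, 2, 5])
row_ids = torch.tensor([0, 0, 2, 2, 2])

print(ragged_sum(values, row_ids, 3))                   # tensor([3., 0., 12.])
print(ragged_index(values, row_splits, torch.tensor([2, 0])))
```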

cpuhrsch commented 3 years ago

Hello @danpovey,

Thank you very much for this post and your insights! If I understand you correctly, this could potentially be addressed by using NestedTensor as a frontend for more than just one layout. I detailed some use cases for that in a separate RFC. Roughly, the layouts listed there are masked (the usual padding + masking), packed (tensor memory stacked on top of each other linearly in memory), PackedSequence (including cuDNN's custom layout for RNNs, probably best described by this comment) and a plain list (std::vector of Tensors). The row_ids/row_splits layout could join the ranks of these (or potentially even replace PackedSequence under some conditions, if I'm not mistaken) and is indeed very useful, especially for bootstrapping efficient implementations in the way you describe.
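
For concreteness, here is a minimal sketch of how the same batch of variable-length sequences looks in the masked and packed layouts mentioned above. This is illustrative only and not NestedTensor's actual internals:

```python
import torch

seqs = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0, 5.0])]

# "masked" layout: pad to the max length and keep a boolean mask.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)   # shape [2, 3]
mask = torch.tensor([[True, True, False],
                     [True, True, True]])

# "packed"/contiguous layout: concatenate values and record offsets,
# essentially the row_splits idea from the comment above.
values = torch.cat(seqs)                     # tensor([1., 2., 3., 4., 5.])
offsets = torch.tensor([0, 2, 5])

# A plain-list layout is just `seqs` itself (std::vector<Tensor> on the C++ side).
```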

Thank you, Christian