Closed: srush closed this issue 3 years ago.
I propose the following notation for merging and splitting, akin to renaming of tensor axes.
Given the following:
we write for such reshaping.
As an example, given , we'd like to perform the following reshaping:
(height x width x channel) => (arbitrary order hidden bunch)
This results in the feature dimension with type .
Or, since the index set can be anything (not restricted to 0..n), we can just let the index set be a tuple of (height, width, channel) for the feature axis. This just retains the order of the dimensions:
This results in the feature dimension with type .
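As a small executable sketch of the tuple-index-set idea (the dict representation and the sizes H, W, C are my own illustrative assumptions, not part of the proposal), a named tensor can be modeled as a mapping from index assignments to values; bunching then just re-keys entries by the triple, with no memory-layout order involved:

```python
from itertools import product

# Hypothetical sketch: a named tensor as a mapping from index
# assignments to values. Bunching (height, width, channel) into a
# single 'feature' axis re-keys each entry by its (h, w, c) triple,
# so no memory-layout ordering is ever consulted.
H, W, C = 2, 3, 4
A = {(h, w, c): float(h * 100 + w * 10 + c)
     for h, w, c in product(range(H), range(W), range(C))}

# B has one axis, 'feature', whose index set is the set of triples.
B = {('feature', (h, w, c)): v for (h, w, c), v in A.items()}

# Unbunching is the inverse re-keying; again no ordering is needed.
A_back = {idx: v for (_, idx), v in B.items()}
assert A_back == A
```

The point of the sketch is that bunching into a tuple-valued index set is invertible by construction, without choosing any bijection onto 0..n.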
This is a nice description. I guess there are two questions:
1) Do we need a separate notation with \mapsto or can you just write B{feature: } = A{height:, width: , channel: }
2) @davidweichiang doesn't like the automatic bunching by memory layout order, and I am a bit uncomfortable with tuple dimensions. There was a question of whether we could do your second example with a specific but unspecified bijection \rho and what the notation for that might be.
In any case, I think that this notation should end up looking uniform with the renaming notation.
I don't like having to think about memory layout, but the Wikipedia page I linked says that in math there's a standard order, so that made me feel less hesitant about it.
It would be good to think about what the most important use cases are. I hope that some of the use cases would be eliminated by better notation for higher-order tensors.
Technically, \rho is a bijection between ind(shape1) and ind(shape2), right? Not shape1 and shape2 (which would be good enough for renaming but not reshaping)?
I like the idea of leaving \rho unspecified or underspecified, and having the option of defining a specific \rho if needed.
This is a nice description. I guess there are two questions:
1. Do we need a separate notation with \mapsto or can you just write B_{feature: } = A_{height:, width: , channel: }
\mapsto is just the mathematician's way of defining a lambda (I try to make this uniform with renaming). I like your suggestion of B_{feature: } = A_{height:, width: , channel: }.
2. @davidweichiang doesn't like the automatic bunching by memory layout order, and I am a bit uncomfortable with tuple dimensions. There was a question of whether we could do your second example with a specific but unspecified bijection \rho and what the notation for that might be.
My personal feeling towards this is that you have to choose one of these. You have to specify a bijection between the index set of the source shape and the index set of the target shape:
where F is the domain of the values of the new feature axis, and
lambda h, w, c: h*W*C + w*C + c
into F.
Technically, \rho is a bijection between ind(shape1) and ind(shape2), right? Not shape1 and shape2 (which would be good enough for renaming but not reshaping)?
I like the idea of leaving \rho unspecified or underspecified, and having the option of defining a specific \rho if needed.
Yes, I have defined \rho as a bijection between ind(S) and ind(T) (\rho as a generalized renaming operator). In specific contexts, the actual function can be underspecified: just specifying its input/output types would be enough for conveying the intent. Something like
that just specifies the type would be enough.
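As a concrete check of the row-major lambda written out earlier in the thread (NumPy and the particular sizes H, W, C here are my own illustrative choices), one can verify both that it is a bijection onto 0..H*W*C-1 and that it agrees with C-order (memory-layout) flattening:

```python
from itertools import product

import numpy as np

H, W, C = 3, 4, 5

# The row-major bijection from (h, w, c) triples into 0..H*W*C-1,
# as written out in the thread:
rho = lambda h, w, c: h * W * C + w * C + c

triples = list(product(range(H), range(W), range(C)))

# rho hits every target index exactly once, so it is a bijection.
assert sorted(rho(h, w, c) for h, w, c in triples) == list(range(H * W * C))

# It coincides with NumPy's default C-order reshape, i.e. exactly
# the "memory layout order" bunching under discussion.
A = np.arange(H * W * C).reshape(H, W, C)
flat = A.reshape(H * W * C)
assert all(flat[rho(h, w, c)] == A[h, w, c] for h, w, c in triples)
```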
I really like this (not surprisingly..) the mapping from triples to features depends on their types and we don't need to talk about the particular lengths.
So it sounds like everyone is fairly happy with A_\rho, where \rho is a bijection from ind(S) to ind(T) where S \subseteq shape(A) and T is orthogonal to shape(A) \ S. (Orthogonal was introduced in a recent version and just means they have no index names in common.) There are two ways of writing \rho inline:
and
Should square brackets around A be required or optional?
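The side conditions just stated can be sketched as executable checks (the dict-of-sizes shape encoding and the function name `is_valid_reshape` are my own assumptions for illustration): S must lie inside shape(A), T must be orthogonal to shape(A) \ S, and \rho must be a bijection from ind(S) to ind(T):

```python
from itertools import product

def is_valid_reshape(shape_A, S, T, rho):
    """Sketch of the side conditions for A_rho discussed above.
    Shapes are dicts mapping axis name -> size (index set 0..n-1)."""
    # S must be a sub-shape of shape(A), with matching sizes.
    if any(shape_A.get(ax) != n for ax, n in S.items()):
        return False
    # T must be orthogonal to shape(A) \ S: no shared axis names.
    if (set(shape_A) - set(S)) & set(T):
        return False
    # rho must be a bijection from ind(S) to ind(T).
    ind_S = list(product(*(range(n) for n in S.values())))
    ind_T = set(product(*(range(n) for n in T.values())))
    image = [rho(idx) for idx in ind_S]
    return len(set(image)) == len(ind_S) and set(image) == ind_T

# Bunching (height, width) into a single feature axis of size 6,
# using the row-major bijection on (h, w) pairs:
ok = is_valid_reshape(
    {'height': 2, 'width': 3, 'channel': 4},
    {'height': 2, 'width': 3},
    {'feature': 6},
    lambda hw: (hw[0] * 3 + hw[1],),
)
assert ok
```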
Yeah the first group is underspecified by just declaring the type signature; the second group is fully specified since the mapping function is written out. I'm happy with this proposal.
This is nice. I think we can drop the brackets as well? This makes the multiheaded attention a bit cleaner.
One annoying thing is that we are 1-indexed (I assume). So the reindexing is a bit messier than this example.
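For example (sticking with the illustrative row-major bijection; the 1-indexed variant below is my guess at the messiness being referred to), 1-indexing adds an offset on every input and output:

```python
H, W, C = 2, 3, 4

# 0-indexed row-major bunching, as in the earlier lambda:
rho0 = lambda h, w, c: h * W * C + w * C + c

# The 1-indexed version needs a -1 on every input and a +1 on the
# output -- the extra clutter noted above.
rho1 = lambda h, w, c: (h - 1) * W * C + (w - 1) * C + (c - 1) + 1

assert rho0(0, 0, 0) == 0 and rho0(H - 1, W - 1, C - 1) == H * W * C - 1
assert rho1(1, 1, 1) == 1 and rho1(H, W, C) == H * W * C
```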
Wow, that was too easy. How about this: I feel slightly bothered that this notation looks like indexing. But there is a weak analogy: in indexing, the tensor is applied to the subscript; in reshaping, the tensor is composed with the subscript, which is close. However, we are actually composing with \rho^{-1}, not \rho. It would look terrible to write the reshaping in the other direction (new index -> old index), right? How about (old index <- new index)? Or are there any notations that look less like indexing?
I like this notation of old shape -> new shape.
It's intuitive to understand. It corresponds to casting the tensor from the old type to the new type, and we are applying the casting transformation to the tensor. It is true that to apply it we compose the inverse transformation on its input indices, but that's a second-order point.
We can put it in the superscript if it is easily confused with indexing.
We can borrow the "rename" operator from relational algebra:
Admittedly I don't like it very much.
Notation for papers should be something whose meaning people can intuitively guess without needing to look it up. I think this relational operator fails that test, but the notation A_{ old -> new } or A^{ old -> new } passes it.
I like this notation of old shape -> new shape.
It's intuitive to understand. It corresponds to casting the tensor from the old type to the new type, and we are applying the casting transformation to the tensor. It is true that to apply it we compose the inverse transformation on its input indices, but that's a second-order point.
We can put it in the superscript if it is easily confused with indexing.
Agree with putting it in the superscript: the function composition is inverse to indexing as @davidweichiang pointed out, and should probably be written differently than normal indexing.
This means that we need to also change the notation of renaming a single axis.
Between superscript and subscript, I actually still like subscript better. The superscript position is useful and I think it's better to keep it open.
e7f0e3f3d5ebf529a85e320a6e1ae66e29980c8f adds a minimal definition of reshaping. How strong is the need for specifying the ordering using \mapsto?
For example, are we not happy with this way of defining max-pooling?
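To make the max-pooling question concrete (NumPy, the kernel size, and the example values are my own illustrative assumptions), pooling as reshape-then-reduce looks like:

```python
import numpy as np

# Max-pooling as bunch-then-reduce: split 'width' into
# ('width_out', 'kernel') by reshaping, then reduce over 'kernel'.
k = 2
x = np.array([3., 1., 4., 1., 5., 9.])      # width = 6
pooled = x.reshape(-1, k).max(axis=-1)      # width_out = 3
assert pooled.tolist() == [3., 4., 9.]
```

Note that only the split of one axis into two is needed here; no \mapsto-style ordering over multiple bunched axes comes into play.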
Request: we want a way to combine dimensions that doesn't rely on any assumptions on memory layout or ordering. Instead of flattening we are going to use the word "bunching".
Example: in the second-to-last layer of the MNIST CNN, (height x width x channel) => (arbitrary order hidden bunch). All that matters is that the order be consistent. But it is non-meaningful.
Ideas:
Mainly wait for @boazbk
Dims that are set dimensions (as opposed to ordered dimensions) shouldn't be allowed to be indexed by numbers?