Closed: srush closed this issue 3 years ago.
I propose the following notation for merging and splitting, akin to renaming of tensor axes.
Given the following:
we write for such reshaping.
As an example, given , we'd like to perform the following reshaping:
(height x width x channel) => (arbitrary order hidden bunch)
This results in the feature dimension with type .
Or, since the index set can be anything (not restricted to 0..n), we can just let the index set be a tuple of (height, width, channel) for the feature axis. This just retains the order of the dimensions:
This results in the feature dimension with type .
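As a small executable sketch of the tuple-index-set idea (the dict representation and the sizes H, W, C are my own illustrative assumptions, not part of the proposal), a named tensor can be modeled as a mapping from index assignments to values; bunching then just re-keys entries by the triple, with no memory-layout order involved:

```python
from itertools import product

# Hypothetical sketch: a named tensor as a mapping from index
# assignments to values. Bunching (height, width, channel) into a
# single 'feature' axis re-keys each entry by its (h, w, c) triple,
# so no memory-layout ordering is ever consulted.
H, W, C = 2, 3, 4
A = {(h, w, c): float(h * 100 + w * 10 + c)
     for h, w, c in product(range(H), range(W), range(C))}

# B has one axis, 'feature', whose index set is the set of triples.
B = {('feature', (h, w, c)): v for (h, w, c), v in A.items()}

# Unbunching is the inverse re-keying; again no ordering is needed.
A_back = {idx: v for (_, idx), v in B.items()}
assert A_back == A
```

The point of the sketch is that bunching into a tuple-valued index set is invertible by construction, without choosing any bijection onto 0..n.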
This is a nice description. I guess there are two questions:
1) Do we need a separate notation with \mapsto or can you just write B{feature: } = A{height:, width: , channel: }
2) @davidweichiang doesn't like the automatic bunching by memory layout order, and I am a bit uncomfortable with tuple dimensions. There was a question of whether we could do your second example with a specific but unspecified bijection \rho and what the notation for that might be.
In any case, I think that this notation should end up looking uniform with the renaming notation.
I don't like having to think about memory layout, but the Wikipedia page I linked says that in math there's a standard order, so that made me feel less hesitant about it.
It would be good to think about what the most important use cases are. I hope that some of the use cases would be eliminated by better notation for higher-order tensors.
Technically, \rho is a bijection between ind(shape1) and ind(shape2), right? Not shape1 and shape2 (which would be good enough for renaming but not reshaping)?
I like the idea of leaving \rho unspecified or underspecified, and having the option of defining a specific \rho if needed.
This is a nice description. I guess there are two questions:
1. Do we need a separate notation with \mapsto or can you just write B_{feature: } = A_{height:, width: , channel: }
\mapsto is just the mathematician's way of defining a lambda (I try to make this uniform with renaming). I like your suggestion of B_{feature: } = A_{height:, width: , channel: }.
2. @davidweichiang doesn't like the automatic bunching by memory layout order, and I am a bit uncomfortable with tuple dimensions. There was a question of whether we could do your second example with a specific but unspecified bijection \rho and what the notation for that might be.
My personal feeling towards this is that you have to choose one of these. You have to specify a bijection between the index set of the source shape and the index set of the target shape:
where F is the domain of the values of the new feature axis, and
lambda h, w, c: h*W*C + w*C + c
into F.
Technically, \rho is a bijection between ind(shape1) and ind(shape2), right? Not shape1 and shape2 (which would be good enough for renaming but not reshaping)?
I like the idea of leaving \rho unspecified or underspecified, and having the option of defining a specific \rho if needed.
Yes, I have defined \rho as a bijection between ind(S) and ind(T) (\rho as a generalized renaming operator). In specific contexts, the actual function can be underspecified: just specifying its input/output types would be enough for conveying the intent. Something like
that just specifies the type would be enough.
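As a concrete check of the row-major lambda written out earlier in the thread (NumPy and the particular sizes H, W, C here are my own illustrative choices), one can verify both that it is a bijection onto 0..H*W*C-1 and that it agrees with C-order (memory-layout) flattening:

```python
from itertools import product

import numpy as np

H, W, C = 3, 4, 5

# The row-major bijection from (h, w, c) triples into 0..H*W*C-1,
# as written out in the thread:
rho = lambda h, w, c: h * W * C + w * C + c

triples = list(product(range(H), range(W), range(C)))

# rho hits every target index exactly once, so it is a bijection.
assert sorted(rho(h, w, c) for h, w, c in triples) == list(range(H * W * C))

# It coincides with NumPy's default C-order reshape, i.e. exactly
# the "memory layout order" bunching under discussion.
A = np.arange(H * W * C).reshape(H, W, C)
flat = A.reshape(H * W * C)
assert all(flat[rho(h, w, c)] == A[h, w, c] for h, w, c in triples)
```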
I really like this (not surprisingly..) the mapping from triples to features depends on their types and we don't need to talk about the particular lengths.
So it sounds like everyone is fairly happy with A_\rho, where \rho is a bijection from ind(S) to ind(T) where S \subseteq shape(A) and T is orthogonal to shape(A) \ S. (Orthogonal was introduced in a recent version and just means they have no index names in common.) There are two ways of writing \rho inline:
and
Should square brackets around A be required or optional?
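The side conditions just stated can be sketched as executable checks (the dict-of-sizes shape encoding and the function name `is_valid_reshape` are my own assumptions for illustration): S must lie inside shape(A), T must be orthogonal to shape(A) \ S, and \rho must be a bijection from ind(S) to ind(T):

```python
from itertools import product

def is_valid_reshape(shape_A, S, T, rho):
    """Sketch of the side conditions for A_rho discussed above.
    Shapes are dicts mapping axis name -> size (index set 0..n-1)."""
    # S must be a sub-shape of shape(A), with matching sizes.
    if any(shape_A.get(ax) != n for ax, n in S.items()):
        return False
    # T must be orthogonal to shape(A) \ S: no shared axis names.
    if (set(shape_A) - set(S)) & set(T):
        return False
    # rho must be a bijection from ind(S) to ind(T).
    ind_S = list(product(*(range(n) for n in S.values())))
    ind_T = set(product(*(range(n) for n in T.values())))
    image = [rho(idx) for idx in ind_S]
    return len(set(image)) == len(ind_S) and set(image) == ind_T

# Bunching (height, width) into a single feature axis of size 6,
# using the row-major bijection on (h, w) pairs:
ok = is_valid_reshape(
    {'height': 2, 'width': 3, 'channel': 4},
    {'height': 2, 'width': 3},
    {'feature': 6},
    lambda hw: (hw[0] * 3 + hw[1],),
)
assert ok
```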
Yeah the first group is underspecified by just declaring the type signature; the second group is fully specified since the mapping function is written out. I'm happy with this proposal.
This is nice. I think we can drop the brackets as well? This makes the multiheaded attention a bit cleaner.
One annoying thing is that we are 1-indexed (I assume). So the reindexing is a bit messier than this example.
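For example (sticking with the illustrative row-major bijection; the 1-indexed variant below is my guess at the messiness being referred to), 1-indexing adds an offset on every input and output:

```python
H, W, C = 2, 3, 4

# 0-indexed row-major bunching, as in the earlier lambda:
rho0 = lambda h, w, c: h * W * C + w * C + c

# The 1-indexed version needs a -1 on every input and a +1 on the
# output -- the extra clutter noted above.
rho1 = lambda h, w, c: (h - 1) * W * C + (w - 1) * C + (c - 1) + 1

assert rho0(0, 0, 0) == 0 and rho0(H - 1, W - 1, C - 1) == H * W * C - 1
assert rho1(1, 1, 1) == 1 and rho1(H, W, C) == H * W * C
```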
Wow, that was too easy. How about this: I feel slightly bothered that this notation looks like indexing. But there is a weak analogy: in indexing, the tensor is applied to the subscript; in reshaping, the tensor is composed with the subscript, which is close. However, we are actually composing with \rho^{-1}, not \rho. It would look terrible to write the reshaping in the other direction (new index -> old index), right? How about (old index <- new index)? Or are there any notations that look less like indexing?
I like this notation of old shape -> new shape.
It's intuitive to understand. It corresponds to casting the tensor from the old type to the new type, and we are applying the casting transformation to the tensor. It is true that to apply it we compose the inverse transformation on its input indices, but that's a second-order point.
We can put it in the superscript if it is easily confused with indexing.
We can borrow the "rename" operator from relational algebra:
Admittedly I don't like it very much.
Notation for papers should be something whose meaning people can intuitively guess without needing to look it up. I think this relational operator fails that test, but the notation A_{ old -> new } or A^{ old -> new } passes it.
I like this notation of old shape -> new shape.
It's intuitive to understand. It corresponds to casting the tensor from the old type to the new type, and we are applying the casting transformation to the tensor. It is true that to apply it we compose the inverse transformation on its input indices, but that's a second-order point.
We can put it in the superscript if it is easily confused with indexing.
Agree with putting it in the superscript: the function composition is inverse to indexing as @davidweichiang pointed out, and should probably be written differently than normal indexing.
This means that we need to also change the notation of renaming a single axis.
Between superscript and subscript, I actually still like subscript better. The superscript position is useful and I think it's better to keep it open.
e7f0e3f3d5ebf529a85e320a6e1ae66e29980c8f adds a minimal definition of reshaping. How strong is the need for specifying the ordering using \mapsto?
For example, are we not happy with this way of defining max-pooling?
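To make the max-pooling question concrete (NumPy, the kernel size, and the example values are my own illustrative assumptions), pooling as reshape-then-reduce looks like:

```python
import numpy as np

# Max-pooling as bunch-then-reduce: split 'width' into
# ('width_out', 'kernel') by reshaping, then reduce over 'kernel'.
k = 2
x = np.array([3., 1., 4., 1., 5., 9.])      # width = 6
pooled = x.reshape(-1, k).max(axis=-1)      # width_out = 3
assert pooled.tolist() == [3., 4., 9.]
```

Note that only the split of one axis into two is needed here; no \mapsto-style ordering over multiple bunched axes comes into play.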
Request: we want a way to combine dimensions that doesn't rely on any assumptions on memory layout or ordering. Instead of flattening we are going to use the word "bunching".
Example: in the second-to-last layer of the MNIST CNN, (height x width x channel) => (arbitrary order hidden bunch). All that matters is that the order be consistent. But it is non-meaningful.
Ideas:
Mainly wait for @boazbk
Dims that are set dimensions (as opposed to ordered dimensions) shouldn't be allowed to be indexed by numbers?