Closed — srush closed this issue 3 years ago
I want to keep it clear that "axis under operator" means "do the operator to the axis" (not necessarily "do the operator, then sum over the axis"). So if the two operations are merged, I think the symbol chosen should not have a strong association with elementwise product. Maybe \odot is weird enough to not have that association? I'm not sure.
From @mjpost, which turns out to be relevant: https://twitter.com/mjpost/status/1340153497107505152
Does "min X" with no subscript mean "take the min over all elements of X" or "take the trivial elementwise min of X and nothing"?
If min with no name means "global min over all axes", then that would imply that dot with no name means "global dot over all the axes". So that's the opposite of Hadamard. That might be enough ambiguity to justify the circle-dot operator.
Yeah. It’s not the most logical for “no axes” to mean “all the axes” but maybe it’s the pragmatic choice because NumPy does it that way and because “do the operation on zero axes” is usually just the identity function.
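For reference, NumPy already distinguishes "no axis argument" (reduce over all axes) from an explicitly empty set of axes (reduce over none, i.e. the identity). A small illustration using stock NumPy:

```python
import numpy as np

A = np.array([[3, 1],
              [4, 1]])

# No axis argument: NumPy's convention is "reduce over ALL axes".
assert A.min() == 1

# Explicit empty tuple of axes: reduce over NO axes -- the identity.
assert np.array_equal(A.min(axis=()), A)
```

So "do the operation on zero axes" really is the identity in NumPy, and the bare call is the global reduction.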
In that case, I think A \cdot B should contract all the axes that A and B have in common (what @ctongfei calls “natural contraction”) but not the “dangling” axes. But I’m not sure about that.
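A minimal sketch of what that "natural contraction" could mean operationally: contract every axis name the two operands share, keep the dangling axes. The helper name and interface here are hypothetical, just `np.einsum` under the hood:

```python
import numpy as np

def natural_contract(a, a_names, b, b_names):
    """Hypothetical 'natural contraction': sum over every axis name
    shared by the two operands; dangling axes survive."""
    shared = [n for n in a_names if n in b_names]
    out = [n for n in a_names if n not in shared] + \
          [n for n in b_names if n not in shared]
    # assign each distinct axis name an einsum letter
    letters = {n: chr(ord('a') + i)
               for i, n in enumerate(dict.fromkeys(a_names + b_names))}
    spec = (''.join(letters[n] for n in a_names) + ',' +
            ''.join(letters[n] for n in b_names) + '->' +
            ''.join(letters[n] for n in out))
    return np.einsum(spec, a, b), out

A = np.random.rand(2, 3)   # axes ("i", "j")
B = np.random.rand(3, 4)   # axes ("j", "k")
M, names = natural_contract(A, ["i", "j"], B, ["j", "k"])
assert names == ["i", "k"]
assert np.allclose(M, A @ B)  # shared axis "j" contracted away
```

With no shared names this degenerates to a pure tensor product, which is exactly the corner case being debated.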
I think that is correct, if we keep the dot notation, but it should only be used as a corner case. Many of our examples could be written this way (natural contraction / type inference). However, documenting the contraction explicitly is a benefit of our system.
Not to complicate it further, but there are some examples of other cases. I think I actually like these better. Maybe we should ban "min"? I don't love its type being independent of dimension. You should have to rename down first.
Tensordot, for instance, would do a tensor product (0 contractions) given an empty axes tuple () here:
https://numpy.org/doc/stable/reference/generated/numpy.tensordot.html
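Concretely, `np.tensordot` with `axes=0` contracts nothing and returns the outer (tensor) product, with the ranks adding:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(4)

# axes=0: contract zero axes -- the result is the outer (tensor) product
outer = np.tensordot(a, b, axes=0)
assert outer.shape == (2, 3, 4)
assert np.allclose(outer, a[:, :, None] * b[None, None, :])
```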
Sympy separates these operations entirely.
https://docs.sympy.org/latest/modules/tensor/array.html
Their style is more circle dot (though they write it as circle times), followed by an explicit contract.
So, if we went in the other direction, we would say that \min A, \max A, \sum A, where A is a tensor, do nothing (\min A = \max A = \sum A = A). I do kind of like the fact that under this interpretation, these expressions can be read in two different ways (either as elementwise min/max/sum or as reduction min/max/sum over zero axes) and it turns out the same either way.
OK, I just made two competing PRs for this.
Let me clarify:
If A \cdot B means "contract all of the shared axes", then many of the examples (for instance, attention) could be written this way without an explicit subscript. Do we want that?
For most of our operations, the operator's input type can only contract a fixed number of axes, and only the ones named in the type. If we have a \min A
that means R^{...} -> R, then that worries me a bit. I don't understand what that means in our system. Does it contract only the known axes, or also others that were not specified (e.g. "batch")?
That's a good point. I retract that comment (I guess I meant broadcasting, but I agree that is different).
OK, I see, and agree that operators that can work on all axes would make it more difficult to write equations that work correctly with things like minibatching.
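The minibatching hazard is easy to show: an "over all axes" reduction silently collapses a batch axis that the equation's author never mentioned, while naming the axis keeps the equation batch-safe. A small NumPy illustration:

```python
import numpy as np

# shape (batch, feature)
x = np.array([[2., 5.],
              [7., 1.]])

# "min over all axes" contracts the batch axis too
global_min = x.min()
assert global_min == 1.0

# naming the reduced axis leaves the batch axis intact
per_example = x.min(axis=1)          # shape (batch,)
assert np.array_equal(per_example, np.array([2., 1.]))
```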
https://twitter.com/yeewhye/status/1340031802212311041
"Can’t we do say A B for what you call circle dot, and A j B for what you use the dot (over j) for? Since one can contract multiple dimensions at same time, say A *{jk} B, one can also contact 0 dimensions (A * B)?"
From @ywteh