Open · eaplatanios opened 5 years ago
Thanks for the write-up! I'll tag #322 and the `Numeric.Magnitude` and `AdditiveArithmetic` discussions on the evolution forum so that people looking at this issue can also get some context from those threads.
Yeah, everything seems like a good direction for the standard library up until the reduction example, which I think is a great motivating example for having library-customizable conformance derivation sooner. Let's continue the two Swift Evolution discussions.
By the way, reading the `AdditiveArithmetic` discussions, I tend to agree with having a separate protocol called `PointwiseAdditive` which has the same requirements but is derived differently. In this case, memberwise automatic derivation kicks in only for `PointwiseAdditive`, and this fits nicely with the "pointwise" semantics, as "pointwise" sort of implies "memberwise".
Regarding reductions, I believe Swift-side type class derivation support would be amazing and would definitely be an elegant solution to this.
FWIW, I'd love to see a bigger discussion of this, because I tend to agree that we should be using something different from the core language protocols for our layer abstractions.
Among other things, we have multiprecision floating point to deal with in our optimizers, and the core protocols really aren't designed to deal with that. I'd love to see a discussion about these topics on the mailing list...
Happy to start with a discussion on the mailing list going forward!
@rxwei @dan-zheng I wasn't sure where to put this, but I believe an issue here is a good place to collect our thoughts and comments. My initial thoughts are:
Pointwise Multiplicative
I have a couple of comments about `PointwiseMultiplicative`:

- `AdditiveArithmetic` defines `-` and `-=`, so I believe we should define `/` and `/=` for `PointwiseMultiplicative`, thus enabling efficient division and making it dual to `AdditiveArithmetic`. It may not be very formal, but given that most of the math-related protocols are not, and are more geared towards practicality, I think this is fine. Also, for our purposes these protocols are used over aggregates of tensors, where `/` and `/=` can be defined, so this change should be fine for our purposes. What do you think? (A sketch follows this list.)
- We could also rename `PointwiseMultiplicative` to `MultiplicativeArithmetic`, or rename `AdditiveArithmetic` to `PointwiseAdditive`. I personally prefer the latter, since it will also allow for consistency with e.g. `PointwiseComparative`, but I am not sure how that would go with the Swift Evolution process.
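To make the division point concrete, here is a minimal sketch (not the actual library API), assuming the existing `one`/`reciprocal`/`.*` requirements of `PointwiseMultiplicative`; since multiplication is spelled `.*`, the division operators would presumably be spelled `./` and `./=`:

```swift
import TensorFlow

// Declare the pointwise division operators, assuming the library does
// not already declare them.
infix operator ./: MultiplicationPrecedence
infix operator ./=: AssignmentPrecedence

extension PointwiseMultiplicative {
  /// Pointwise division, dual to `-` on `AdditiveArithmetic`,
  /// defaulted in terms of `reciprocal`.
  public static func ./ (lhs: Self, rhs: Self) -> Self {
    lhs .* rhs.reciprocal
  }

  /// In-place pointwise division, dual to `-=` on `AdditiveArithmetic`.
  public static func ./= (lhs: inout Self, rhs: Self) {
    lhs = lhs ./ rhs
  }
}
```

Making `./` an actual protocol requirement (with the above as the default implementation) would let `Tensor` override it with a single division kernel, which is where the efficiency win comes from.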
Optimizers

In order to simplify the remaining optimizers, we need to add support for comparisons (e.g., `max`) and for computing the absolute value of tensors element-wise.

For comparisons, I believe something along the lines of the following would be great:
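(A minimal sketch, reusing the `PointwiseComparative` name mentioned above; the exact requirements are an open question.)

```swift
public protocol PointwiseComparative {
  /// Returns the element-wise maximum of `lhs` and `rhs`.
  static func max(_ lhs: Self, _ rhs: Self) -> Self

  /// Returns the element-wise minimum of `lhs` and `rhs`.
  static func min(_ lhs: Self, _ rhs: Self) -> Self
}
```

If conformances to this are derived memberwise, optimizers could apply `max`/`min` directly over entire parameter aggregates.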
I'm not sure about the absolute value, but I believe we may be able to do something like:
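(Again just a sketch with an invented name, `PointwiseAbsoluteValue`; for tensors, the conformance could simply call `abs(_:)`.)

```swift
import TensorFlow

public protocol PointwiseAbsoluteValue {
  /// Returns the element-wise absolute value of `self`.
  func absoluteValue() -> Self
}

extension Tensor: PointwiseAbsoluteValue where Scalar: SignedNumeric {
  public func absoluteValue() -> Tensor { abs(self) }
}
```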
Reductions
We need some way to perform reductions over tensor aggregates. This comes up quite a lot in machine learning. For example, we often want to know the max over all elements in an aggregate. Or, for a more practical motivating example, consider clipping gradients based on the global norm over the aggregate structure. This would require us to compute the norm of each tensor `t` in the aggregate (`norm[t]`) and then compute:

`globalNorm = sqrt(norm[t1].squared() + norm[t2].squared() + ... + norm[tN].squared())`
Say we can compute `sqrt(_:)` and `squared()` using a conformance to `ElementaryFunctions`. How do we go about the sum reduction over the aggregate?

Adding support for reductions introduces a couple of challenges. First, we would need to know the `Scalar` type of all tensors in the structure and force it to be the same for all of them. Alternatively, we could follow an approach similar to `VectorProtocol` and use `Float` for all tensors. However, in that case wouldn't we lose precision when dealing with, say, `Double` tensors (this problem also applies to `VectorProtocol`, actually, so how do you handle it there?)? We could avoid this by having a `Scalar` associated type (which would also require all layers to define a `Scalar` type -- @rxwei, you mentioned though that we want to avoid this, to potentially allow for mixed-precision training). In either case, I believe this is worth a discussion.

Also, reducing over an aggregate would require a protocol that looks something like this:
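(What follows is a guess at the shape of such a protocol; the name `Reducible` is referenced below, but the exact requirements shown are illustrative.)

```swift
import TensorFlow

public protocol Reducible {
  associatedtype Scalar: TensorFlowFloatingPoint

  /// Maps each tensor in the aggregate to a scalar-valued tensor using
  /// `transform`, then folds the results together using `combine`,
  /// starting from `initial`.
  func reduced(
    initial: Tensor<Scalar>,
    transform: (Tensor<Scalar>) -> Tensor<Scalar>,
    combine: (Tensor<Scalar>, Tensor<Scalar>) -> Tensor<Scalar>
  ) -> Tensor<Scalar>
}
```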
This seems overly complicated, so maybe we can find a better solution? One nice thing about using a `Scalar` type is that it may remove the need for a `Reducible` protocol, by allowing users to perform reductions manually using `KeyPathIterable`. For example, my current implementation for clipping by global norm looks like this:
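(The snippet below is a sketch reconstructing that implementation, built on `KeyPathIterable`'s `recursivelyAllWritableKeyPaths(to:)`; the exact names are illustrative.)

```swift
import TensorFlow

public extension KeyPathIterable {
  /// Scales all `Tensor<Scalar>` members (recursively) so that their
  /// global norm does not exceed `clipNorm`.
  mutating func clippedByGlobalNorm<Scalar: TensorFlowFloatingPoint>(clipNorm: Scalar) {
    let keyPaths = recursivelyAllWritableKeyPaths(to: Tensor<Scalar>.self)
    // globalNorm = sqrt(sum over all tensors t of norm[t]^2).
    var globalNormSquared = Tensor<Scalar>(zeros: [])
    for kp in keyPaths {
      globalNormSquared += self[keyPath: kp].squared().sum()
    }
    let globalNorm = sqrt(globalNormSquared)
    // Scale every tensor by clipNorm / max(globalNorm, clipNorm), which
    // is a no-op when the global norm is already within the bound.
    let clipNorm = Tensor<Scalar>(clipNorm)
    for kp in keyPaths {
      self[keyPath: kp] *= clipNorm / max(globalNorm, clipNorm)
    }
  }
}
```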
Of course, it doesn't have to be defined as an extension to `KeyPathIterable`, but I use this for now because I cannot yet define it as an extension to `Layer.TangentVector`.

What are your thoughts on the above? Also, why do we call `VectorProtocol` that, instead of `VectorSpace`?