tensorflow / swift-apis

Swift for TensorFlow Deep Learning Library
Apache License 2.0

Math Protocols #325

Open eaplatanios opened 5 years ago

eaplatanios commented 5 years ago

@rxwei @dan-zheng I wasn't sure where to put this, but I believe an issue here is a good place to collect our thoughts and comments. My initial thoughts are:

Pointwise Multiplicative

I have a couple comments about PointwiseMultiplicative:

  1. Similar to how AdditiveArithmetic defines - and -=, I believe we should define / and /= for PointwiseMultiplicative, thus enabling efficient division and making it dual to AdditiveArithmetic (see the sketch after this list). It may not be very formal, but given that most of the math-related protocols are not, and are geared more towards practicality, I think this is fine. Also, these protocols are used over aggregates of tensors, where / and /= can be defined, so this change should work for our purposes. What do you think?
  2. Following from point 1, if we aim for consistency with the standard library we may want to call this MultiplicativeArithmetic, or rename AdditiveArithmetic to PointwiseAdditive. I personally prefer the latter, since it would also allow for consistency with e.g. PointwiseComparable, but I'm not sure how that would go through the Swift Evolution process.
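
For concreteness, here is a rough sketch of the direction point 1 suggests. The protocol below is a simplified stand-in (hence the Sketch suffix) that uses plain * rather than the library's pointwise operator spelling, and the division requirements with their default implementations in terms of reciprocal are my assumptions, not existing API:

public protocol PointwiseMultiplicativeSketch {
  static var one: Self { get }
  var reciprocal: Self { get }
  static func * (lhs: Self, rhs: Self) -> Self
  static func *= (lhs: inout Self, rhs: Self)

  // Proposed additions, mirroring - and -= on AdditiveArithmetic.
  static func / (lhs: Self, rhs: Self) -> Self
  static func /= (lhs: inout Self, rhs: Self)
}

extension PointwiseMultiplicativeSketch {
  // Defaults in terms of the reciprocal; a concrete tensor type would
  // presumably override these with a native elementwise division.
  public static func / (lhs: Self, rhs: Self) -> Self { lhs * rhs.reciprocal }
  public static func /= (lhs: inout Self, rhs: Self) { lhs = lhs / rhs }
}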

Optimizers

In order to simplify the remaining optimizers we need to add support for comparisons (e.g., max) and for computing the absolute value of tensors element-wise.

For comparisons, I believe something along the lines of the following would be great:

public protocol PointwiseEquatable {
  associatedtype Boolean
  static func == (lhs: Self, rhs: Self) -> Boolean
}

public protocol PointwiseComparable: PointwiseEquatable {
  static func < (lhs: Self, rhs: Self) -> Boolean
  static func <= (lhs: Self, rhs: Self) -> Boolean
  static func > (lhs: Self, rhs: Self) -> Boolean
  static func >= (lhs: Self, rhs: Self) -> Boolean
  static func max(lhs: Self, rhs: Self) -> Self
  static func min(lhs: Self, rhs: Self) -> Self
}

I'm not sure about the absolute value, but I believe we may be able to do something like:

public protocol PointwiseMagnitude { // ???
  func abs() -> Self
}
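
As a rough illustration of how these two protocols would be used together in an optimizer, here is a sketch (not the library's optimizer API) of an AdaMax-style infinity-norm accumulator; the decay-factor scaling is omitted because neither protocol provides scalar scaling:

// AdaMax tracks u_t = max(beta2 * u_{t-1}, |g_t|) over the whole tangent
// vector. With pointwise max and abs this needs no per-tensor key path loop;
// the beta2 scaling is elided since these protocols only cover comparison
// and magnitude.
func updatedInfinityNorm<T: PointwiseComparable & PointwiseMagnitude>(
  previous: T, gradient: T
) -> T {
  return T.max(lhs: previous, rhs: gradient.abs())
}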

Reductions

We need some way to perform reductions over tensor aggregates. This comes up quite a lot in machine learning. For example, we often want to know the max over all elements in an aggregate. Or, for a more practical motivating example, consider clipping gradients based on the global norm over the aggregate structure. This would require us to compute the norm of each tensor t in the aggregate (norm[t]) and then compute:

globalNorm = sqrt(sum([norm[t].squared() for t in tensors]))

Say we can compute sqrt(_:) and squared() using a conformance to ElementaryFunctions. How do we go about the sum reduction over the aggregate?

Adding support for reductions introduces a couple of challenges. First, we would need to know the Scalar type of all tensors in the structure and force it to be the same for all of them. Alternatively, we could follow an approach similar to VectorProtocol and use Float for all tensors. However, in that case wouldn't we lose precision when dealing with, say, Double tensors (this problem also applies to VectorProtocol, actually; how do you handle it there?)? We could avoid this by having a Scalar type (which would also require all layers to define a Scalar type -- @rxwei, you mentioned though that we want to avoid this to potentially allow for mixed-precision training). In either case, I believe this is worth a discussion.
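
To make the precision concern concrete, here is a minimal, self-contained illustration in plain Swift (no tensor types involved): a value stored as Float carries only about seven decimal digits, and widening it back to Double does not recover what was lost.

let rateAsFloat: Float = 1e-7
let rateAsDouble: Double = 1e-7
// Widening the Float does not recover the lost digits.
print(Double(rateAsFloat))  // prints a value close to, but not exactly, 1e-07
print(rateAsDouble)         // prints 1e-07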

Also, reducing over an aggregate would require a protocol that looks something like this:

public protocol Reducible {
  associatedtype Scalar

  // NOTE: Swift does not allow a requirement to add constraints such as
  // `where Scalar: AdditiveArithmetic` on `Self`; the clauses below are written
  // inline only to indicate which capability each reduction needs.
  func sum() -> Scalar where Scalar: AdditiveArithmetic
  func mean() -> Scalar where Scalar: AdditiveArithmetic
  func product() -> Scalar where Scalar: PointwiseMultiplicative

  // ... more reductions such as comparison-based reductions.

  // This needs to be used by `_meanHelper()`, for example.
  func count() -> Scalar

  // The following are needed for applying the reduction across the reduced members.
  static func _sumHelper(_ x: Scalar, _ y: Scalar) -> Scalar where Scalar: AdditiveArithmetic
  static func _meanHelper(_ x: Scalar, _ y: Scalar) -> Scalar where Scalar: AdditiveArithmetic
  static func _productHelper(_ x: Scalar, _ y: Scalar) -> Scalar where Scalar: PointwiseMultiplicative
}

This seems overly complicated, so maybe we can find a better solution? One nice thing about using a Scalar type is that it may remove the need for a Reducible protocol by allowing users to perform reductions manually using KeyPathIterable. For example, my current implementation of clipping by global norm looks like this:

extension KeyPathIterable {
  public mutating func clipByGlobalNorm<Scalar: TensorFlowFloatingPoint>(clipNorm: Scalar) {
    let clipNorm = Tensor<Scalar>(clipNorm)
    // Accumulate the sum of squared norms of all `Tensor<Scalar>` members.
    var globalNorm = Tensor<Scalar>(zeros: [])
    for kp in self.recursivelyAllWritableKeyPaths(to: Tensor<Scalar>.self) {
      globalNorm += self[keyPath: kp].squared().sum()
    }
    globalNorm = sqrt(globalNorm)
    // Scale every member so that the global norm does not exceed `clipNorm`.
    for kp in self.recursivelyAllWritableKeyPaths(to: Tensor<Scalar>.self) {
      self[keyPath: kp] *= clipNorm / max(globalNorm, clipNorm)
    }
  }
}

Of course it doesn't have to be defined as an extension to KeyPathIterable, but I use this for now because I cannot yet define it as an extension to Layer.TangentVector.

What are your thoughts on the above? Also, why do we call VectorProtocol that instead of VectorSpace?

rxwei commented 5 years ago

Thanks for the write-up! I'll tag #322 and the Numeric.Magnitude and AdditiveArithmetic discussions on the Evolution forum so that people looking at this issue can also get some context from those threads.

Yeah, everything seems like a good direction for the standard library up until the reduction example, which I think is a great motivating example for having library-customizable conformance derivation sooner. Let's continue the two Swift Evolution discussions.

eaplatanios commented 5 years ago

By the way, reading the AdditiveArithmetic discussions, I tend to agree with having a separate protocol called PointwiseAdditive which has the same requirements but is derived differently. In this case, memberwise automatic derivation kicks in only for PointwiseAdditive, and this fits nicely with the "pointwise" semantics, since "pointwise" sort of implies "memberwise".
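
For reference, a rough sketch of what such a protocol could look like; the requirements simply mirror AdditiveArithmetic, and the name and spelling here are assumptions rather than an existing API:

public protocol PointwiseAdditive {
  // Same requirements as AdditiveArithmetic, but conformances would be
  // derived memberwise (i.e. pointwise over an aggregate's stored properties).
  static var zero: Self { get }
  static func + (lhs: Self, rhs: Self) -> Self
  static func += (lhs: inout Self, rhs: Self)
  static func - (lhs: Self, rhs: Self) -> Self
  static func -= (lhs: inout Self, rhs: Self)
}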

Regarding reductions, I believe Swift-side type class derivation support would be amazing and would definitely be an elegant solution to this.

lattner commented 5 years ago

FWIW, I'd love to see a bigger discussion of this, because I tend to agree that we should be using something different than the core language protocols for our layer abstractions.

Among other things, we have multiprecision floating point to deal with in our optimizers, and the core protocols really aren't designed to deal with that. I'd love to see a discussion about these topics on the mailing list...

rxwei commented 5 years ago

Happy to start with a discussion on the mailing list going forward!