tensorflow / swift-apis

Swift for TensorFlow Deep Learning Library
Apache License 2.0

Q: How to compute a gradient w.r.t. selected parameters? #691

Open awav opened 4 years ago

awav commented 4 years ago

Hello all,

I have a very simple question: how do I compute gradients w.r.t. a subset of parameters? Let's say I have a model like:

struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + c)
    }
}

How do I compute a gradient w.r.t. only a and b?

t-ae commented 4 years ago

There are two options.

  1. If you want to disable the gradient for c entirely, mark it @noDerivative:
struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    // Excluded from differentiation, so it is not part of TangentVector.
    @noDerivative var c = Tensor<Float>(3.0)

    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + c)
    }
}

Then the Model.TangentVector doesn't have c.

  2. If you want to disable it only in callAsFunction:
struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // c is treated as a constant here, so no derivative flows into it.
        return exp(log(b * a * input) + withoutDerivative(at: c))
    }
}

Then the Model.TangentVector has c, but its value is 0.
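The second option can be checked directly by asking for a gradient and inspecting each component. This is a minimal sketch, assuming a Swift for TensorFlow toolchain where `import TensorFlow` is available and where the `Layer` requirement has to be marked `@differentiable` explicitly; the input value is arbitrary.

```swift
import TensorFlow

struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // c takes part in the forward pass but is a constant for differentiation.
        return exp(log(b * a * input) + withoutDerivative(at: c))
    }
}

let model = Model()
let input = Tensor<Float>(1.0)  // arbitrary example input

// The result is a Model.TangentVector with fields a, b, and c.
let grads = gradient(at: model) { model in model(input) }
print("da:", grads.a, "db:", grads.b, "dc:", grads.c)
```

The field `grads.c` still exists because `c` is a regular stored property; only its value is zero.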

awav commented 4 years ago

@t-ae, thanks for the examples. One more question: is there a way to do the withoutDerivative trick for a model from a third-party library? In TensorFlow it is pretty easy: I can mark variables as non-trainable.

awav commented 4 years ago

@t-ae, looking at the current solutions, I realize that models in Swift are not flexible enough for research purposes. It happens all the time that you need to train only a subset of a model's parameters. So my question becomes: "How to compute a gradient w.r.t. selected parameters without editing the model?"

t-ae commented 4 years ago

I'm sorry, but I don't have an answer.

If the target is a sublayer, you can update a subset of the parameters like this: https://github.com/t-ae/stylegan-s4tf/blob/5010c3a9e8d4de045bd17e37aa5e8ecc83b9a5c5/Sources/train/main.swift#L27-L28 https://github.com/t-ae/stylegan-s4tf/blob/5010c3a9e8d4de045bd17e37aa5e8ecc83b9a5c5/Sources/train/main.swift#L88-L89

But it doesn't work in your case.

I think we need a feature like Keras's trainable (maybe via a property wrapper?).
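The sublayer approach can be sketched as follows. This is an illustration under assumptions, not a copy of the linked code: the `Classifier` type, its `features` and `head` fields, and the data are hypothetical. The idea is that an optimizer built for the sublayer type only ever receives that sublayer and the matching component of the gradient.

```swift
import TensorFlow

// Hypothetical two-part model; only `head` will be trained.
struct Classifier: Layer {
    var features = Dense<Float>(inputSize: 4, outputSize: 8, activation: relu)
    var head = Dense<Float>(inputSize: 8, outputSize: 1)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return head(features(input))
    }
}

var model = Classifier()
// The optimizer is specialized on the sublayer, not on the whole model.
let headOptimizer = SGD(for: model.head, learningRate: 0.1)

let x = Tensor<Float>(ones: [2, 4])   // placeholder batch
let y = Tensor<Float>(zeros: [2, 1])  // placeholder targets
let featuresBefore = model.features.weight

// Gradients are still computed for the whole model ...
let grads = gradient(at: model) { model in
    meanSquaredError(predicted: model(x), expected: y)
}
// ... but only the `head` component is applied.
headOptimizer.update(&model.head, along: grads.head)
```

This works at sublayer granularity only; it does not help when the parameters to freeze are individual tensors inside one layer, which is the case in the original question.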

dan-zheng commented 4 years ago

Looking at the current solutions, I realize that models in Swift are not flexible enough for research purposes. It happens all the time that you need to train only a subset of a model's parameters. So my question becomes: "How to compute a gradient w.r.t. selected parameters without editing the model?"

@awav: I'm curious why you want to "differentiate with respect to a subset of stored properties of a Differentiable-conforming type"?

Is your use case optimization (preventing unnecessary differentiation), or something else? Your use case should drive this discussion!

Every Differentiable-conforming type has a fixed TangentVector associated type.

In your snippet, Model.TangentVector is a struct with stored properties a, b, and c. This never changes.

import TensorFlow
struct Model: Layer {
  var a, b, c: Tensor<Float>

  // Compiler synthesizes:
  // struct TangentVector: Differentiable, AdditiveArithmetic, PointwiseMultiplicative, ElementaryFunctions, VectorProtocol, KeyPathIterable {
  //   var a, b, c: Tensor<Float>
  //   ...
  // }
}

If you apply a differential operator such as the gradient function to Model, you'll get back values of type Model.TangentVector. This is fixed. No type exists that represents just the "derivatives of a and b".

Then the Model.TangentVector has c, but its value is 0.

In general, zero derivatives like this are correct.

struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + withoutDerivative(at: c))
    }
}

withoutDerivative(at: c) causes c to have zero derivative. Small changes in input cause no change in c, hence a zero derivative.

dan-zheng commented 4 years ago

@t-ae's response https://github.com/tensorflow/swift-apis/issues/691#issuecomment-590707440 is great!

saeta commented 4 years ago

+1 to @t-ae's and @dan-zheng's responses. One other thing to add:

If you'd like to dynamically control whether particular parameters are updated during the optimization process, you can also "zero out" the computed derivative before updating the model's state. While this doesn't save on computation of the derivatives themselves (except on top of a lazy tensor runtime... stay tuned!), it gives you fine-grained control over operations and updates.

There are a couple of ways to spell the "zeroing-out". The most imperative might be something like:

var grads = gradient(at: model) { model in loss(model(input)) }
grads.c = .zero
optimizer.update(&model, along: grads)

but other ways of spelling this include property wrappers (as @t-ae alluded to).
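Because the zeroing happens on the gradient value rather than inside the model, it can be done from outside a model you cannot edit. The sketch below wraps it in a small key-path helper; `masking(_:zeroing:)` is a hypothetical name, not a library function, and it uses `Tensor(zerosLike:)` so that the zero keeps the shape of the tensor it replaces. It assumes a Swift for TensorFlow toolchain.

```swift
import TensorFlow

// Hypothetical helper: returns a copy of `grads` with the listed tensors zeroed.
func masking<Tangent>(
    _ grads: Tangent,
    zeroing frozen: [WritableKeyPath<Tangent, Tensor<Float>>]
) -> Tangent {
    var masked = grads
    for keyPath in frozen {
        masked[keyPath: keyPath] = Tensor(zerosLike: masked[keyPath: keyPath])
    }
    return masked
}

struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + c)
    }
}

var model = Model()
let optimizer = SGD(for: model, learningRate: 0.01)
let input = Tensor<Float>(1.0)  // arbitrary example input

let grads = gradient(at: model) { model in model(input) }
// Freeze c for this step; a and b are updated as usual.
let masked = masking(grads, zeroing: [\.c])
optimizer.update(&model, along: masked)
```

The list of key paths can change from step to step, which is one way to get behaviour close to a per-parameter "trainable" switch. The derivative for `c` is still computed and then discarded.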

To help us give more effective advice, could you describe what you're trying to do and your larger context?

dan-zheng commented 4 years ago

There are a couple of ways to spell the "zeroing-out". The most imperative might be something like:

var grads = gradient(at: model) { model in loss(model(input)) }
grads.c = .zero
optimizer.update(&model, along: grads)

Note: using grads.c = .zero is actually problematic in general and not recommended. Once per-instance zero tangent vectors are implemented, you can write:

grads.c = grads.c.zeroTangentVector // `grads.c.zeroTangentVector` has the same shape as `grads.c`

The Differentiable.zeroTangentVector instance property is more correct than the AdditiveArithmetic.zero static property in some cases (e.g. for types containing arrays and tensors).

https://github.com/tensorflow/swift-apis/issues/656#issuecomment-588080813 has more info on zeroTangentVector, if you're interested. zeroTangentVector is important for automatic differentiation correctness, too.
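The shape difference behind this note can be seen with plain tensors. The sketch below uses `Tensor(zerosLike:)` as a stand-in for a shape-preserving zero, since `zeroTangentVector` is described above as not finished yet; the `[2, 3]` shape is an arbitrary choice.

```swift
import TensorFlow

let gradientValue = Tensor<Float>(ones: [2, 3])  // stands in for some grads.c

// The static zero knows nothing about the instance, so it is a scalar.
let staticZero = Tensor<Float>.zero
// A per-instance zero keeps the shape of the value it is derived from.
let shapedZero = Tensor(zerosLike: gradientValue)

print(staticZero.shape, shapedZero.shape)
```

Code that relies on the gradient having the same shape as the parameter can misbehave with the scalar zero, which is the kind of case a per-instance zero is meant to address.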

dan-zheng commented 4 years ago

@awav: I hope your question was answered! Your use case isn't clear, so it's hard to provide a targeted suggestion. Feel free to reopen if you'd like to chat more.

awav commented 4 years ago

@awav: I hope your question was answered! Your use case isn't clear, so it's hard to provide a targeted suggestion. Feel free to reopen if you'd like to chat more.

@dan-zheng, :D I was writing an answer when you closed it. Can you re-open it?

Is your use case optimization (preventing unnecessary differentiation), or something else? Your use case should drive this discussion!

@dan-zheng, well, I can give you two examples. The first is the Gaussian process model. This type of model can have variational parameters and hyperparameters (kernel parameters). The training loss is the ELBO (evidence lower bound). It is common to train these models with natural gradients that update only the variational parameters, and to use SGD-like optimisers for the hyperparameters. So one iteration step consists of two optimiser updates. The second example is about debugging, prototyping, and building new models: sometimes you (the researcher or user) know the optimum for a subset of parameters and want to compare the losses with and without the true values. Another scenario is when you want to hold part of the model fixed to see how it affects the training procedure. This technique is often very effective; you may call it burn-in or warm-up before training.

@awav: I'm curious why you want to "differentiate with respect to a subset of stored properties of a Differentiable-conforming type"?

I'm used to that. TensorFlow and PyTorch have Module, models are normally bags of parameters, and optimisers accept a list of variables instead of differentiable objects :)

dan-zheng commented 4 years ago

Can you re-open it?

Sure! Feel free to do it yourself, since it's your unresolved issue!

well I can give you two examples.

Could you please share some reference implementations? I'm not an expert, so code will greatly help me understand your examples.

The first is the Gaussian process model. This type of model can have variational parameters and hyperparameters (kernel parameters). The training loss is the ELBO (evidence lower bound). It is common to train these models with natural gradients that update only the variational parameters, and to use SGD-like optimisers for the hyperparameters. So one iteration step consists of two optimiser updates.

"One iteration step consists of two optimizer updates" sounds expressible in Swift to me. Differential operators are higher-order functions (functional APIs) that produce T.TangentVector values. You can pass T and T.TangentVector values around freely and update T with T.TangentVector (i.e. optimize T) however you like.

Some folks have implemented non-trivial optimization algorithms in Swift using these APIs, which gives me confidence that the fundamental design is okay!
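One way to read "two optimizer updates per step" with these APIs is to compute a single Model.TangentVector and apply two masked copies of it through two optimizers. The sketch below is only an illustration: `ToyModel`, its `variational` and `kernel` fields, and the loss are hypothetical, and plain SGD with two learning rates stands in for the natural-gradient and SGD-like optimisers mentioned above.

```swift
import TensorFlow

// Hypothetical model with two parameter groups.
struct ToyModel: Layer {
    var variational = Tensor<Float>([0.5, -0.5])
    var kernel = Tensor<Float>([1.0, 2.0])

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // Placeholder scalar "loss"; a real model would compute an ELBO here.
        return (variational * input + kernel).squared().sum()
    }
}

var model = ToyModel()
let variationalOptimizer = SGD(for: model, learningRate: 0.1)
let kernelOptimizer = SGD(for: model, learningRate: 0.01)
let input = Tensor<Float>([1.0, 2.0])

let grads = gradient(at: model) { model in model(input) }

// Update 1: only the variational parameters move.
var variationalStep = grads
variationalStep.kernel = Tensor(zerosLike: grads.kernel)
variationalOptimizer.update(&model, along: variationalStep)
let kernelAfterFirstUpdate = model.kernel

// Update 2: only the kernel parameters move.
var kernelStep = grads
kernelStep.variational = Tensor(zerosLike: grads.variational)
kernelOptimizer.update(&model, along: kernelStep)
```

Here the second update reuses gradients computed before the first update. An alternating scheme would call `gradient(at:in:)` again between the two updates.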

I'm used to that. TensorFlow and PyTorch have Module, models are normally bags of parameters, and optimisers accept a list of variables instead of differentiable objects :)

Yes. Differentiable Swift's programming model is different from these frameworks, for good reason (we have types and can't be as dynamic!). I still feel it's possible to accomplish your use case (sounds like you care about derivative optimization, avoiding inefficient zero derivatives) given the current differentiation APIs in Swift. And if not, that's a great opportunity for us to brainstorm support for unsupported use cases.

The custom differentiation tutorial shows some cool techniques like derivative surgery (Differentiable.withDerivative(_:)), rematerialization instead of checkpointing, and manually chaining pullbacks. The last technique may be relevant to "one iteration step consists of two optimizer updates".
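As a small illustration of derivative surgery, the sketch below overwrites the derivative flowing back into one argument. It assumes that `withDerivative(_:)` takes a closure over an `inout` tangent value, as in the Swift for TensorFlow toolchain; the function being differentiated (a product of two floats) is an arbitrary example.

```swift
import TensorFlow

let grad = gradient(at: Float(3), Float(4)) { x, y -> Float in
    // The forward value of y is unchanged; the closure runs during the
    // backward pass and replaces the derivative with respect to y.
    let frozenY = y.withDerivative { dy in dy = 0 }
    return x * frozenY
}
print(grad)  // a tuple: (derivative w.r.t. x, derivative w.r.t. y)
```

Unlike zeroing a field of the returned gradient, this has to be written inside the function being differentiated, so it does not help with a model that cannot be edited.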

If you'd like, please share code that you don't know how to write in Swift! We can make progress on specific examples. 🙂

awav commented 4 years ago

Sure! Feel free to do it yourself, since it's your unresolved issue!

I didn't have a button to reopen the issue :)

Could you please share some reference implementations? I'm not an expert, so code will greatly help me understand your examples.

Examples of natural gradient and Adam optimisers for training an SVGP model.

Sounds like you care about derivative optimization, avoiding inefficient zero derivatives

Yes, this is a part of the concern.

The custom differentiation tutorial shows some cool techniques like derivative surgery (Differentiable.withDerivative(_:)), rematerialization instead of checkpointing, and manually chaining pullbacks. The last technique may be relevant to "one iteration step consists of two optimizer updates".

Thanks, I will try it and will give feedback soon.