awav opened this issue 4 years ago

Hello all,

I have a very simple question: how do I compute gradients w.r.t. a subset of parameters? Let's say I have a model with parameters `a`, `b`, and `c`, like the one in the snippets below. How do I compute a gradient w.r.t. only `a` and `b`?

---
There are two options.

1. Mark `c` with `@noDerivative`:

```swift
struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    @noDerivative
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + c)
    }
}
```

Then `Model.TangentVector` doesn't have `c`.

2. Use `withoutDerivative(at:)` inside `callAsFunction`:

```swift
struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + withoutDerivative(at: c))
    }
}
```

Then `Model.TangentVector` has `c`, but its value is 0.
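For option 1, you can check the shape of the tangent like this (a quick sketch, assuming the `Model` above):

```swift
// With `@noDerivative`, the synthesized `Model.TangentVector` only has
// `a` and `b` fields; `grads.c` would not even compile.
let grads = gradient(at: Model()) { m in m(Tensor<Float>([1, 2, 3])).sum() }
print(grads.a, grads.b)
```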
---

@t-ae, thanks for the examples. One more question: is there a way to do the `withoutDerivative` trick for a model from a third-party library? In TensorFlow it is pretty easy: I can change variables to non-trainable.
---

@t-ae, looking at the current solutions, I realize that models in Swift are not flexible enough for research purposes. It happens all the time that you need to learn only a subset of the model's parameters. So my question becomes: "How do I compute a gradient w.r.t. selected parameters without editing the model?"
---

I'm sorry, but I don't have an answer.

If the target is a sublayer, you can update a parameter subset like this:
https://github.com/t-ae/stylegan-s4tf/blob/5010c3a9e8d4de045bd17e37aa5e8ecc83b9a5c5/Sources/train/main.swift#L27-L28
https://github.com/t-ae/stylegan-s4tf/blob/5010c3a9e8d4de045bd17e37aa5e8ecc83b9a5c5/Sources/train/main.swift#L88-L89
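The pattern in those links is roughly the following (a sketch with hypothetical names: `model.subLayer`, and a `subOptimizer` instantiated for the sublayer's type):

```swift
// Compute full-model gradients, then update only one sublayer using an
// optimizer built for that sublayer's type.
let grads = gradient(at: model) { model in loss(model(input)) }
subOptimizer.update(&model.subLayer, along: grads.subLayer)
```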
But it doesn't work in your case.
I think we need a feature like Keras's `trainable` flag (maybe via a property wrapper?).
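For what it's worth, a bare-bones version of that idea might look like the following (just a sketch; it is not wired into the differentiation system):

```swift
// A property wrapper carrying a Keras-style `trainable` flag next to a value.
// Making `TangentVector` synthesis respect the flag would need compiler support.
@propertyWrapper
struct Param<Value> {
    var wrappedValue: Value
    var isTrainable: Bool = true
}
```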
---

> looking at the current solutions, I realize that models in Swift are not flexible enough for research purposes. It happens all the time that you need to learn only a subset of the model's parameters. So my question becomes: "How do I compute a gradient w.r.t. selected parameters without editing the model?"
@awav: I'm curious why you want to "differentiate with respect to a subset of stored properties of a `Differentiable`-conforming type"?

Is your use case optimization (preventing unnecessary differentiation), or something else? Your use case should drive this discussion!
Every `Differentiable`-conforming type has a fixed `TangentVector` associated type. In your snippet, `Model.TangentVector` is a struct with `a`, `b`, and `c` stored properties; this never changes.
```swift
import TensorFlow

struct Model: Layer {
    var a, b, c: Tensor<Float>
    // Compiler synthesizes:
    // struct TangentVector: Differentiable, AdditiveArithmetic, PointwiseMultiplicative,
    //                       ElementaryFunctions, VectorProtocol, KeyPathIterable {
    //     var a, b, c: Tensor<Float>
    //     ...
    // }
    // ...
}
```
If you apply a differential operator like `gradient` to `Model`, you'll get back values of type `Model.TangentVector`. This is fixed. No type exists representing just "derivatives of `a` and `b`".
> Then `Model.TangentVector` has `c`, but its value is 0.
In general, zero derivatives like this are correct.
```swift
struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return exp(log(b * a * input) + withoutDerivative(at: c))
    }
}
```
`withoutDerivative(at: c)` causes `c` to have a zero derivative. Small changes in `input` cause no change in `c`, hence a zero derivative.
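A quick way to see this, assuming the `Model` just above (a sketch):

```swift
// Differentiating through the model: the tangent entry for `c` comes back
// as an exact zero, while a and b get real derivatives.
let grads = gradient(at: Model()) { m in m(Tensor<Float>([1, 2])).sum() }
// grads.c == Tensor<Float>(0)
```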
@t-ae's response (https://github.com/tensorflow/swift-apis/issues/691#issuecomment-590707440) is great! To summarize:

- Use `@noDerivative` on stored properties so they don't appear in `TangentVector`: write `@noDerivative var c` if you never need derivatives with respect to `c`.
- Use `withoutDerivative(at:)` to prevent differentiation through specific values:
```swift
import TensorFlow

struct Model: Layer {
    var a = Tensor<Float>(1.0)
    var b = Tensor<Float>(2.0)
    var c = Tensor<Float>(3.0)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let c = withoutDerivative(at: self.c) // here
        return exp(log(b * a * input) + c)
    }
}
```
`let c = withoutDerivative(at: self.c)` tells activity analysis not to mark `c` as active and not to propagate activity through `c`. This means the compiler knows that `c` doesn't need a derivative value and does less work. I think this approximately addresses (or is set up to address) the optimization concerns.

---

+1 to @t-ae's and @dan-zheng's responses. One other thing to add:
If you'd like to dynamically control whether particular parameters are updated during the optimization process, you can also "zero out" the computed derivative before updating the model's state. While this doesn't save on the computation of the derivatives themselves (except on top of a lazy tensor runtime... stay tuned!), it allows you fine-grained control over operations & updates.
There are a couple ways to spell the "zeroing-out". The most imperative might be something like:
```swift
var grads = gradient(at: model) { model in loss(model(input)) }
grads.c = .zero
optimizer.update(&model, along: grads)
```
but other ways of spelling this include property wrappers (as @t-ae alluded to).
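For instance, the imperative version generalizes to freezing several parameters at once via key paths (a sketch; the `frozen` list is hypothetical):

```swift
// Zero out the tangent entries for every frozen parameter before the update.
let frozen: [WritableKeyPath<Model.TangentVector, Tensor<Float>>] = [\.c]
var grads = gradient(at: model) { model in loss(model(input)) }
for kp in frozen { grads[keyPath: kp] = .zero }
optimizer.update(&model, along: grads)
```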
In order to provide more effective help, could you perhaps describe what you're trying to do // your larger context?
---

> There are a couple ways to spell the "zeroing-out". The most imperative might be something like:
>
> ```swift
> var grads = gradient(at: model) { model in loss(model(input)) }
> grads.c = .zero
> optimizer.update(&model, along: grads)
> ```
Note: using `grads.c = .zero` is actually problematic in general and not recommended. When per-instance zero tangent vectors are done, you can write:

```swift
grads.c = grads.c.zeroTangentVector // `zeroTangentVector` has the same shape as `grads.c`
```
The `Differentiable.zeroTangentVector` instance property is more correct than the `AdditiveArithmetic.zero` static property in some cases (e.g. for types containing arrays and tensors).
https://github.com/tensorflow/swift-apis/issues/656#issuecomment-588080813 has more info on `zeroTangentVector`, if you're interested. `zeroTangentVector` is important for automatic differentiation correctness, too.
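To illustrate why the static `zero` can be wrong for tensors (a sketch):

```swift
// `Tensor<Float>.zero` is a scalar (shape []), while the correct tangent for
// a [2, 3]-shaped parameter should itself have shape [2, 3].
let scalarZero = Tensor<Float>.zero            // shape []
let shapedZero = Tensor<Float>(zeros: [2, 3])  // shape [2, 3]
```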
---

@awav: I hope your question was answered! Your use case isn't clear, so it's hard to provide a targeted suggestion. Feel free to reopen if you'd like to chat more.
---

@dan-zheng, :D I was writing an answer when you closed it. Can you re-open it?
---

> Is your use case optimization (preventing unnecessary differentiation), or something else? Your use case should drive this discussion!
@dan-zheng, well, I can give you two examples. The first is the Gaussian process model. This type of model can have variational parameters and hyperparameters (kernel parameters), and the training loss is the ELBO (evidence lower bound). It is common to train such models with natural gradients for the variational parameters and SGD-like optimisers for the hyperparameters, so one iteration step consists of two optimiser updates.

The second example is about debugging/prototyping/making new models: sometimes you (the researcher/user) know the optimum for a subset of parameters, and you want to check the difference between losses with and without the true values. Another scenario is when you want to hold a part of the model fixed to see how it affects the training procedure. Often this technique is very effective; you might call it burn-in or warm-up before training.
> @awav: I'm curious why you want to "differentiate with respect to a subset of stored properties of a `Differentiable`-conforming type"?
I'm used to that. TensorFlow and PyTorch have `Module`, and normally models are bags of parameters and optimisers accept a list of variables instead of differentiable objects :)
---

> Can you re-open it?

Sure! Feel free to do it yourself, since it's your unresolved issue!
> well, I can give you two examples.

Could you please share some reference implementations? I'm not an expert, so code will greatly help me understand your examples.
> The first is the Gaussian process model. This type of model can have variational parameters and hyperparameters (kernel parameters), and the training loss is the ELBO (evidence lower bound). It is common to train such models with natural gradients for the variational parameters and SGD-like optimisers for the hyperparameters, so one iteration step consists of two optimiser updates.
To me, "one iteration step consists of two optimizer updates" sounds expressible to me in Swift. Differential operators are higher-order functions (functional APIs) that produce T.TangentVector
values. You can pass T
and T.TangentVector
values around freely and update T
with T.TangentVector
(i.e. optimize T
) however you like.
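A rough sketch of what that could look like, assuming the model is split into two differentiable sub-structures (`variational`, `kernel`, `elbo`, and the two optimizers are all hypothetical names):

```swift
// One training step that applies two optimizers to different parameter
// groups of the same model, from a single gradient computation.
let grads = gradient(at: model) { model in elbo(model, batch) }
naturalGradient.update(&model.variational, along: grads.variational)
adam.update(&model.kernel, along: grads.kernel)
```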
Some folks have implemented non-trivial optimization algorithms in Swift using these APIs, which gives me confidence that the fundamental design is okay!
> I'm used to that. TensorFlow and PyTorch have `Module`, and normally models are bags of parameters and optimisers accept a list of variables instead of differentiable objects :)
Yes. Differentiable Swift's programming model is different from these frameworks, for good reason (we have types and can't be as dynamic!). I still feel it's possible to roughly accomplish your use case (it sounds like you care about derivative optimization, avoiding inefficient zero derivatives) given the current differentiation APIs in Swift. And if not, that's a great opportunity for us to brainstorm support for unsupported use cases.
The custom differentiation tutorial shows some cool techniques like derivative surgery (`Differentiable.withDerivative(_:)`), rematerialization instead of checkpointing, and manually chaining pullbacks. The last technique may be relevant to "one iteration step consists of two optimizer updates".
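For a flavor of derivative surgery, a small sketch using `withDerivative(_:)`:

```swift
// Observe and scale the derivative flowing back through `x`, without
// changing the forward value.
@differentiable
func f(_ x: Tensor<Float>) -> Tensor<Float> {
    let y = x.withDerivative { (dx: inout Tensor<Float>) in
        print("∂L/∂x =", dx) // inspect the incoming derivative
        dx = dx * 0.5        // scale it
    }
    return (y * y).sum()
}
```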
If you'd like, please share code that you don't know how to write in Swift! We can make progress on specific examples. 🙂
---

> Sure! Feel free to do it yourself, since it's your unresolved issue!

I didn't have a button to reopen the issue :)
> Could you please share some reference implementations? I'm not an expert, so code will greatly help me understand your examples.

Examples of natural-gradient and Adam optimisers for training an SVGP model.
> it sounds like you care about derivative optimization, avoiding inefficient zero derivatives

Yes, this is part of the concern.
> The custom differentiation tutorial shows some cool techniques like derivative surgery (`Differentiable.withDerivative(_:)`), rematerialization instead of checkpointing, and manually chaining pullbacks. The last technique may be relevant to "one iteration step consists of two optimizer updates".

Thanks, I will try it and give feedback soon.