tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Proposal: a functional API for Burn #226

Closed jmacglashan closed 1 year ago

jmacglashan commented 1 year ago

Feature description

Build a functional API for burn, in which the gradient of the arguments of simple functions can be computed and in which Rust structures themselves can be differentiable types.

You can find a repository of the proof of concept here: https://github.com/jmacglashan/burn-functional

Feature motivation

It is my personal opinion that for Rust to be a viable option for ML researchers, Rust needs a first-class autodifferentiation library that keeps the key programming concepts close to the mathematical ideas researchers are thinking about: functions, gradients, and how you compute them. Inspired by previous AD efforts like Swift for TensorFlow and JAX, I believe a functional API is one of the best ways to achieve that, and it can be accomplished by building on the strong and flexible foundation of Burn.

Action items

The work I've done toward this API is a proof of concept. It's also something that could be defined in its own separate crate without any other changes in Burn. However, I think it would be useful if Burn adopted an API like this, and I think making it first class in Burn would go a long way towards getting ML researchers interested in using Rust.

As such, I'd like to hear feedback on this kind of API and gauge interest in incorporating it into Burn. There are lots of rough edges in it right now. In fact, I learned how to use Rust macros for the first time while working on it! Consequently, I would not immediately put it into Burn. But it is far enough along, with enough examples of how to build on it, that I think it's ready for feedback.

I'd like to know if there is interest in putting this into Burn proper. If there is, I'd also be willing to spend time polishing it.


Description and explanation of proposal

Below I've included all my comments on the readme of the repo that shows how to use a proof of concept functional API for Burn and how the PoC works. The full repository can be found here: https://github.com/jmacglashan/burn-functional

Computing the gradient of function arguments

Let's start with a simple example, computing the gradient of a simple function that multiplies two tensors.

// Assumed imports, for illustration (exact paths depend on the burn version
// and the PoC crate layout):
// use burn::tensor::{backend::Backend, Distribution, Tensor};
// use burn_ndarray::NdArrayBackend;
// use burn_functional::with_grad;
fn main() {
    type B = NdArrayBackend<f32>;  // choose a backend

    // Make some example random tensor data
    let x: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);
    let y: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);

    // Define a function we want to differentiate. with_grad will make a new function that computes
    // the value and returns the gradients of its arguments
    #[with_grad]
    fn my_func<B: Backend>(x: &Tensor<B, 2>, y: &Tensor<B, 2>) -> Tensor<B, 2> {
        x.mul(&y)
    }

    // get the forward value of the function and the gradient of each of its arguments
    let (z, (grad_x, grad_y)) = my_func_with_grad(&x, &y);
    println!("{:?}\n{:?}\n{:?}", z, grad_x, grad_y);
    println!("Done")
}

This is nice! We get to reason about gradients of a function in a natural way. Of course, most ML work is not expressed as operations on a handful of tensors. Rather, we build large structured models out of many tensors.

Differentiable structs

To facilitate that, you can compute gradients not just of tensors, but of Rust structures you define that consist of other differentiable structures and tensors! Doing so only requires adding a DifferentiableVariable derive to your struct. Here we replicate the previous program, but instead of two tensors, we'll make our own differentiable struct to hold them. Then we make our differentiable function take our struct as an argument instead.

#[derive(DifferentiableVariable, Debug)]
struct MyStruct<B: Backend> {
    x: Tensor<B, 2>,
    y: Tensor<B, 2>,
}

fn main() {
    type B = NdArrayBackend<f32>;

    let x: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);
    let y: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);
    let my_struct: MyStruct<B> = MyStruct { x, y };
    println!("{:?}", my_struct,);

    #[with_grad]
    fn my_func<B: Backend>(my_struct: &MyStruct<B>) -> Tensor<B, 2> {
        my_struct.x.mul(&my_struct.y)
    }

    // grad_struct is a MyStruct instance, but the fields hold the gradients!
    let (y, grad_struct) = my_func_with_grad(&my_struct);
    println!("{:?}\n{:?}", y, grad_struct);
    println!("Done")
}

Differentiable containers

It's not just our own structs that are differentiable. Vectors and tuples of differentiable types are also differentiable! Vectors have the advantage that they can be arbitrarily sized, while tuples are fixed-size but let you group together different differentiable types.

Below we rewrite the previous program again using a vector instead of our custom struct.

fn main() {
    type B = NdArrayBackend<f32>;

    let x: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);
    let y: Tensor<B, 2> = Tensor::random([3, 2], Distribution::Standard);
    let tensor_vec = vec![x, y];
    println!("{:?}", tensor_vec);

    #[with_grad]
    fn my_func<B: Backend>(tensor_vec: &Vec<Tensor<B, 2>>) -> Tensor<B, 2> {
        tensor_vec[0].mul(&tensor_vec[1])
    }

    let (z, grad_vec) = my_func_with_grad(&tensor_vec);
    println!("{:?}\n{:?}", z, grad_vec);
    println!("Done")
}

Other basic containers like Options and PhantomData are also differentiable types, and the list can be easily extended to other standard Rust constructs (e.g., HashMaps).

A neural net interface

Being able to compute gradients of regular functions for a wide range of types gets us almost all of the way to building and optimizing neural nets. We really only need two additional features: the ability to do math operations between a type and its gradient, and an interface for "forwarding" models. All differentiable types (e.g., structs with a DifferentiableVariable derive) already come with methods for math operations with their gradients. For forwarding models, a Layer trait is provided.

pub trait Layer {
    type Input;
    type Output;

    fn forward(&self, x: &Self::Input) -> Self::Output;
}

The Input and Output are trait-defined types because it is not uncommon to have different layer input types. E.g., sometimes the input is just a tensor, but other times it might be a vec of tensors, a tuple, etc. For models with mixed inputs and outputs, adapter layers can bridge the gap and provide static checking to make sure layers are compatible with each other.

To this end, just as Vectors and tuples are differentiable types when their elements are, they are also sequential Layers, applying each output as the input of the next layer, when their elements are compatible layers (see the sketch below). This is nice in that instead of defining a special class for sequential lists of layers like other NN Python libraries do, we can just use regular old Rust Vectors and tuples.
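
As a rough illustration (my sketch, not the PoC code verbatim), the tuple case could be implemented like this:

```rust
// Hedged sketch: a two-element tuple acts as a sequential Layer when the
// second layer's Input matches the first layer's Output. The PoC presumably
// provides impls like this for larger tuples and for Vec<L> as well.
impl<A, B> Layer for (A, B)
where
    A: Layer,
    B: Layer<Input = A::Output>,
{
    type Input = A::Input;
    type Output = B::Output;

    fn forward(&self, x: &Self::Input) -> Self::Output {
        let y = self.0.forward(x); // first layer
        self.1.forward(&y)         // its output feeds the second layer
    }
}
```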

Let's give one final example pulling all these pieces together to define an MLP and train it with SGD to optimize some random data.

#[derive(DifferentiableVariable)]
struct MLP<B: Backend> {
    hidden_layers: Vec<(Affine<B>, Relu<B, 2>)>,
    output_layer: Affine<B>,
}

impl<B: Backend> Layer for MLP<B> {
    type Input = Tensor<B, 2>;
    type Output = Tensor<B, 2>;

    fn forward(&self, x: &Self::Input) -> Self::Output {
        let y = self.hidden_layers.forward(x);
        let y = self.output_layer.forward(&y);
        y
    }
}

fn main() {
    type B = NdArrayBackend<f32>;

    // Make some illustrative random training data (32 examples of instances with 3 features and one regression target)
    let x: Tensor<B, 2> = Tensor::random([32, 3], Distribution::Standard);
    let label: Tensor<B, 2> = Tensor::random([32, 1], Distribution::Standard);

    // Make our model
    let mut model: MLP<B> = MLP {
        hidden_layers: vec![
            (Affine::xavier(3, 64, true), Relu::new()),
            (Affine::xavier(64, 64, true), Relu::new()),
        ],
        output_layer: Affine::xavier(64, 1, true),
    };

    // Define a loss function. with_grad will automatically generate loss_fn_with_grad, which also returns the gradients of its arguments.
    // nargs = 1 indicates we only want the _with_grad function to return the gradient for the first argument: the model
    #[with_grad(nargs = 1)]
    fn loss_fn<B: Backend>(model: &MLP<B>, x: &Tensor<B, 2>, label: &Tensor<B, 2>) -> Tensor<B, 1> {
        model.forward(x).sub(label).powf(2.0).mean()
    }

    // Train the model to minimize the loss with SGD
    let mut sgd = SGD(0.01);
    for i in 1..=50 {
        // model_grad is also an MLP<B> struct!
        let (loss, model_grad) = loss_fn_with_grad(&model, &x, &label);
        // could directly apply SGD: model = model.sub_grad(&model_grad.mul_scalar(learning_rate));
        // or use an SGD optimizer as shown below
        model = sgd.update(&model, &model_grad);
        println!("Step {} loss: {:?}", i, loss);
    }

    println!("Done");
}

In this case, I've defined the Affine and Relu layers, and the SGD optimizer, elsewhere in the library, but they too are defined as simple structs using the DifferentiableVariable derive. The SGD optimizer is an implementation of a simple Optimizer trait:

pub trait Optimizer<M: GradientOps> {
    fn update(&mut self, model: &M, grads: &M::Gradient) -> M;
}
pub struct SGD(pub f32);

impl<M: GradientOps> Optimizer<M> for SGD
where
    M::Gradient: GradientOps<Gradient = M::Gradient>,
{
    fn update(&mut self, model: &M, grads: &M::Gradient) -> M {
        model.sub_grad(&grads.mul_scalar(self.0))
    }
}

One of the nice things about gradients being structured types with math support is that it makes defining optimizers easy and to the point.

How it works

Without getting too deep in this readme, the DifferentiableVariable derive implements three traits, which are also implemented for all the other supported types: ToADTraced<AB>, DifferentiableVariable<AB>, and GradientOps.

ToADTraced

The ToADTraced<AB> trait is responsible for taking a differentiable type and converting its backend to a backend that supports autodiff.

pub trait ToADTraced<AB: ADBackend> {
    type TracedType;
    fn to_traced(&self) -> Self::TracedType;
}

Because the trait has a generic ADBackend type, it allows any possible ADBackend to be used in the with_grad functions. (Right now, with_grad will use ADBackendDecorator, but this can be made configurable in the future. More on that later.) The derive macro implements it for structs simply by returning a new version of the struct with to_traced() called on each of its fields. There is an assumption that this construction is cheap, which holds because Tensor data is backed by an Arc, making clones and wrapping in an AD backend cheap.
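
As a rough illustration (my reconstruction, not the generated code verbatim), the derive expansion for the earlier MyStruct might look like this:

```rust
// Hedged sketch of the derive output for MyStruct, assuming the tensor impl
// has TracedType = Tensor<AB, 2>; the PoC's actual bounds may differ.
impl<B: Backend, AB> ToADTraced<AB> for MyStruct<B>
where
    AB: ADBackend<InnerBackend = B>,
    Tensor<B, 2>: ToADTraced<AB, TracedType = Tensor<AB, 2>>,
{
    type TracedType = MyStruct<AB>;

    fn to_traced(&self) -> Self::TracedType {
        // cheap: tensor data is Arc-backed, so this is a shallow clone per field
        MyStruct {
            x: self.x.to_traced(),
            y: self.y.to_traced(),
        }
    }
}
```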

DifferentiableVariable

The DifferentiableVariable<AB> trait is responsible for taking a differentiable struct with an ADBackend, receiving the ADBackend::Gradients value, and constructing a new version of itself with each of its fields bound to the gradient values in the original inner backend.

pub trait DifferentiableVariable<AB: ADBackend> {
    type Gradient;
    fn bind_grad(&self, grads: &AB::Gradients) -> Self::Gradient;
}

Like ToADTraced, the derive macro works by creating a new struct and recursively calling bind_grad on each of its fields.
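
Again as a hedged sketch, the generated impl for MyStruct might look like this (assuming the tensor impl binds each traced tensor to its gradient in the inner backend):

```rust
// Sketch: bind_grad recurses field-wise; the result mirrors the struct's
// shape but lives in the inner (non-autodiff) backend.
impl<AB: ADBackend> DifferentiableVariable<AB> for MyStruct<AB>
where
    Tensor<AB, 2>: DifferentiableVariable<AB, Gradient = Tensor<AB::InnerBackend, 2>>,
{
    type Gradient = MyStruct<AB::InnerBackend>;

    fn bind_grad(&self, grads: &AB::Gradients) -> Self::Gradient {
        MyStruct {
            x: self.x.bind_grad(grads),
            y: self.y.bind_grad(grads),
        }
    }
}
```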

GradientOps

The GradientOps trait is responsible for providing math operations between a differentiable type and its gradients.

pub trait GradientOps {
    type Gradient;
    fn add_grad(&self, rhs: &Self::Gradient) -> Self;
    fn sub_grad(&self, rhs: &Self::Gradient) -> Self;
    fn mul_grad(&self, rhs: &Self::Gradient) -> Self;
    fn mul_scalar(&self, rhs: f32) -> Self;
}

Almost always (if not always), the Gradient type is Self. I have left this flexible in case there are special cases, but it may be worthwhile to just assume it is Self. Furthermore, you often want to do math operations on gradient types themselves (e.g., scale the gradient with a learning rate), in which case having the Gradient type also support GradientOps is useful. When Gradient = Self this is trivially supported.

Like the other traits, the derive macro implements these methods by making a new struct with its fields constructed by recursively applying these operations.
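
For concreteness, here's a hedged sketch of what that generated impl could look like for the earlier MyStruct:

```rust
// Hedged sketch of the derive output for MyStruct (Gradient = Self), assuming
// Tensor implements GradientOps with Gradient = Self: every method recurses
// field-wise and rebuilds the struct.
impl<B: Backend> GradientOps for MyStruct<B> {
    type Gradient = Self;

    fn add_grad(&self, rhs: &Self) -> Self {
        MyStruct { x: self.x.add_grad(&rhs.x), y: self.y.add_grad(&rhs.y) }
    }
    fn sub_grad(&self, rhs: &Self) -> Self {
        MyStruct { x: self.x.sub_grad(&rhs.x), y: self.y.sub_grad(&rhs.y) }
    }
    fn mul_grad(&self, rhs: &Self) -> Self {
        MyStruct { x: self.x.mul_grad(&rhs.x), y: self.y.mul_grad(&rhs.y) }
    }
    fn mul_scalar(&self, rhs: f32) -> Self {
        MyStruct { x: self.x.mul_scalar(rhs), y: self.y.mul_scalar(rhs) }
    }
}
```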

At present, only a handful of operations are provided, but this set could be expanded to something closer to the TensorOps trait.

Differentiable

There is also a fourth, though less important, trait named Differentiable, which allows you to initiate the backward operation on a type and extract the inner-backend version of itself.

pub trait Differentiable<Gradients> {
    type InnerType;
    fn extract_traced(&self) -> Self::InnerType;
    fn backward(&self) -> Gradients;
}

At present, it's only implemented for Tensor types and is not automatically implemented by the DifferentiableVariable derive. That's because it's unclear what a backward pass on a struct of tensors should be. A possible default is to just sum-reduce all tensors recursively contained in the struct, but a mean reduce might be better in other cases. Since there isn't one clear option, it's not automatically implemented and requires the user to decide. More often than not, though, I believe most users would simply create a Tensor from a struct manually in their loss function to differentiate, in which case the implementation on Tensor is sufficient.
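
If a struct-level backward were wanted anyway, a manual implementation might look like the following hedged sketch (using the mean() and backward() calls seen elsewhere in this post, and an arbitrary mean-reduce choice):

```rust
// Sketch only: mean-reduce every field into one scalar-ish tensor, then backward.
impl<AB: ADBackend> Differentiable<AB::Gradients> for MyStruct<AB> {
    type InnerType = MyStruct<AB::InnerBackend>;

    fn extract_traced(&self) -> Self::InnerType {
        MyStruct {
            x: self.x.extract_traced(),
            y: self.y.extract_traced(),
        }
    }

    fn backward(&self) -> AB::Gradients {
        // one possible reduction; sum-reduce would be just as defensible
        self.x.mean().add(&self.y.mean()).backward()
    }
}
```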

with_grad

The with_grad proc macro makes a function that computes the gradient by first turning all of the arguments into AD-traced versions of themselves with to_traced(). Then it calls the original function on those traced values. It takes the output and calls backward() on it to get the ADBackend::Gradients. Finally, it calls bind_grad() on each of the arguments and returns the results.

For example, if we manually implemented with_grad for our first example, this is the function it would create.

fn my_func_with_grad<B: Backend>(x: &Tensor<B, 2>, y: &Tensor<B, 2>) -> (Tensor<B, 2>, (Tensor<B, 2>, Tensor<B, 2>)) {
    let x: Tensor<ADBackendDecorator<B>, 2> = x.to_traced();
    let y: Tensor<ADBackendDecorator<B>, 2> = y.to_traced();
    let __y = my_func(&x, &y);
    let grads = __y.backward();
    (__y.extract_traced(), (x.bind_grad(&grads), y.bind_grad(&grads)))
}

At the moment with_grad always uses the ADBackendDecorator, and I believe that is currently the only ADBackend in Burn. However, it would be straightforward to provide a macro argument that lets you choose the ADBackend and simply defaults to ADBackendDecorator.

Non-differentiable types

Not all types are conventionally differentiable. For example, no standard Rust primitive is, and more generally, any type that doesn't involve a Burn Backend type is not differentiable. Nevertheless, we may want to define differentiable structs that include these types to configure how they work. For this reason, the marker trait Constant is provided. Any type that implements Constant and is cloneable automatically gets implementations of the three main type traits that simply return clones of itself, acting as a static value.

Constant is automatically implemented for a subset of Rust primitives like usize, f32, and so on. Because of trait implementation conflicts, this excludes Vectors and tuples (it was not clear to me how to avoid the conflicts, given that a Vector is differentiable if its elements are). For situations like these, or when there is some external non-differentiable type you want to use, there is also a ConstantCell container that wraps the type in one that makes it Constant.
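
Conceptually, the blanket implementations look something like the following sketch (my reconstruction; the real bounds may differ, and coherence rules are exactly what blocks Vectors and tuples here):

```rust
// Hedged sketch of the Constant machinery described above; names follow the
// traits defined earlier, but the actual PoC bounds may differ.
pub trait Constant {}

impl Constant for usize {}
impl Constant for f32 {}

// Constants pass through tracing unchanged: they just clone themselves.
impl<AB: ADBackend, T: Constant + Clone> ToADTraced<AB> for T {
    type TracedType = T;
    fn to_traced(&self) -> T {
        self.clone()
    }
}

// The "gradient" of a constant is its original value (see the note in the
// example below); GradientOps would be implemented the same way.
impl<AB: ADBackend, T: Constant + Clone> DifferentiableVariable<AB> for T {
    type Gradient = T;
    fn bind_grad(&self, _grads: &AB::Gradients) -> T {
        self.clone()
    }
}
```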

An example usage of Constant on a custom struct is shown below.

#[derive(DifferentiableVariable, Debug)]
struct StructWithConstants<B: Backend> {
    x: Tensor<B, 2>,
    plain_val: usize,
    other_val: ConstantCell<Vec<usize>>,
}

#[with_grad]
fn my_func<B: Backend>(val: &StructWithConstants<B>) -> Tensor<B, 2> {
    val.x.mul_scalar(5.0)
}

fn main() {
    type B = NdArrayBackend<f32>;

    let x: Tensor<B, 2> = Tensor::random([2, 3], Distribution::Standard);
    let my_struct = StructWithConstants {
        x,
        plain_val: 71, // usize automatically implements Constant
        other_val: ConstantCell::new(vec![8, 9]), // Vecs can have variables, so we use a ConstantCell to guard it
    };
    let (val, grad) = my_func_with_grad(&my_struct);
    println!("{:?}", val);
    // Note that the constant fields of the gradient struct retain their original values.
    // A more mathematical interpretation would be to make them zeros, but this formalism
    // is a bit more useful for understanding the constants of the function that was
    // differentiated, and it keeps types consistent, since not every type has a "zeros" value.
    println!("{:?}", grad);
}

Current limitations

This project is largely a proof of concept. As such there are a number of limitations in its current state.

And there are probably more limitations than those listed at present.

antimora commented 1 year ago

Linking an existing request for functional API (#120) by @makroiss (tagging here for input).

nathanielsimard commented 1 year ago

I'm definitely interested in improving the burn API with functional features, and I've already started working on some parts. There have been a lot of changes merged since the v0.5.0 release, which is the one you based your work on. Firstly, the Optimizer and Module traits are now more "functional"; you can use them similarly to your implementation:

let grads = loss.backward();
let model = optimizer.update_module(model, grads); // Note that the optimizer takes ownership of both model and grads.

However, it still uses the backward function to calculate gradients, which we could replace with your derive.

let (model, grads) = my_func_with_grads(model, inputs);
let model = optimizer.update_module(model, grads);

I'm not sure we need another layer of abstraction for the optimizers (GradientOps) though. The current implementation has been improved and makes it really easy to implement an optimizer:

fn update_tensor<const D: usize>(
    &mut self,
    id: &ParamId,
    tensor: Tensor<B, D>,
    grad: Tensor<B::InnerBackend, D>,
) -> Tensor<B, D> {
    Tensor::from_inner(tensor.inner() - grad.mul_scalar(self.learning_rate))
}

Although it could be improved by having the tensor and the grad on the same backend (not wrapped inside an autodiff backend).

fn update_tensor<const D: usize>(
    &mut self,
    id: &ParamId,
    tensor: Tensor<B, D>,
    grad: Tensor<B, D>,
) -> Tensor<B, D> {
    tensor - grad.mul_scalar(self.learning_rate)
}

The current trait allows optimizers to keep a state, and also save and load it from a file for checkpointing.

The `DifferentiableVariable` derive could be integrated with the `Module` derive, since modules in burn are really just containers for `Tensor`s. However, I'm not sure I understand the `DifferentiableVariable` trait. The updated `Module` trait has a new `map` function that is generated automatically by the `Module` derive.

```rust
trait Module {
    // ...
    fn map<M: ModuleMapper<Self::Backend>>(self, mapper: &mut M) -> Self;
}

pub trait ModuleMapper<B: Backend> {
    fn map<const D: usize>(&mut self, id: &ParamId, tensor: Tensor<B, D>) -> Tensor<B, D>;
}
```

This means you can map each tensor contained in a module however you like. This is actually the primary way the optimizers are now implemented (they dispatch to the update_tensor function shown earlier). I believe this is better than having each differentiable container support a wide range of operations, even if those implementations are generated. I'm very interested in your feedback about this map function and what other methods like this could be useful.

About the non-differentiable types: I've rewritten burn-autodiff entirely since v0.5.0 😅 (mainly for performance and maintainability reasons) and now there is a concept of tracked and untracked nodes. The method require_grad() can be called on any tensor. All tensors wrapped inside the Param struct are automatically tracked, since require_grad() is called in the From implementation. Constant values just need to implement Clone. If you want the values to still be included in the module state, we could use something like ConstantCell. There is already a version of this implemented for BatchNorm2d, since running metrics need to be saved but not tracked by the autodiff engine.
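
For illustration only, a minimal sketch of that opt-in tracking (my reading of the paragraph above; the exact require_grad signature here is an assumption):

```rust
// Sketch, not verified against this burn version: tracking is opt-in per tensor.
let weight = Tensor::<ADBackendDecorator<B>, 2>::random([3, 2], Distribution::Standard)
    .require_grad(); // tracked: gradients will be computed for `weight`
let input = Tensor::<ADBackendDecorator<B>, 2>::random([3, 2], Distribution::Standard);
// `input` stays untracked, so the autodiff engine skips it
```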

As you can probably see, a lot of changes in the burn API were made to make it more functional. I appreciate the effort you've put into burn and I think we could work together to improve it further. I'm interested in integrating a variant of the #[with_grad] proc macro, which could pave the way for higher-order autodiff.

jmacglashan commented 1 year ago

Thanks for the feedback!

Regarding GradientOps: I agree that it would be nice to lose this requirement. It's the most verbose of the traits I defined. Whereas others have small method/type contracts, GradientOps has a handful and would probably get bigger if it stayed in the picture.

In general, I think a tensor map function is a great idea. It was also on my list of things to try in my differentiable type space, but I decided to share what I had to get more feedback before venturing down the rabbit hole further :p I'm glad to see a tensor mapping paradigm already being integrated into Burn and I definitely think it's a good way forward.

I think for a map structure approach to be an ideal replacement for GradientOps in the functional setting I've proposed, it would need two properties: an easy way to define a tensor operation inline, and support for binary (or n-ary) maps.

I think the first is possible. For example, if there was a MapTensor trait (similar to the ModuleMapper trait in the current version of Burn), you could have macro usage like

tensor_op!(|x| x.mul_scalar(2.0))

automatically generate and return in place a MapTensor impl like:

{
    struct OpStruct;
    impl MapTensor for OpStruct {
        fn map_tensor<B: Backend, const D: usize>(x: &Tensor<B, D>) -> Tensor<B, D> {
           x.mul_scalar(2.0)
        }
    }
    OpStruct
}

(I'm not sure that precise method works; I'm a little fuzzy on Rust's rules for locally scoped structs. Nevertheless, I suspect something like this is doable.)

That only gets us part of the way, though, because to replace GradientOps you need to work on at least binary inputs (the model and its gradient). If we could support binary (or n-ary) maps, similar to tf.nest.map_structure, then we could do things like provide the gradient of a struct as a second operand.

I'm not sure at the moment precisely how to go about implementing that, but perhaps something like the following could be worked out.

for i in 1..=50 {
    let (loss, model_grad) = loss_fn_with_grad(&model, &x, &label);
    // inline SGD, but could be done by an optimizer too
    model = model.map_struct(
        model_grad, // provide an additional struct to map over
        // a is a tensor from self and b a tensor from model_grad
        // at the same structure position
        tensor_op!(|a, b| a - b.mul_scalar(0.01))
    );
    println!("Step {} loss: {:?}", i, loss);
}

To make that work, I think you mostly just need a way to recursively iterate over the tensor leaves of a struct and then you zip the iteration of the two structs and apply the tensor map function.

Based on what you showed and looking around the repo a bit more, I believe the current ModuleMapper approach gets around the need for a binary map operation for coordinating updates by having Param types with a ParamId. To use gradients in the map, you wrap the gradients lookup table in the ModuleMapper, iterate over the model, and look up the corresponding gradient by its ParamId. (Correct me if I'm wrong.)

However, I think it would be nice if we didn't have to cross-reference data with param ids. If we can use the structure itself, that feels more direct and clear to me.

If a binary (or n-ary) mapping can be made to work so that param ids are not needed, that raises the question: would Module need a Param concept at all? Param is fairly lightweight, but if we ultimately didn't need it, that would simplify the API and make reasoning about updates a little more direct and clear.

Beyond cross-referencing gradients with parameters, I think the other thing the Param tracking is being used for is serialization support. That's possibly a good reason to keep it even if gradient updates could be handled by the struct organization and binary map functions. However, if regular Tensors supported serde serialization (I think they don't right now?), perhaps Param types wouldn't be needed for that either, and regular tensor fields plus serde serialization on the type/Module would be sufficient? Not sure if there are issues preventing that.

jmacglashan commented 1 year ago

Looks like n-ary map_structure and the above-described scheme will indeed work!

That is, we can define:

pub trait TensorsOp {
    fn map_tensor<B: Backend, const D: usize>(&self, inputs: &Vec<&Tensor<B, D>>) -> Tensor<B, D>;
}

pub trait MapStructure {
    fn map_struct(inputs: Vec<&Self>, tensor_op: &impl TensorsOp) -> Self;
}

impl<B: Backend, const D: usize> MapStructure for Tensor<B, D> {
    fn map_struct(inputs: Vec<&Self>, tensor_op: &impl TensorsOp) -> Self {
        tensor_op.map_tensor(&inputs)
    }
}

#[macro_export]
macro_rules! tensors_op {
    (|$input:ident| $body:expr) => {
        {
            struct _TensorOpStruct;
            impl TensorsOp for _TensorOpStruct {
                fn map_tensor<B: Backend, const D: usize>(&self, $input: &Vec<&Tensor<B, D>>) -> Tensor<B, D> {
                    $body
                }
            }
            _TensorOpStruct
        }
    };
}

To handle the n-ary arguments, map_tensor and map_struct take a vec of all the arguments. I haven't yet written the derive for implementing MapStructure, but it should be very simple.

An illustrative example of using these facilities is the following:

#[derive(Debug)]
struct MyStruct<B: Backend> {
    x: Tensor<B, 2>,
    y: Tensor<B, 1>,
}

// this would ultimately be covered by a derive that is pretty simple to implement.
impl<B: Backend> MapStructure for MyStruct<B> {
    fn map_struct(inputs: Vec<&Self>, tensor_op: &impl TensorsOp) -> Self {
        MyStruct {
            x: MapStructure::map_struct(inputs.iter().map(|el| &el.x).collect(), tensor_op),
            y: MapStructure::map_struct(inputs.iter().map(|el| &el.y).collect(), tensor_op),
        }
    }
}

fn main() {
    type B = NdArrayBackend<f32>;
    // make two structs we want to use in a binary operation
    let my_struct1: MyStruct<B> = MyStruct{x: Tensor::ones([2, 3]), y: Tensor::ones([4]).mul_scalar(2.0)};
    let my_struct2: MyStruct<B> = MyStruct{x: Tensor::ones([2, 3]), y: Tensor::ones([4])};

    // apply a binary map_struct that multiplies them together
    let result = MapStructure::map_struct(
        vec![&my_struct1, &my_struct2],  // to make it a binary op, we just provide both structs in a vec
        // the tensors_op! macro lets you inline a tensor operation. It expands to making a new (scoped) struct
        // that impls TensorsOp with the body of the closure-like expression.
        &tensors_op!(|inputs| inputs[0].mul(inputs[1]))
    );

    // the result is as expected, a MyStruct that is the multiplication of the two input tensors
    println!("{:?}", result);

}

With that functionality, I think GradientOps can indeed be removed and optimizers can be more simply implemented by using map_struct with regular n-ary tensor operations.

I'll see if I can close the loop on that later.

nathanielsimard commented 1 year ago

The MapStructure is really similar to the Module trait; I would look at this before doing more development! Also note that, most of the time, function definitions should take ownership of tensors. With the recent changes, no operation can be executed on a tensor reference; if you pass references around, you will need to clone each tensor every time, which removes the opportunity for the tensor's data to be reused in place.

jmacglashan commented 1 year ago

It is quite similar indeed! The reason for the difference was I was trying to (1) see if it was possible to map it without relying on Param indexing, and (2) support easy n-ary map operations.

However, your point about taking ownership is fair, and it's probably unreasonable to work without params (at least at the moment; that would require a lot more thought and may not be reasonable).

So let me refocus my efforts. Here's what I'm currently thinking I'll look into next:

  1. See if I can make an n-ary map and tensor op macro building on the existing Module and map system (or as close to the current map system as possible)
  2. Look into making every Module implementing struct automatically support the DifferentiableVariable and with_grad capabilities (leaving out GradientOps since that could be replaced by the map operations with macros)

Does that sound reasonable to you?

The next big block of time I have to take a look at this is probably the weekend, and maybe a little during the evenings of the remaining weekdays.

nathanielsimard commented 1 year ago

Yes, this seems reasonable to me. I'm still unsure if you need DifferentiableVariable on the Module trait, since it could be included with the mapper instead, but I may be wrong 😅.

jmacglashan commented 1 year ago

I think using a mapper in some form is likely how I would implement DifferentiableVariable for a module, but DifferentiableVariable as its own trait makes it a little more amenable to other non-module types, such as regular tensors and containers of tensors (e.g., vecs of tensors, tuples of tensors, options of tensors, etc.), which I don't think are currently modules because they don't have Param wrappings. (Correct me if I'm wrong.)

nathanielsimard commented 1 year ago

You have the right intuition, but you can make any primitive type a module by manually implementing Module for Param<Type>. So as of now, Param<Tensor> actually implements Module. It's the same for Vec and Option, so it seems to be really similar to your design after all 😅. If ParamId was not necessary, we could implement Module directly for Tensor, but we would lose the ability to save and load them. Recursive data structures are really beautiful 😁.

jmacglashan commented 1 year ago

They're definitely quite similar :)

I do think it would be really nice to have all the concepts work on either plain tensors (with no Param wrapping) or structures of them (via a module), which is why I would make DifferentiableVariable its own trait -- it means you can operate on just plain tensors or modules with the same functional gradient API.

Question related to serialization: it seems like you could easily add standard serde serialization to structs of tensors (and therefore module-like objects without Param wrappings) if Tensor supported serde itself, but I don't think it does. What's the reason it can't be implemented for Tensors?

antimora commented 1 year ago

They're definitely quite similar :)

I do think it would be really nice to have all the concepts work on either plain tensors (with no Param wrapping) or structures of them (via a module), which is why I would make DifferentiableVariable its own trait -- it means you can operate on just plain tensors or modules with the same functional gradient API.

Question related to serialization: it seems like you could easily add standard serde serialization to structs of tensors (and therefore module-like objects without Param wrappings) if Tensor supported serde itself, but I don't think it does. What's the reason it can't be implemented for Tensors?

@jmacglashan I had a similar question about serialization in this ticket: https://github.com/burn-rs/burn/issues/234. Is this what you are thinking as well?

nathanielsimard commented 1 year ago

We could definitely split the Module trait into smaller, more specialized traits.

Tensor can be serialized now, it just needs to be extracted into data first, though it could be hidden in a simple function like save.

But I think it may not be clear why we need param_id, so let's clarify this a bit and we may converge to a better solution. The serialized file structure of burn modules is recursive, similar to a JSON file that would look like this:

{
  "parent_module": {
    "child_module": {
      "id": "paramId",
      "data": [1, 2, 3, ...],
      "shape": [2, 3, 3]
    }
  }
}

However, each module is unaware of where it is in the tree, making it harder for optimizers to keep a state for each tensor. To solve that, we have a simple mapping with param ids that can be loaded from the module state, so the optimizer states can be represented using simple HashMaps. Some work could be done to update how optimizers keep their state for each differentiable variable, and across different training runs (to restart training from a checkpoint).

jmacglashan commented 1 year ago

@jmacglashan I had a similar question about serialization in this ticket: https://github.com/burn-rs/burn/issues/234. Is this what you are thinking as well?

@antimora Yeah that seems like a similar question to mine :)

Tensor can be serialized now, it just needs to be extracted into data first, though it could be hidden in a simple function like save.

But I think it may not be clear why we need param_id, so let's clarify this a bit and we may converge to a better solution. The serialized file structure of burn modules is recursive, similar to a JSON file that would look like this:

{
  "parent_module": {
    "child_module": {
      "id": "paramId",
      "data": [1, 2, 3, ...],
      "shape": [2, 3, 3]
    }
  }
}

However, each module is unaware of where it is in the tree, making it harder for optimizers to keep a state for each tensor. To solve that, we have a simple mapping with param ids that can be loaded from the module state, so the optimizer states can be represented using simple HashMaps. Some work could be done to update how optimizers keep their state for each differentiable variable, and across different training runs (to restart training from a checkpoint).

Thanks for the explanation! I think if we adopt gradients being held in the same module struct, as in this functional API, that would also allow us to make optimizers generic over the model/module, with all the optimizer variables similarly being instances of the same module struct. From that, the optimizer would know the structure and allow standard serialization without needing Param objects for bookkeeping. E.g.,

struct SGD<M: Module> {
    // momentum variable is same type as module and gradient
    // this allows the optimizer to maintain the structure hierarchy without Param id lookup
    momentum: Option<M>,   
}

impl<M: Module> SGD<M> {
    fn update(&mut self, model: M, grad: M) -> M { // update method might want to be from an optimizer trait
        // because model, grad, and momentum are all same type, we can apply mapped operations on them
    }
}

Let me know if you think I'm missing something though!

nathanielsimard commented 1 year ago

I'm not sure we should bind an optimizer to a module. I'd like burn to be as flexible as possible, therefore allowing modules to change their parameters during runtime and using the same optimizer on potentially multiple modules. Maybe we could set an id on each tensor instead of using the wrapper struct Param. It could be optional and manually set?

jmacglashan commented 1 year ago

What use case are you imagining where you want to change the kind of model the optimizer is optimizing in the middle of training? If it's a stateful optimizer (where the state depends on the model weights, like momentum terms), you can't reuse its state for a new kind of model anyway. It seems like a footgun to allow a wholly new kind of model to suddenly be used in an existing optimizer for a different model, because it's unclear what the behavior for that should be. And if the desired behavior is "make new state variables", then that's what making a new optimizer does anyway.

Just to add a couple of points of clarity:

Beyond removing the dependency on Params, I'll add that the functional API I'm proposing really wants an optimizer that works on the types themselves. One of the core motivations is not to have a bookkeeping container as a barrier between you and the gradients, states, etc. Everything would use the structure you define yourself, with basic operations you can apply directly to those structures.

Maybe we could set an id on each tensor instead of using the wrapper struct Param. It could be optional and manually set?

Possibly. It might be cumbersome to set it manually, but if you had a scheme that did it automatically while keeping it consistent, that seems like it would be an improvement. I'd say use a UUID, but you'd need the IDs associated with a module to be consistent across updates, checkpoint loading, etc., so I'm not sure a UUID is the right way to do it.

nathanielsimard commented 1 year ago

Well, I'm thinking of adding parameters to an existing module, probably via a list of modules. But yes, it can still be the same type, so it's true that "binding" to a model type will not remove that possibility! For now, the state is associated with parameter IDs, so for a new parameter, the momentum will start from the beginning and won't be impacted by the other ones. Therefore, you could reuse the same optimizer for different modules with different parameters, and everything would be fine since there is no state sharing between parameters. This is cool, but at the same time, for some applications you may want state to be shared between parameters, no matter when they were created (it changes nothing when the parameters are fixed from the beginning). So it's not perfect in terms of flexibility.

I'm not against your proposal; it's definitely useful work in improving the Optimizer API. However, I'm still not convinced that we should use the Module trait as state for optimizers, since a module can have other fields that are not parameters (e.g., dropout layers, activation functions, configs, running states like in batch norm, etc.).

For now, the wrapper struct Param is used during code generation to call the right function (e.g., to_device, detach, state, etc.). I'm curious how you would distinguish between trainable parameters and everything else during codegen?

nathanielsimard commented 1 year ago

Maybe having an annotation like #[param] could do the trick.

jmacglashan commented 1 year ago

Yeah, if you want to do something like progressive nets where you add a layer to a vec of a previous model, I think that could still be supported. While that could be supported, I might still be inclined to "suggest" (to the extent that you can suggest any patterns through API design :p) that the user make a new optimizer where they deliberately choose what parts of the state they want to keep, because you can get very different learning dynamics when some parts of the net have historical variables and other parts don't. If the user is explicit about which optimizer state variables they copy over, there won't be any surprises in how training proceeds.

Of course, you would also want to make it easy for a user to choose how they copy state over if they need to. Fortunately, that's one of the advantages of everything belonging to the struct the user made: they know precisely which optimizer variables correspond to which parts of the network. They'll even get IDE autocomplete and can be sure they're copying what they want!

I'm not against your proposal; it's definitely useful work in improving the Optimizer API. However, I'm still not convinced that we should use the Module trait as state for optimizers, since a module can have other fields that are not parameters (e.g., dropout layers, activation functions, configs, running states like in batch norm, etc.).

For now, the wrapper struct Param is used during code generation to call the right function (e.g., to_device, detach, state, etc.). I'm curious how you would distinguish between trainable parameters and everything else during codegen?

Maybe having an annotation like #[param] could do the trick.

Using #[param] annotations for fields was originally what I was thinking when I started down this path too, and I could still be convinced to go that way. However, I later changed plans in a direction that resulted in a much simpler implementation for the various traits and the derive macro.

That is, I introduced the Constant marker trait (though I'd be flexible on its name). Types that impl Constant get implementations of the three other traits where they basically just clone themselves and don't do anything else. That means if you get a gradient, or an optimizer variable, it just holds onto the original value from the model. E.g., a Dropout layer with an f32 dropout-rate field just keeps that value in the Dropout struct that defines the gradient. The same goes for tensor mapping operations -- the fields that impl Constant have their value copied into the mapped struct. (I think this is the same behavior as the current ModuleMapper.)

Having the non-parameter values retain their original value in the gradient struct binding has some advantages beyond making the traits and derive easy to implement. E.g., if you were looking at the gradient of a dropout layer, it's kind of nice to know immediately what the dropout rate was by looking at its field in the gradient -- you don't have to go back to the model to find out. You could even implement custom special purpose optimizers that were sensitive to that field or even update it with some custom operation if you wanted to.

The plan was that all primitive Rust types that cannot be containers of Tensors (e.g., f32, bool, etc.) would impl the Constant trait, because it's not possible to differentiate or do tensor operations on them regardless. And for new non-parameter types a user makes, such as a config struct, the user can similarly just one-line impl Constant for MyType {} to make it work. Or they can put it in a ConstantCell wrapper (which also lets you have non-parameter tensors if you need them).

The one limitation of constant values cloning themselves is that if the type is large in memory, you may not want to clone it. I think that's almost never the case for these kinds of non-parameter fields (e.g., an f32 isn't problematic to copy), but there is also an easy escape hatch: you can wrap a potentially big non-parameter field that you don't want cloned in an Option-like enum that, on gradient binding (or other tensor operations), changes its value to the None variant.

I suspect that kind of Option-like wrapper would almost never be used, because I suspect most non-parameter fields are either simple values that are easy to clone (e.g., f32) or on the heap where the clone just clones the pointer, but it allows you to get that behavior for any edge cases where you might need it.

And of course, if you want some really special behavior for some non-parameter field, you can manually impl the traits for your field's type to behave the way you want.

nathanielsimard commented 1 year ago

The difference between choosing to mark param or constant is mostly which one is the default, and I believe it would be nice to have a common trait to implement for both cases. ConstantCell or a #[constant] annotation works for values that could be differentiated but that you don't want to be.

Maybe having to explicitly define param is less friendly than explicitly defining constant, since there will be way more params than constants. I like it.

I don't think cloning is a problem here, since if it's big you can wrap the struct in an Arc.

About the optimizer with a dynamic model: I'm not sure cloning the optimizer state will be very user friendly. If you are deep into a model and want to add a new layer dynamically, you may not have access to the optimizer in that scope. Optimizers are normally decoupled from the model definition; you normally have to register parameters, but it can be done automatically, so you will never see "ah, I forgot to register my params, that's why it doesn't learn".

Maybe your solution won't compile if you don't bind the optimizer to its model, which would be better, but can you clarify that part? I'm not sure I really get it.

jmacglashan commented 1 year ago

The difference between choosing to mark param or constant is mostly which one is the default, and I believe it would be nice to have a common trait to implement for both cases. ConstantCell or a #[constant] annotation works for values that could be differentiated but that you don't want to be.

Maybe having to explicitly define param is less friendly than explicitly defining constant, since there will be way more params than constants. I like it.

Cool. Yeah params being the common case is my feeling as well.

I don't think cloning is a problem here, since if it's big you can wrap the struct in an Arc.

Sounds good to me too!

About the optimizer with a dynamic model: I'm not sure cloning the optimizer state will be very user friendly. If you are deep into a model and want to add a new layer dynamically, you may not have access to the optimizer in that scope. Optimizers are normally decoupled from the model definition; you normally have to register parameters, but it can be done automatically, so you will never see "ah, I forgot to register my params, that's why it doesn't learn".

There's probably just a subjective preference regarding the optimizer continuing automatically or not. What to do with an optimizer when you change your model feels like it's undefined. As in, there are lots of ways a researcher/practitioner might decide to handle that with none of the different choices being more a priori "right" than the other.

My time in ML has made me somewhat averse to libraries choosing for you automatically in such cases just so that it runs. Automatically choosing so it runs is almost always a source of unexpected bugs, and I prefer APIs that ask you to explicitly opt in when you do something where there are multiple ways to interpret what is right.

But I also am probably on the more cautious side of that. For example, I tend to think implicit broadcasting should be considered a bug more than a feature, but lots of people disagree with me even as I watch them get burned by it :p

There could also be a middle ground where we require explicit opt in, but we make it really easy for some standard cases. E.g., for structs with vec fields of layers, we can make it so there is a simple method you can call to expand the vector fields to match a new module, but you have to choose to call it on the state and the optimizer gets unhappy if you don't :p But mostly I'm spitballing at the moment and haven't thought that through very far.

Maybe your solution won't compile if you don't bind the optimizer to its model, which would be better, but can you clarify that part? I'm not sure I really get it.

If the struct type of the model changes, my approach would fail statically. If a vector field changed in size, right now it would fail at runtime. I'm not sure there is a way to make it statically fail on differently sized vecs, since doing that would seem to be at odds with allowing vecs in the first place!

If you want to see what I have implemented at the moment, the optimizer trait and a stateful example looks like this:

https://github.com/jmacglashan/burn-functional/blob/update_to_latest_burn/src/optimizers.rs

pub trait Optimizer<M: StructsOp<Input = M>> {
    fn update(&mut self, model: M, grads: M) -> M;
}

pub struct SGDWithMomentum<M> {
    lr: f32,
    beta: f32,
    momentum: Option<M>,
}

impl<M> SGDWithMomentum<M> {
    pub fn new(lr: f32, beta: f32) -> Self {
        Self {
            lr,
            beta,
            momentum: None,
        }
    }
}

impl<M: StructsOp<Input = M> + Clone> Optimizer<M> for SGDWithMomentum<M> {
    fn update(&mut self, model: M, grads: M) -> M {
        let scaled_grads: M = StructsOp::apply(vec![grads], &MulScalarOp(-self.lr));
        if let Some(mom) = self.momentum.take() {
            let scaled_mom: M = StructsOp::apply(vec![mom], &MulScalarOp(self.beta));
            self.momentum = Some(StructsOp::apply(
                vec![scaled_mom, scaled_grads],
                &tensors_op!(|a, b| a + b),
            ))
        } else {
            self.momentum = Some(scaled_grads);
        }
        StructsOp::apply(
            vec![model, self.momentum.as_ref().unwrap().clone()],
            &tensors_op!(|a, b| a + b),
        )
    }
}

:warning: Note that the above code is coming from a branch where I've been updating my PoC to work and integrate with the current version of Burn. I'll have more to say on that later as I get further along in it.

The relevant part is that StructsOp is like ModuleMapper for doing operations between the state/gradient/module, except it works on regular tensors and doesn't require Param ids. StructsOp is a smaller, more specialized trait, so the optimizer only depends on that bound, but if this proposal goes through I would imagine it being either part of the Module trait or a bound on it.

Hopefully, from the above it's clear why it would statically fail if you changed the model struct: the optimizer is generic over it, and you can't pass a new struct type into its update method.

For modules that change by increasing the size of some vector field rather than by type, it wouldn't statically fail. But what matters is how StructsOp handles two vecs of different sizes: the previous state versus the new gradients and module. Right now, that would fail when it tries to get the element from the smaller vec. (And I had planned to put better error checks around it.)

The "opt in" approach would be something like "I increased the layers in my model vec, let me call a method on the states of the optimizer that will expand them." or something like that.

nathanielsimard commented 1 year ago

There's probably just a subjective preference regarding the optimizer continuing automatically or not. What to do with an optimizer when you change your model feels like it's undefined. As in, there are lots of ways a researcher/practitioner might decide to handle that with none of the different choices being more a priori "right" than the other.

My time in ML has made me somewhat averse to libraries choosing for you automatically in such cases just so that it runs. Automatically choosing so it runs is almost always a source of unexpected bugs, and I prefer APIs that ask you to explicitly opt in when you do something where there are multiple ways to interpret what is right.

I strongly agree with this. Current optimizers may decide which strategy they use, but someone should be able to implement a new optimizer with their own strategy, and everything should work as they intended.

But I also am probably on the more cautious side of that. For example, I tend to think implicit broadcasting should be considered a bug more than a feature, but lots of people disagree with me even as I watch them get burned by it :p

I'm a bit less extreme, but I can feel your pain 😅

I still think we should not have another trait of operations such as GradientOps or StructOps. Maybe the problem of the parameter id can be "fixed" by using named tensors.

pub trait NamedTensorVisitor<B: Backend> {
    fn visit<const D: usize>(&mut self, name: String, tensor: &Tensor<B, D>);
}

We could have two kinds of visitors: TensorVisitor and NamedTensorVisitor. The name could be generated by the derive macro; it would take more computation than a fixed param id, but it might be negligible. The name is the path of the tensor from the parent module, the same path as in the JSON file shown previously: mymodel.linear-1.bias, for example.

Optimizers could be configured a certain way for new parameter names, so you could control what behavior you want when an optimizer encounters new parameter names during training.
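
For example, a visitor that just records every generated parameter path could look like this (my sketch against the trait above):

```rust
// Sketch: collect every parameter's generated name while visiting a module.
struct PathCollector {
    paths: Vec<String>,
}

impl<B: Backend> NamedTensorVisitor<B> for PathCollector {
    fn visit<const D: usize>(&mut self, name: String, _tensor: &Tensor<B, D>) {
        self.paths.push(name); // e.g. "mymodel.linear-1.bias"
    }
}
```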

jmacglashan commented 1 year ago

Sorry for the slow reply. It's been a busy week!

I strongly agree with this. Current optimizers may decide which strategy they use, but someone should be able to implement a new optimizer with their own strategy, and everything should work as they intended.

Cool, glad to hear it! :)

I still think we should not have another trait of operations such as GradientOps or StructOps.

To be clear, I'm on board with eliminating GradientOps and replacing it with a mapping-like trait that lets you easily define custom tensor operations to apply to structs, rather than relying on multiple trait methods for operations. That is what StructsOp is doing -- it's a single-method trait that replaced my previous need for GradientOps.

That said, I'm open to alternative implementations of what I currently have for StructOps as long as it well captures the ability to easily apply operations to structures and can be implemented down to the tensor level.

Maybe the problem of the parameter id can be "fixed" by using named tensors.

pub trait NamedTensorVisitor<B: Backend> {
    fn visit<const D: usize>(&mut self, name: String, tensor: &Tensor<B, D>);
}

We could have two kinds of visitors: TensorVisitor and NamedTensorVisitor. The name could be generated by the derive macro; it would take more computation than a fixed param id, but it might be negligible. The name is the path of the tensor from the parent module, the same path as in the JSON file shown previously: mymodel.linear-1.bias, for example.

Optimizers could be configured a certain way for new parameter names, so you could control what behavior you want when an optimizer encounters new parameter names during training.

Paths seem like a reasonable direction to provide some additional positional information to the process and to allow for more special-case operations. I'd be on board with that.

One thing that's not as clear to me with the trait you listed above is how to implement it if you wanted to do some tensor math between structures. One of the things I was striving for with StructsOp was to make it easy to apply operations to two structs of the same type, similar to how you would apply them to tensors.

E.g., if I wanted to add the tensors of two structs together, s1 and s2, I can write

let added_struct: MyStruct<B> = StructsOp::apply(
    vec![s1, s2],
    &tensors_op!(|a, b| a + b),
);

And since the latest version of Burn switched to taking ownership, the above scheme also lets you take ownership of the tensors to apply the operation without clones (and reuse memory when possible). If we used something like the visitor pattern you showed above, would there be a way to do that? Or maybe there's a similar kind of macro that could be supported around it to make that easy?

nathanielsimard commented 1 year ago

The struct that implements NamedTensorVisitor can have a state, which is available everywhere, but using it won't be that pretty.

I'm also thinking a lot about making an additional struct with the Module macros:

#[derive(Module)]
struct MyModule<B: Backend> {
    weight: Tensor<B, 2>,
    bias: Tensor<B, 1>,
    some_number: usize,
}

// created by the derive
struct MyModuleParams<T> {
    weight: T,
    bias: T,
}

The new struct MyModuleParams would allow for simpler serializing/deserializing of whatever values you want to attribute to parameters: MyModuleParams<DataSerialize<E>> to save weights, or MyModuleParams<usize> to save a counter (useful with optimizers). We could also use it to store gradients, but we would need Any in that case, MyModuleParams<Box<dyn Any>>; this replaces the current HashMap solution: HashMap<ParamId, Box<dyn Any>>. Then we could have a map_with function that takes the module and params structs to iterate over gradients and module parameters at the same time.

We could create another struct without type erasure for gradients, so we would have MyModuleParamsContainer<T> and MyModuleParams<B>. We could also create an alias type MyModuleState<E> = MyModuleParamsContainer<DataSerialize<E>>. We could also add another struct (hehe) to store constants, MyModuleConstant<B>, so that we could do something like:

let module = ...
let grads = ...

let (params, constants) = module.into_split();
let params = params.zip(grads).map(|param, grad| param + grad);

let module = MyModule::from_split(params, constants);

This is not a full proposal, just an idea 😅.

jmacglashan commented 1 year ago

Deriving a separate struct to hold just params (and possibly another for the constants) does have its virtues. In particular, it avoids cloning the constants when you don't need them and it can make the serialization smaller as you noted.

The trade offs I see though are:

  1. It adds another type for users to think about.
  2. May make consistency with plain tensors and containers of tensors or modules more complicated to manage. (I think it's doable, but it feels a little awkward to either add the additional types to them for consistency with the module usage, or to special-case them if not.)
  3. May make some structs/functions that are generic over a module a little more complicated if you have to add constraints about the params/constant types of the module.
  4. If you only serialize the params, to deserialize a model, you have to know the choices for the constant fields.

Those trade-offs might be worthwhile if the constant fields end up cumbersome, but my feeling is they'd typically be small.

I am wondering, though, if we can add a bit more special-casing opportunity to the StructsOp style I proposed by adding the path info that you suggested. I might at least play around with that idea a bit and see where it goes.

jmacglashan commented 1 year ago

Okay, I worked out a scheme that includes a path in the operations, as you suggested. It provides a lot of flexibility and would let you do things like automatic reinitialization of optimizer variables.

I worked these into my StructsOp trait for prototyping, but I imagine if this were incorporated it wouldn't be its own trait but part of the Module trait, and it could replace the Module::map and Module::visit methods since it's similar to them in spirit.

The new relevant methods look like:

pub trait StructsOp {
    fn apply_at(inputs: Vec<Self>, tensor_op: &impl TensorsOp, path: &mut PathBuf) -> Self
    where
        Self: Sized;

    fn consume_at(self, consumer: &mut impl TensorConsumer, path: &mut PathBuf);
}

There are also helper default implementations, apply and consume, that auto-populate the root path for you; this is how a user would call it in general, as sketched below. (There is also a utility named map for when you are only operating on a single struct and mapping it.)
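Roughly, those defaults (as they'd appear in the StructsOp trait) just seed the root path and delegate to the *_at variants; this is a sketch of my implementation, not verbatim:

fn apply(inputs: Vec<Self>, tensor_op: &impl TensorsOp) -> Self
where
    Self: Sized,
{
    // Seed an empty root path and delegate.
    let mut path = PathBuf::new();
    Self::apply_at(inputs, tensor_op, &mut path)
}

fn consume(self, consumer: &mut impl TensorConsumer) {
    let mut path = PathBuf::new();
    self.consume_at(consumer, &mut path);
}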

The traits it depends on are:

pub trait TensorsOp {
    fn apply<B: Backend, const D: usize>(
        &self,
        inputs: Vec<Tensor<B, D>>,
        path: &PathBuf,
    ) -> Tensor<B, D>;
}

pub trait TensorConsumer {
    fn consume<B: Backend, const D: usize>(&mut self, tensor: Tensor<B, D>, path: &PathBuf);
}

I used a PathBuf instead of a String because it has some convenient methods for pushing and popping path segments. Being a buffer, it can also be reused throughout the whole pipeline. (There are convenience methods apply_child and consume_child that let you specify the extension to the path as a string; they automatically push the child before applying and then pop it afterwards.)
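For example, consume_child (in the StructsOp trait) is roughly this; apply_child is analogous (a sketch, not verbatim):

fn consume_child(self, child: &str, consumer: &mut impl TensorConsumer, path: &mut PathBuf)
where
    Self: Sized,
{
    // Push the child segment, run the consumer over this subtree,
    // then pop so the shared PathBuf is unchanged for the caller.
    path.push(child);
    self.consume_at(consumer, path);
    path.pop();
}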

What that means is that if you want to collect and index all the tensors in a module, we can implement TensorConsumer for TensorContainer<String> in Burn and add a convenience collect function:

impl TensorConsumer for TensorContainer<String> {
    fn consume<B: Backend, const D: usize>(&mut self, tensor: Tensor<B, D>, path: &PathBuf) {
        let path_str = path.to_str().unwrap().to_string();
        self.register(path_str, tensor);
    }
}

pub fn collect<T: StructsOp>(val: T) -> TensorContainer<String> {
    let mut container = TensorContainer::new();
    val.consume(&mut container);
    container
}

So if you have some module, you can easily collect its tensors with let container = collect(my_module);

And with that functionality, we can implement a lazy updater for optimizers that creates newly initialized optimizer variables from the gradient when they weren't in a previous state, and otherwise updates the previous state with the new gradient.

If we take the SGD with momentum example, we get the following

LazyUpdater code

struct LazyUpdater<I, U> {
    initialize: I,
    update: U,
    state: TensorContainer<String>,
}

impl<U: TensorsOp, I: TensorsOp> LazyUpdater<I, U> {
    pub fn with_state<M: StructsOp>(initialize: I, update: U, state: Option<M>) -> Self {
        let container = match state {
            Some(s) => collect(s),
            None => TensorContainer::new(),
        };
        LazyUpdater {
            initialize,
            update,
            state: container,
        }
    }
}
}

impl<I: TensorsOp, U: TensorsOp> TensorsOp for LazyUpdater<I, U> {
    fn apply<B: Backend, const D: usize>(
        &self,
        mut inputs: Vec<Tensor<B, D>>,
        path: &PathBuf,
    ) -> Tensor<B, D> {
        if let Some(tensor) = self.state.get::<B, D>(&path.to_str().unwrap().to_string()) {
            inputs.push(tensor);
            self.update.apply(inputs, path)
        } else {
            self.initialize.apply(inputs, path)
        }
    }
}

SGD With Momentum using LazyUpdater

pub struct SGDWithMomentum<M> {
    lr: f32,
    beta: f32,
    momentum: Option<M>,
}

impl<M> SGDWithMomentum<M> {
    pub fn new(lr: f32, beta: f32) -> Self {
        Self {
            lr,
            beta,
            momentum: None,
        }
    }
}

impl<M: StructsOp + Clone> Optimizer<M> for SGDWithMomentum<M> {
    fn update(&mut self, model: M, grads: M) -> M {
        let (lr, beta) = (self.lr, self.beta);
        let updater = LazyUpdater::with_state(
            tensors_op!(|g; lr: f32| g.mul_scalar(-lr)),
            tensors_op!(|g, mom; lr: f32, beta: f32| g.mul_scalar(-lr) + mom.mul_scalar(beta)),
            self.momentum.take(),
        );
        self.momentum = Some(StructsOp::map(grads, &updater));
        StructsOp::apply(
            vec![model, self.momentum.as_ref().unwrap().clone()],
            &tensors_op!(|a, b| a + b),
        )
    }
}

(Note that I also recently added the ability to provide context to tensors_op!-derived operations; context arguments go on the right-hand side of the ; in the argument list. This makes the usage even simpler.)
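As a mental model, tensors_op!(|g; lr: f32| g.mul_scalar(-lr)) expands to something like the following (the exact expansion differs; this is just the shape of it):

// A struct capturing the context values given on the rhs of the `;`...
struct MulByNegLr {
    lr: f32,
}

// ...whose TensorsOp impl applies the closure body to the tensor inputs.
impl TensorsOp for MulByNegLr {
    fn apply<B: Backend, const D: usize>(
        &self,
        mut inputs: Vec<Tensor<B, D>>,
        _path: &PathBuf,
    ) -> Tensor<B, D> {
        let g = inputs.remove(0);
        g.mul_scalar(-self.lr)
    }
}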

So with that, updaters can now (optionally) handle dynamically growing models while preserving the other benefits. It also opens the door for a lot of other ways to map/manipulate modules with special path-dependent rules.

nathanielsimard commented 1 year ago

The real problem comes from serialization and the need for ParamId or PathBuf. So we only need to replace TensorContainer with one new type, MyModuleContainer<T>.

trait Module<B: Backend> {
    type Container<T>;

    fn map<M: Mapper<B>>(self, mapper: &mut M) -> Self;
    fn map_with<S, O, M: MapperState<B, S, O>>(self, state: Self::Container<S>, mapper: &mut M) -> (Self, Self::Container<O>);
    // Instead of state
    fn serialize(self) -> Self::Container<DataSerialize<B::FloatElem>>;
    ...
}
trait Mapper<B: Backend> {
    fn map<const D: usize>(&mut self, tensor: Tensor<B, D>) -> Tensor<B, D>;
}
// O ensures we can map the state to any other type if we want.
trait MapperState<B: Backend, S, O> {
    fn map<const D: usize>(&mut self, tensor: Tensor<B, D>, state: S) -> (Tensor<B, D>, O);
}

Now optimizers would not be using TensorContainer anymore, but Module::Container<Box<dyn Any>>. The Any is required for the same reason it is required in TensorContainer. However, we should be able to encapsulate it gracefully to have a trait that looks like this:

trait Optimizer {
    type Backend: Backend;
    type State<const D: usize>;

    fn update_tensor<const D: usize>(&self, tensor: Tensor<Self::Backend, D>, state: Self::State<D>) -> (Tensor<Self::Backend, D>, Self::State<D>);
}

So for an implementation:

pub struct SGDWithMomentum<M: Module> {
    lr: f32,
    beta: f32,
    momentum: Option<OptimizerState<M>>,
}

enum SGDState<S> {
    Value(S),
    Uninitialized,
    NotRequired,
}

impl<M: Module> Optimizer for SGDWithMomentum<M> {
    type Backend = M::Backend;
    type State<const D: usize> = SGDState<Tensor<M::Backend, D>>;

    fn update_tensor<const D: usize>(&self, tensor: Tensor<M::Backend, D>, state: Self::State<D>) -> (Tensor<M::Backend, D>, Self::State<D>) {
          // do whatever
    }

    // This would ideally all be auto-generated, but it does not compile as written; we could use a pattern similar to struct ops here.
    fn update(self, module: M) -> (Self, M) {
        struct OptimMapper(Self); // We could use a macro like `tensor_op!` that does everything

        impl MapperState<B, Box<dyn Any>, Box<dyn Any>> for OptimMapper {
            fn map<const D: usize>(&mut self, tensor: Tensor<B, D>, state: Box<dyn Any>) -> (Tensor<B, D>, Box<dyn Any>) {
                let (tensor, state) = self.0.update_tensor(tensor, state.downcast());

                (tensor, Box::new(state))
            }
        }

        module.map_with(self.momentum, &mut OptimMapper(self))
    }
}

This is just a vague idea, but we need to fix how params are serialized before fixing how the optimizer can be refactored. There are definitely some parts of your solution that could help here, but I would prefer using associated types like Optimizer::State rather than using a macro, another struct, or another trait.

What is your opinion on this? Do you think we could combine the way you propose to navigate the struct with an optimizer API that uses an associated type?

jmacglashan commented 1 year ago

I do have a built-in assumption that we can add regular serde serialization to the Tensor type, which from our previous discussions you seemed optimistic we could do (but maybe I misinterpreted?). This proposal for optimizers (and Modules without parameter-wrapping types) wouldn't go through if we can't support that, because of the serialization. Or at least, it wouldn't be nearly as pretty :p

If Tensors can work with regular serde serialization, I don't think there is anything stopping regular Modules from being serialized with regular serde either (that is, without requiring special intermediate states).

In the optimizer I have, it's the Module type that's held onto by the optimizer struct. The TensorContainer is only used temporarily in the update method, and the result is then another Module struct. And if the optimizer struct is just holding onto simple types (like floats) and zero or more serializable Module struct instances, the optimizer could also be serialized with regular serde.

So I guess the question is: can we implement serde Serialize/Deserialize for Tensors? If so, is there any other reason you think Modules (and then optimizers around them) without parameter wrappings couldn't in turn implement Serialize/Deserialize? (I would expect it to work even with derives, since they're otherwise not complex structs.)
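Concretely, what I'm imagining is that a module could then just derive serde like any plain struct (hypothetical, and modulo the usual serde bound plumbing for the backend parameter):

#[derive(Serialize, Deserialize)]
struct MyModule<B: Backend> {
    weight: Tensor<B, 2>,
    bias: Tensor<B, 1>,
}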

nathanielsimard commented 1 year ago

Regarding serialization, I thought about it and you can see my answer here https://github.com/burn-rs/burn/issues/234#issuecomment-1486900342. If we consider modules to only be parameter containers, then we could serialize/deserialize them, but not if they can own any other type. We could make Tensor serializable, but it wouldn't solve the problem.

I really thought deeply about what we are discussing here and came up with multiple points that I feel are important to consider (which were not clear to me before).

  1. We cannot use the module as a container for gradients because not all parameters may have gradients at all times. Some modules may detach their parameters or only use a fraction of their parameters during a forward pass, resulting in some parameters having no gradients. While we could use zero-gradient tensors, it would be inefficient and potentially harmful in very large and sparse models, and Burn would then not be a suitable choice for implementing such models.
  2. The same goes for the state: some parameters might have a state while others don't.
  3. This makes me realize that we can't have a flexible API if we perform optimization at the module level, we need it to be at the parameter level.
  4. All of the stuff mentioned in https://github.com/burn-rs/burn/issues/234#issuecomment-1486900342.

I feel like this invalidates some of the proposals you made regarding the API. However, it made me think hard about the design of Burn, and there are certainly areas where we can improve significantly.

  1. The need to have parameter IDs and the asymmetry between how the module and the optimizer states are saved.
  2. Most of the time, modules will only have parameters, so we can simplify the syntax by specifying constants instead of parameters.
  3. Optimizers should be able to have a global state as well as a local state (per parameter), not just a local state (as of right now).

So thank you for the feedback you gave, it's challenging, but enriching. I'm very interested in knowing if you agree with the points above. The hard part of designing an API is not to find a simple solution, but to find a simple solution that works for all useful known use cases. I see it a bit like physics: 'Everything should be made as simple as possible, but not simpler.'

jmacglashan commented 1 year ago

Thanks for the feedback!

At a certain point there may just be a difference in general design philosophies, and you own the project, so you can certainly choose which one to follow! I'll respond to your comments below though.

Re: non-serializable fields: one example you gave in the other issue was having fields that maintain something like a database connection that cannot be serialized. If the module contains that, I agree it would be problematic. I would not personally design modules to have aspects like that regardless of the serialization, because it makes reasoning about the neural net itself rather difficult. Instead, if I needed access to a database for some data, I would design that outside/around the neural net definition itself.

Re: some of the first set of points:

  1. I would consider tensor fields that are not meant to be part of the differentiation chain to just be constants, and constants already work around that by being placed in a ConstantCell. For modules that have parameters that are not always used, zero gradients actually seem like the right choice to me! Other libraries will sometimes use Nones in that case, which is also okay, but I think for Rust Nones would clutter things (by requiring fields to be Options), and I'm not sure it gives you much more power than zero gradients.

If you need special optimization handling of those sparse parameters beyond what an optimizer would do with zero gradients, I think it's better to avoid a typical optimizer and have the user define a custom optimizer that handles their special cases as they want. And part of the appeal of working closely with the types is it makes it easy for users to write those special cases. I guess one way to express my view is I wouldn't want to try to have optimizers etc. anticipate every possible edge case and guess the right way to handle it. I think that's a losing battle and I'd rather make it easy for a user to define their special case in their own code.

  2. By state, do you mean something like BatchNorm batch statistics? If so, I would similarly wrap them in a ConstantCell, as sketched below.
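Here's a sketch of what I mean for the BatchNorm case (the field layout is illustrative; ConstantCell is the wrapper from my prototype):

struct BatchNorm<B: Backend> {
    // Learnable parameters: visited by StructsOp and differentiated.
    gamma: Tensor<B, 1>,
    beta: Tensor<B, 1>,
    // Batch statistics: wrapped so they are treated as constants and
    // skipped by gradient computation and optimizer updates.
    running_mean: ConstantCell<Tensor<B, 1>>,
    running_var: ConstantCell<Tensor<B, 1>>,
}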

I do like the three directions you'd like to pull from this discussion and think they will be an improvement. I would still prefer to have a functional API that works very closely with types all the way through the optimizer level, and I believe I could make it work. However, if you're not sold on it, it's your prerogative not to follow that path, and if so, I would just build my own crate on top of Burn. Burn's already powerful enough that all of the things I've proposed can easily build on what's there (and in fact the working prototype does just that, living in its own crate), with the exception of serialization at the moment, since that would require Tensors to implement serde Serialize/Deserialize. So if I had one additional request for Burn that would enable me to further develop these ideas, it would be to add Serialize/Deserialize to tensors :)

antimora commented 1 year ago

@jmacglashan Thanks, James, for all your detailed explanations and descriptions. I have been following on the sidelines and learning from you. I do not have much to add to your discussion. However, I have proposed some changes (see #234) to the serialization aspect you mentioned.

Could you please share your views on this (especially the last comment)? @nathanielsimard will be making changes to the API soon. I am working on #204, which requires loading data into the generated model (source code generated from an ONNX file). My proposed solution is to create a custom serde de/serializer and serialize the module. This work may also cover your ask about serializing tensors.

nathanielsimard commented 1 year ago

Hi @jmacglashan and thank you for the response.

Regarding a database inside the neural network: it makes a lot of sense to handle that around the net, but it may be impossible depending on the architecture of the net, and maybe not desirable in some use cases. Regarding missing gradients, I highly prefer using None rather than zeros, since it's the correct representation here, even if it's more complicated to handle.

Your point is right, though: at some point people can write their own optimizer if they want to, but it will be impossible for them to use the learner struct and all the goodies that come with it. In a way, I would prefer having multiple traits for the optimizer, where the fundamental one is as general as possible, and the others are more specialized to improve the DX. My goal here is to find the basic one.

Note that I try my best to separate the codebase into well-defined abstractions that can be used by themselves, so it's totally OK to build your more specialized abstractions on top of Burn if you prefer. Still, the API I'm working on right now is deeply functional, so I hope it can satisfy most people's needs.

The discussion made me realize the values of the project, and their priorities with respect to each other. I would say Flexibility > Performance > Simplicity, in that order, as a simpler abstraction can easily be derived from a more flexible one.

@antimora I'm going to answer in the other issue, but yes there are some changes to make for performance!

jmacglashan commented 1 year ago

Thanks!

@antimora I left some thoughts on your issue. Certainly if Tensor ended up getting serde support in that process, I would be happy :)

@nathanielsimard I definitely like that Burn is headed in a functional direction. And even if I build my own abstractions, the more Burn moves in that direction, the thinner my additions would likely need to be, which is great :)

antimora commented 1 year ago

@nathanielsimard To follow up, please update this ticket with your recent progress in the code and documentation. I'm aware that you addressed many of the points raised in this issue during your conversation with @jmacglashan. Feel free to mention your efforts to make certain parts fully functional, as well as your experiments with making it fully functional. Additionally, please include any updates related to the serialization @jmacglashan wanted to see.

It would be helpful to see some resolution for this ticket, including implemented features, decisions made, and any future tasks (so that our work actually reflects requests).

nathanielsimard commented 1 year ago

The goal of Burn is not to be functional, though functional patterns are often friendly to the kinds of problems we are facing. I think Burn should aim to be Rustly, which is a mix of many paradigms.

In terms of serialization, tensors now implement Record, which is an abstraction over serde (see https://github.com/burn-rs/burn/blob/main/ARCHITECTURE.md#serialization).

There are two optimizer traits, so optimizer implementations are more flexible, since they can choose which trait they want to implement (the more general trait or the simpler one; see https://github.com/burn-rs/burn/blob/main/ARCHITECTURE.md#optimization).

antimora commented 1 year ago

Thank you, Nathaniel. I believe this ticket is now resolved (through the discussion and the recent refactor work) and we can safely close it. I scanned through the comments and do not see additional tasks.

The discussion and ideas in this ticket are so rich with information that I feel we should direct people here if they're thinking about the functional aspect of Burn. @nathanielsimard, should we link from the architecture doc or create an FAQ doc?

@jmacglashan, I hope the recent code changes make your experience using Burn easier, although we did not incorporate all your suggested API changes. Let us know if you agree with the resolution of this ticket. For now we will close it so that it does not come up in our task deck.