Open andreaferretti opened 5 years ago

This time I am trying to port operations from the tensor level to the variable level. For instance, if I try to take the mean along a particular axis of a variable, I get an error. Apparently, the operation of taking the mean of a tensor along some axis is already supported; I am just not sure how to lift it to the variable level.

I have a similar issue trying to compute a softmax: it is implemented for tensors, but not for variables. (Actually I would need log_softmax, but that's for another moment; let's start with what we already have.) Should I manually define a backward propagation operator? Or can the autograd help me in these simple cases?

I guess these questions are a little silly, but without being familiar with the inner workings of Arraymancer, even small issues like these can leave me unsure how to proceed.
I don't think those questions are silly at all. The autograd part still being in flux is why I mention in the README:
> **Arraymancer as a Deep Learning library**
> Deep learning features can be explored but are considered unstable while I iron out their final interface.
>
> Reminder: The final interface is still work in progress.
So regarding `mean`: you can refer to how I ported `mean` and `sum`.
I should probably make a tutorial about adding new autograd operations.
For softmax, you can refer to the `sigmoid` or `tanh` implementations.
The low-level tensor operations live in the `tensor` folder. The neural network code is in the `nn` folder, https://github.com/mratsim/Arraymancer/tree/master/src/nn, which is split into:

- `nn_primitives`: https://github.com/mratsim/Arraymancer/tree/master/src/nn_primitives
- `nn_dsl`: https://github.com/mratsim/Arraymancer/tree/master/src/nn_dsl

A note on type convention: when you see `TT`, it means `Tensor[T]`.
The autograd is a tree (directed acyclic graph) called `Context`. It starts empty, and when you execute an operation on a `Variable` during the forward pass, a node is appended to the tree. A node contains:

- the `Gate`, i.e. the operation
- the input `Variables` to this operation, called the `Parents`
- the result `Variable`(s), called the `Payload`

`Variables` are wrappers around a Tensor and its gradient.
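A minimal sketch of these types (close to the Arraymancer definitions, but treat exact field names and types as illustrative):

```nim
type
  Context*[TT] = ref object
    ## The autograd "tape": nodes are appended during the forward pass.
    nodes: seq[Node[TT]]
    no_grad: bool

  Node*[TT] = ref object
    gate: Gate[TT]               # the operation
    parents: seq[Variable[TT]]   # the input Variables of the operation
    payload: Variable[TT]        # the result Variable(s)

  Gate*[TT] = ref object of RootObj
    ## Base object; concrete gates add the fields they need for backprop.

  Variable*[TT] = ref object
    context*: Context[TT]
    value*: TT      # the wrapped Tensor[T]
    grad*: TT       # its gradient, same shape as `value` when needed
    requires_grad*: bool
```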
Note that the gradient should always have the same shape as the value (except when no gradient is needed).
The context has a field called `no_grad`, and there is also finer control per `Variable` with `requires_grad`. Only the weights of learnable layers require a gradient, so `requires_grad` is false by default for user-created variables. If either of them disables gradients, the gradient will not be computed in the backward pass for this layer, and some optimisations (like not allocating a gradient tensor) are applied during the forward pass.
You can switch the autograd off temporarily with `no_grad_mode` and check whether a gradient is needed with `is_grad_needed`. `no_grad_mode` is aimed at users, for validation or inference, while `is_grad_needed` is for use when implementing NN layers.
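For illustration, a hedged usage sketch (assuming the public `newContext`/`variable` API and the `+` gate for Variables):

```nim
import arraymancer

let ctx = newContext Tensor[float32]
let x = ctx.variable([[1.0'f32, 2.0], [3.0, 4.0]].toTensor)

# User side: disable graph recording for validation/inference.
ctx.no_grad_mode:
  let y = x + x      # forward pass only, no node is appended
  echo y.value

# Layer-implementer side: query whether a gradient will be needed.
echo x.is_grad_needed  # false: requires_grad defaults to false
```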
The `Gate` object is responsible for storing everything needed for the backward pass, if there is one. The `Gate` name comes from Karpathy's Hacker guide to NN and the CS231n course.
Gates inherit from `Gate[TT]`; you add fields if state is needed for backpropagation. For example, `AddGate` has no state, since the gradient of addition with respect to each input is the constant 1. For matrix multiplication we need to remember both inputs, and for MaxPool we don't need to remember any tensor, only the shape of the input, which indices were selected, and the operation metadata. See the combined sketch below.
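A hedged sketch of those three gates, building on the types above (field names are illustrative, not the exact Arraymancer source):

```nim
type
  AddGate[TT] = ref object of Gate[TT]
    # No fields: the gradient of addition w.r.t. each input is 1,
    # there is nothing to remember.

  MatMulGate[TT] = ref object of Gate[TT]
    # Both inputs are needed: for C = A * B,
    # dC/dA = grad_C * B^T and dC/dB = A^T * grad_C.
    a: Variable[TT]
    b: Variable[TT]

  MaxPool2DGate[TT] = ref object of Gate[TT]
    # No tensor is cached, only the input shape, the selected
    # indices and the operation metadata.
    cached_input_shape: seq[int]
    cached_max_indices: Tensor[int]
    kernel, padding, stride: tuple[h, w: int]
```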
**The `forward` proc**

Each gate has a `forward` proc (the name and type signature are not a constraint). It is in charge of unpacking the `Variables`, sending the values to the low-level tensor implementation, and repacking the result into `Variables`. This is normally very simple, for example with addition.
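A hedged sketch of such a forward proc (assuming the types above; not the exact Arraymancer source):

```nim
proc forward[TT](self: AddGate[TT], a, b: Variable[TT]): Variable[TT] =
  # Unpack the Variables, defer to the tensor-level `+`,
  # and repack the result into a Variable.
  new result
  result.context = a.context
  result.value = a.value + b.value
```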
Just be aware that with reductions and loss functions, the base procs on tensors produce a scalar, so you must unfortunately rewrap it in a Tensor and then a Variable, for example for `mean`.
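A hedged version of that rewrapping (`MeanGate` follows the naming pattern above; `[scalar].toTensor` turns the scalar back into a rank-1 tensor):

```nim
proc forward[TT](self: MeanGate[TT], a: Variable[TT]): Variable[TT] =
  new result
  result.context = a.context
  # `mean` on a tensor returns a scalar: rewrap it before storing.
  result.value = [a.value.mean].toTensor
```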
MSELoss works the same way.
If there is an inference-only fast path, this is where to add it, as is done for GRU.
**The `backward` method**

First of all, it is public only because `method`s must be public for dynamic dispatch. Once you have your sequence of operations in the autograd context and you call `backward` on your `Variable`, Arraymancer will repeatedly call `backward` (line 227) from your `Variable`'s node until the context is popped empty (line 225):
So `backward` must have the following signature:

```nim
method backward*[TT](self: Gate[TT], payload: Payload[TT]): SmallDiffs[TT]
```
`SmallDiffs[TT]` is just a sequence of `Tensor[T]` holding the gradient corrections for this layer. It must have the same length and the same order as the `Parents` in the node's `parents` field.
If the method signature is not enough, you have to pass the data as fields of the gate; see the MaxPool2D layer metadata again, sketched below.
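For illustration, a hedged sketch of that backward; `maxpool2d_backward` is an assumed primitive name, and the exact `Payload` unwrapping may differ from the real source:

```nim
method backward[TT](self: MaxPool2DGate[TT], payload: Payload[TT]): SmallDiffs[TT] =
  # The incoming gradient lives on the result Variable carried by the payload.
  let gradient = payload.variable.grad
  result = newSeq[TT](1)            # one diff per parent, in parent order
  result[0] = maxpool2d_backward(   # hypothetical low-level primitive
    self.cached_input_shape,
    self.cached_max_indices,
    gradient)
```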
**The `gate` proc (gate in lower-case)**

This is where it gets complex, though in reality it's pretty repetitive. First of all, this proc is public, and it is how users interact with the autograd. It should mirror operations on normal tensors if they exist (like `+`, `*`, `sum`, `mean`) or mirror the PyTorch nn.functional API (`conv2D`).
In terms of layout:

1. Create the `Gate` object and fill its fields with what is needed for backprop (this will change with #301, when backprop is not required).
2. Create the `Node`, add the parents.
3. Create the result `Variable`(s). `Variable`, `Gate` and `Node` have reference semantics, so you can hold on to them even after they've been pushed (though I'm always looking into ways to avoid that).
4. Fill the `grad` fields of each result `Variable` if a gradient is needed.

For example for maxpool, see the sketch below.
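A hedged sketch following the four steps above, building on the earlier type sketches (`zeros_like`, `toSeq` on the shape, and the tuple-returning low-level `maxpool2d` are assumptions, not the exact Arraymancer source):

```nim
proc maxpool2d*[TT](input: Variable[TT],
                    kernel, padding, stride: tuple[h, w: int]): Variable[TT] =
  # Low-level forward pass on the raw tensor.
  let (max_indices, pooled) = maxpool2d(input.value, kernel, padding, stride)

  # 1. Create the Gate and fill it with what backprop needs.
  var gate: MaxPool2DGate[TT]
  new gate
  gate.cached_input_shape = input.value.shape.toSeq
  gate.cached_max_indices = max_indices
  gate.kernel = kernel
  gate.padding = padding
  gate.stride = stride

  # 2. Create the Node and register the parents.
  var node: Node[TT]
  new node
  node.gate = gate
  node.parents = @[input]
  input.context.nodes.add node

  # 3. Create the result Variable.
  new result
  result.context = input.context
  result.value = pooled
  node.payload = result

  # 4. Allocate the gradient if it is needed.
  if input.is_grad_needed:
    result.grad = zeros_like(result.value)
    result.requires_grad = true
```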
Note that for this API, weights are not hidden from the user (just like in PyTorch); hiding the weight machinery is the role of the neural network DSL. For example, GRU has an unhealthy number of inputs (which is why PyTorch doesn't offer a functional API for it, I guess).
Once you've gone through all of this, you need to test. Apart from some exceptions, the tests take the following form:

1. Wrap the op in a proc that reduces to a scalar: `proc myproc_myinput(myinput: Tensor[float64]): float64 = myproc(input1, input2, input3).sum()`
2. Create the target gradient using `numerical_gradient(myinput, myproc_myinput)`.
3. Check the autograd's analytical gradient against that target.

For example with the `stack` operation:
https://github.com/mratsim/Arraymancer/blob/3ae364dc31647ac580464979d251868a25629ce5/tests/autograd/test_gate_shapeshifting.nim#L20-L39

A couple of notes:
- `float64` is used for precision, even though in NN we mostly use float32 or float16.
- `sum` is used to reduce to a single number. Since `sum` has a gradient of 1, it doesn't modify the gradient during backpropagation.
- `+` and `-` are tensor addition and subtraction; they were used to ensure that `stack` properly splits the backpropagation. This is not needed for computation layers, but it is nice to have for copy/reshaping/splitting/concatenating layers.

Then you wrap the same tensors in Variables, run your forward pass, and check either the `mean_relative_error` or the `mean_absolute_error`.
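A hedged sketch of the whole flow, using `tanh` (which already has an autograd gate) as the op under test; the helper names follow the Arraymancer tests but may have drifted:

```nim
import arraymancer, unittest

let ctx = newContext Tensor[float64]
let x = randomTensor([4, 4], 1.0)

# Target gradient via numerical differentiation of a scalar-producing proc.
proc tanh_sum(t: Tensor[float64]): float64 = t.tanh.sum()
let target_grad = x.numerical_gradient(tanh_sum)

# Analytical gradient: wrap the same tensor in a Variable and backprop.
let vx = ctx.variable(x, requires_grad = true)
vx.tanh.sum().backprop()

check: vx.grad.mean_relative_error(target_grad) < 1e-8
```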
Back to the exceptions: those are the first layers I implemented, like `Add` and `MatMul`. They have a specific, known gradient formula, so I tested against that instead of a numerical gradient: https://github.com/mratsim/Arraymancer/blob/3ae364dc31647ac580464979d251868a25629ce5/tests/autograd/test_gate_basic.nim#L19-L36
If the primitive already exists, there is probably already a test for the forward pass in https://github.com/mratsim/Arraymancer/tree/master/tests/tensor, or a test for both forward and backward in https://github.com/mratsim/Arraymancer/tree/master/tests/nn_primitives. If only a forward test exists, we need a backpropagation test as well in https://github.com/mratsim/Arraymancer/tree/master/tests/autograd.
I test the primitives separately (forward separated from backward) when the layer is quite complex, like `convolution` or `gru`, and I want to make sure I've implemented the forward pass properly before trying to find bugs in the backward pass. That is followed by a full-blown forward+backward test with random values and numerical gradients.
This reminds me that I didn't talk about `Loss` layers. Instead of inheriting from `Gate` directly, they inherit from `Loss` or `SparseLoss`. Otherwise it's exactly the same; see for example MSELoss.
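A hedged sketch of that inheritance (the cached field is illustrative):

```nim
type MSELoss[TT] = ref object of Loss[TT]
  cache: Variable[TT]   # the input is kept for the backward pass
```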
I have a very strange issue with the `backward` generic method in #327 involving object variant fields. When creating a type that contains a `Variable[Tensor[Something]]`: if Something is generic it works; if Something is concrete, I get an undeclared identifier in a `backward` method. This is annoying when creating a model from scratch. This affected me again when I was trying to implement new optimisers in this branch: https://github.com/mratsim/Arraymancer/tree/optimizers
So tonight I'll refactor the autograd to completely remove `method`s from the codebase. These will be the new types for the autograd:
```nim
type
  Context*[TT] = ref object
    nodes: seq[Node[TT]]

  Node*[TT] = object
    gate: Gate[TT]
    # Plain proc pointer instead of a method: no dynamic dispatch needed.
    backward: proc(self: Gate[TT], payload: TT): TT {.nimcall.}

  Gate*[TT] = ref object of RootObj

type
  AddGate[TT] = ref object of Gate[TT]
    cache: TT

  MulGate[TT] = ref object of Gate[TT]
    cache_a, cache_b: TT

proc backAdd[TT](self: Gate[TT], payload: TT): TT =
  # Downcast from the base Gate (proc parameter types are invariant in Nim,
  # so the stored proc must take Gate[TT] and convert internally).
  let self = AddGate[TT](self)
  echo "Add backward"
  echo self.cache

proc backMul[TT](self: Gate[TT], payload: TT): TT =
  let self = MulGate[TT](self)
  echo "Mul backward"
  echo self.cache_a
  echo self.cache_b

let ctx = Context[int]()
block:
  let a = AddGate[int](cache: 10)
  let na = Node[int](gate: a, backward: backAdd[int])
  ctx.nodes.add na
block:
  let b = MulGate[int](cache_a: 100, cache_b: 1000)
  let nb = Node[int](gate: b, backward: backMul[int])
  ctx.nodes.add nb

for node in ctx.nodes:
  discard node.backward(node.gate, 1234)
```
I.e. `Node` will have a new `backward` field that takes a proc, so all layers will also have to fill it. I will also change nodes to plain stack objects to reduce stress on the GC. As a benefit, `backward` can now be private and named anything.

Wow, this is great documentation!! Together with what you told me in my previous issue, it makes a great tutorial for contributing to Arraymancer.
I have to say, it is a very clean architecture! I will try to follow your indications and start contributing something easy (I cannot really promise anything, though, since I may have more pressing things to do at work, but I'll try my best). I hope I am not bothering you too much by asking questions in the issue tracker; if so, let me know.