Open andreaferretti opened 5 years ago

This time I am trying to port operations from the tensor level to the variable level. For instance, if I try to take the mean along a particular axis of a variable, I get an error. Apparently, the operation of taking the mean of a tensor along some axis is already supported; I am just not sure how to lift it to the variable level.

I have a similar issue trying to compute a softmax: it is implemented for tensors, but not for variables. (Actually I would need log_softmax, but that's for another moment; let's start with what we already have.) Should I manually define a backward propagation operator? Or can the autograd help me in these simple cases?

I guess these questions are a little silly, but without being familiar with the inner workings of Arraymancer, even small issues like these can leave me unsure how to proceed.
I don't think those questions are silly at all. The autograd part still being in flux is why I mention in the README:
> **Arraymancer as a Deep Learning library**
> Deep learning features can be explored but are considered unstable while I iron out their final interface.
>
> Reminder: The final interface is still work in progress.
So regarding `mean`: you can refer to how I ported `mean` and `sum`.
I should probably make a tutorial about adding new autograd operations.
For softmax, you can refer to the `sigmoid` or `tanh` implementations.
The low-level tensor operations live in the `tensor` folder. The neural network code is in the `nn` folder, https://github.com/mratsim/Arraymancer/tree/master/src/nn, which is split into:

- `nn_primitives`: https://github.com/mratsim/Arraymancer/tree/master/src/nn_primitives
- `nn_dsl`: https://github.com/mratsim/Arraymancer/tree/master/src/nn_dsl

A note on type convention: when you see `TT`, it means `Tensor[T]`.
The autograd is a tree (directed acyclic graph) called `Context`. It starts empty, and when you execute an operation on a `Variable` during the forward pass, a node is appended to the tree. A node contains:

- the `Gate`, i.e. the operation
- the input `Variables` to this operation, called the `Parents`
- the result `Variable`(s), called the `Payload`

`Variables` are wrappers around a Tensor and its gradient.
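A minimal sketch of these types (close to the Arraymancer definitions, but treat exact field names and types as illustrative):

```nim
type
  Context*[TT] = ref object
    ## The autograd "tape": nodes are appended during the forward pass.
    nodes: seq[Node[TT]]
    no_grad: bool

  Node*[TT] = ref object
    gate: Gate[TT]               # the operation
    parents: seq[Variable[TT]]   # the input Variables of the operation
    payload: Variable[TT]        # the result Variable(s)

  Gate*[TT] = ref object of RootObj
    ## Base object; concrete gates add the fields they need for backprop.

  Variable*[TT] = ref object
    context*: Context[TT]
    value*: TT      # the wrapped Tensor[T]
    grad*: TT       # its gradient, same shape as `value` when needed
    requires_grad*: bool
```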
Note that the gradient should always have the same shape as the value (except when no gradient is needed).
The context has a field called `no_grad`, and there is also finer control per `Variable` with `requires_grad`. Only the weights of learnable layers require a gradient, so `requires_grad` is false by default for user-created variables. If either of them disables gradients, the gradient will not be computed in the backward pass for this layer, and some optimisations (like not allocating a gradient tensor) are applied during the forward pass.
You can switch the autograd off temporarily with `no_grad_mode` and check whether a gradient is needed with `is_grad_needed`. `no_grad_mode` is aimed at users, for validation or inference, while `is_grad_needed` is for use when implementing NN layers.
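For illustration, a hedged usage sketch (assuming the public `newContext`/`variable` API and the `+` gate for Variables):

```nim
import arraymancer

let ctx = newContext Tensor[float32]
let x = ctx.variable([[1.0'f32, 2.0], [3.0, 4.0]].toTensor)

# User side: disable graph recording for validation/inference.
ctx.no_grad_mode:
  let y = x + x      # forward pass only, no node is appended
  echo y.value

# Layer-implementer side: query whether a gradient will be needed.
echo x.is_grad_needed  # false: requires_grad defaults to false
```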
The `Gate` object is responsible for storing everything needed for the backward pass, if there is one. The `Gate` name comes from Karpathy's Hacker guide to NN and the CS231n course.
Gates inherit from `Gate[TT]`; you add fields if state is needed for backpropagation. For example, `AddGate` has no state, since the gradient of addition with respect to each input is the constant 1. For matrix multiplication we need to remember both inputs, and for MaxPool we don't need to remember any tensor, only the shape of the input, which indices were selected, and the operation metadata. See the combined sketch below.
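A hedged sketch of those three gates, building on the types above (field names are illustrative, not the exact Arraymancer source):

```nim
type
  AddGate[TT] = ref object of Gate[TT]
    # No fields: the gradient of addition w.r.t. each input is 1,
    # there is nothing to remember.

  MatMulGate[TT] = ref object of Gate[TT]
    # Both inputs are needed: for C = A * B,
    # dC/dA = grad_C * B^T and dC/dB = A^T * grad_C.
    a: Variable[TT]
    b: Variable[TT]

  MaxPool2DGate[TT] = ref object of Gate[TT]
    # No tensor is cached, only the input shape, the selected
    # indices and the operation metadata.
    cached_input_shape: seq[int]
    cached_max_indices: Tensor[int]
    kernel, padding, stride: tuple[h, w: int]
```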
**The `forward` proc**

Each gate has a `forward` proc (the name and type signature are not a constraint). It is in charge of unpacking the `Variables`, sending the values to the low-level tensor implementation, and repacking the result into `Variables`. This is normally very simple, for example with addition.
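A hedged sketch of such a forward proc (assuming the types above; not the exact Arraymancer source):

```nim
proc forward[TT](self: AddGate[TT], a, b: Variable[TT]): Variable[TT] =
  # Unpack the Variables, defer to the tensor-level `+`,
  # and repack the result into a Variable.
  new result
  result.context = a.context
  result.value = a.value + b.value
```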
Just be aware that with reductions and loss functions, the base procs on tensors produce a scalar, so you must unfortunately rewrap it in a Tensor and then a Variable, for example for `mean`.
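A hedged version of that rewrapping (`MeanGate` follows the naming pattern above; `[scalar].toTensor` turns the scalar back into a rank-1 tensor):

```nim
proc forward[TT](self: MeanGate[TT], a: Variable[TT]): Variable[TT] =
  new result
  result.context = a.context
  # `mean` on a tensor returns a scalar: rewrap it before storing.
  result.value = [a.value.mean].toTensor
```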
MSELoss works the same way.
If there is an inference-only fast path, this is where to add it, as is done for GRU.
**The `backward` method**

First of all, it is public only because `method`s must be public for dynamic dispatch. Once you have your sequence of operations in the autograd context and you call `backward` on your `Variable`, Arraymancer will repeatedly call `backward` (line 227) from your `Variable`'s node until the context is popped empty (line 225):
So `backward` must have the following signature:

```nim
method backward*[TT](self: Gate[TT], payload: Payload[TT]): SmallDiffs[TT]
```
`SmallDiffs[TT]` is just a sequence of `Tensor[T]` holding the gradient corrections for this layer. It must have the same length and the same order as the `Parents` in the node's `parents` field.
If the method signature is not enough, you have to pass the data as fields of the gate; see the MaxPool2D layer metadata again, sketched below.
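For illustration, a hedged sketch of that backward; `maxpool2d_backward` is an assumed primitive name, and the exact `Payload` unwrapping may differ from the real source:

```nim
method backward[TT](self: MaxPool2DGate[TT], payload: Payload[TT]): SmallDiffs[TT] =
  # The incoming gradient lives on the result Variable carried by the payload.
  let gradient = payload.variable.grad
  result = newSeq[TT](1)            # one diff per parent, in parent order
  result[0] = maxpool2d_backward(   # hypothetical low-level primitive
    self.cached_input_shape,
    self.cached_max_indices,
    gradient)
```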
**The `gate` proc (gate in lower-case)**

This is where it gets complex, though in reality it's pretty repetitive. First of all, this proc is public, and it is how users interact with the autograd. It should mirror operations on normal tensors if they exist (like `+`, `*`, `sum`, `mean`) or mirror the PyTorch nn.functional API (`conv2D`).
In terms of layout:

1. Create the `Gate` object and fill its fields with what is needed for backprop (this will change with #301, when backprop is not required).
2. Create the `Node`, add the parents.
3. Create the result `Variable`(s). `Variable`, `Gate` and `Node` have reference semantics, so you can hold on to them even after they've been pushed (though I'm always looking into ways to avoid that).
4. Fill the `grad` fields of each result `Variable` if a gradient is needed.

For example for maxpool, see the sketch below.
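A hedged sketch following the four steps above, building on the earlier type sketches (`zeros_like`, `toSeq` on the shape, and the tuple-returning low-level `maxpool2d` are assumptions, not the exact Arraymancer source):

```nim
proc maxpool2d*[TT](input: Variable[TT],
                    kernel, padding, stride: tuple[h, w: int]): Variable[TT] =
  # Low-level forward pass on the raw tensor.
  let (max_indices, pooled) = maxpool2d(input.value, kernel, padding, stride)

  # 1. Create the Gate and fill it with what backprop needs.
  var gate: MaxPool2DGate[TT]
  new gate
  gate.cached_input_shape = input.value.shape.toSeq
  gate.cached_max_indices = max_indices
  gate.kernel = kernel
  gate.padding = padding
  gate.stride = stride

  # 2. Create the Node and register the parents.
  var node: Node[TT]
  new node
  node.gate = gate
  node.parents = @[input]
  input.context.nodes.add node

  # 3. Create the result Variable.
  new result
  result.context = input.context
  result.value = pooled
  node.payload = result

  # 4. Allocate the gradient if it is needed.
  if input.is_grad_needed:
    result.grad = zeros_like(result.value)
    result.requires_grad = true
```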
Note that for this API, weights are not hidden from the user (just like in PyTorch); hiding the weight machinery is the role of the neural network DSL. For example, GRU has an unhealthy number of inputs (which is why PyTorch doesn't offer a functional API for it, I guess).
Once you've gone through all of this, you need to test. Apart from some exceptions, the tests take the following form:

1. Wrap the op in a proc that reduces to a scalar: `proc myproc_myinput(myinput: Tensor[float64]): float64 = myproc(input1, input2, input3).sum()`
2. Create the target gradient using `numerical_gradient(myinput, myproc_myinput)`.
3. Check the autograd's analytical gradient against that target.

For example with the `stack` operation:
https://github.com/mratsim/Arraymancer/blob/3ae364dc31647ac580464979d251868a25629ce5/tests/autograd/test_gate_shapeshifting.nim#L20-L39

A couple of notes:
- `float64` is used for precision, even though in NN we mostly use float32 or float16.
- `sum` is used to reduce to a single number. Since `sum` has a gradient of 1, it doesn't modify the gradient during backpropagation.
- `+` and `-` are tensor addition and subtraction; they were used to ensure that `stack` properly splits the backpropagation. This is not needed for computation layers, but it is nice to have for copy/reshaping/splitting/concatenating layers.

Then you wrap the same tensors in Variables, run your forward pass, and check either the `mean_relative_error` or the `mean_absolute_error`.
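A hedged sketch of the whole flow, using `tanh` (which already has an autograd gate) as the op under test; the helper names follow the Arraymancer tests but may have drifted:

```nim
import arraymancer, unittest

let ctx = newContext Tensor[float64]
let x = randomTensor([4, 4], 1.0)

# Target gradient via numerical differentiation of a scalar-producing proc.
proc tanh_sum(t: Tensor[float64]): float64 = t.tanh.sum()
let target_grad = x.numerical_gradient(tanh_sum)

# Analytical gradient: wrap the same tensor in a Variable and backprop.
let vx = ctx.variable(x, requires_grad = true)
vx.tanh.sum().backprop()

check: vx.grad.mean_relative_error(target_grad) < 1e-8
```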
Back to the exceptions: those are the first layers I implemented, like `Add` and `MatMul`. They have a specific, known gradient formula, so I tested against that instead of a numerical gradient: https://github.com/mratsim/Arraymancer/blob/3ae364dc31647ac580464979d251868a25629ce5/tests/autograd/test_gate_basic.nim#L19-L36
If the primitive already exists, there is probably already a test for the forward pass in https://github.com/mratsim/Arraymancer/tree/master/tests/tensor, or a test for both forward and backward in https://github.com/mratsim/Arraymancer/tree/master/tests/nn_primitives. If only a forward test exists, we need a backpropagation test as well in https://github.com/mratsim/Arraymancer/tree/master/tests/autograd.
I test the primitives separately (forward separated from backward) when the layer is quite complex, like `convolution` or `gru`, and I want to make sure I've implemented the forward pass properly before trying to find bugs in the backward pass. That is followed by a full-blown forward+backward test with random values and numerical gradients.
This reminds me that I didn't talk about `Loss` layers. Instead of inheriting from `Gate` directly, they inherit from `Loss` or `SparseLoss`. Otherwise it's exactly the same; see for example MSELoss.
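A hedged sketch of that inheritance (the cached field is illustrative):

```nim
type MSELoss[TT] = ref object of Loss[TT]
  cache: Variable[TT]   # the input is kept for the backward pass
```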
I have a very strange issue with the `backward` generic method in #327 involving object variant fields. When creating a type that contains a `Variable[Tensor[Something]]`: if Something is generic it works; if Something is concrete, I get an undeclared identifier in a `backward` method. This is annoying when creating a model from scratch. This affected me again when I was trying to implement new optimisers in this branch: https://github.com/mratsim/Arraymancer/tree/optimizers
So tonight I'll refactor the autograd to completely remove `method`s from the codebase. These will be the new types for the autograd:
```nim
type
  Context*[TT] = ref object
    nodes: seq[Node[TT]]

  Node*[TT] = object
    gate: Gate[TT]
    # Plain proc pointer instead of a method: no dynamic dispatch needed.
    backward: proc(self: Gate[TT], payload: TT): TT {.nimcall.}

  Gate*[TT] = ref object of RootObj

type
  AddGate[TT] = ref object of Gate[TT]
    cache: TT

  MulGate[TT] = ref object of Gate[TT]
    cache_a, cache_b: TT

proc backAdd[TT](self: Gate[TT], payload: TT): TT =
  # Downcast from the base Gate (proc parameter types are invariant in Nim,
  # so the stored proc must take Gate[TT] and convert internally).
  let self = AddGate[TT](self)
  echo "Add backward"
  echo self.cache

proc backMul[TT](self: Gate[TT], payload: TT): TT =
  let self = MulGate[TT](self)
  echo "Mul backward"
  echo self.cache_a
  echo self.cache_b

let ctx = Context[int]()
block:
  let a = AddGate[int](cache: 10)
  let na = Node[int](gate: a, backward: backAdd[int])
  ctx.nodes.add na
block:
  let b = MulGate[int](cache_a: 100, cache_b: 1000)
  let nb = Node[int](gate: b, backward: backMul[int])
  ctx.nodes.add nb

for node in ctx.nodes:
  discard node.backward(node.gate, 1234)
```
I.e. `Node` will have a new `backward` field that takes a proc, so all layers will also have to fill it. I will also change nodes to plain stack objects to reduce stress on the GC. As a benefit, `backward` can now be private and named anything.

Wow, this is great documentation!! Together with what you told me in my previous issue, it makes a great tutorial for contributing to Arraymancer.
I have to say, it is a very clean architecture! I will try to follow your indications and start contributing something easy (I cannot really promise anything, though, since I may have more pressing things to do at work, but I'll try my best). I hope I am not bothering you too much by asking questions in the issue tracker; if so, let me know.