mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

[NN primitives] unify var parameters vs return values #175

Open mratsim opened 6 years ago

mratsim commented 6 years ago

The library is inconsistent in nn_primitives about when to use return values and when to use var parameters that are updated in place.

var parameter

https://github.com/mratsim/Arraymancer/blob/92aa82e8c7d2a30c4497d5cf367c89f30c648a68/src/nn_primitives/nnp_linear.nim#L20-L28

https://github.com/mratsim/Arraymancer/blob/92aa82e8c7d2a30c4497d5cf367c89f30c648a68/src/nn_primitives/fallback/conv.nim#L19-L20

https://github.com/mratsim/Arraymancer/blob/92aa82e8c7d2a30c4497d5cf367c89f30c648a68/src/nn_primitives/nnp_convolution.nim#L65-L70

return value

https://github.com/mratsim/Arraymancer/blob/92aa82e8c7d2a30c4497d5cf367c89f30c648a68/src/nn_primitives/fallback/conv.nim#L57-L59

https://github.com/mratsim/Arraymancer/blob/92aa82e8c7d2a30c4497d5cf367c89f30c648a68/src/nn_primitives/nnp_convolution.nim#L28-L31

Possibilities

  1. Always use var

    • πŸ‘ Makes sure there is no extra allocation
    • πŸ‘ Consistency
    • πŸ‘Ž Must declare the result before calling the function (and compute its shape, e.g. for im2col)
    • πŸ‘Ž No chaining
  2. Always use return values

    • πŸ‘ Consistency
    • πŸ‘ No call burden
    • πŸ”’ Sometimes return value is a tuple (i.e. forces to read doc for unwrapping)
      • Can be alleviated with named tuple parameters
      • If tuple, it's not easy to chain functions (but is it needed for nn_primitives?)
  3. Return value if single, var parameters otherwise?

    • πŸ‘Ž Inconsistent
    • The worst of both worlds?
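
For concreteness, a minimal Nim sketch of the two styles (toy type and procs for illustration only, not the actual Arraymancer API): option 1 writes into a caller-provided var destination, option 2 allocates and returns, and a named-tuple return shows the unwrapping point raised above.

type Tensor = object
  data: seq[float]

# Option 1 style: the caller owns `output`; nothing is allocated here when
# the destination already has the right size.
proc relu_inplace(output: var Tensor, input: Tensor) =
  if output.data.len != input.data.len:
    output.data.setLen(input.data.len)
  for i, x in input.data:
    output.data[i] = max(x, 0.0)

# Option 2 style: allocates a fresh tensor and returns it, which allows chaining.
proc relu(input: Tensor): Tensor =
  result.data = newSeq[float](input.data.len)
  for i, x in input.data:
    result.data[i] = max(x, 0.0)

# Option 2 with a named tuple: callers can unwrap by field name instead of position.
proc minmax(input: Tensor): tuple[lo, hi: float] =
  result = (lo: input.data[0], hi: input.data[0])
  for x in input.data:
    result.lo = min(result.lo, x)
    result.hi = max(result.hi, x)

when isMainModule:
  let x = Tensor(data: @[-1.0, 2.0, -3.0])
  var y: Tensor
  relu_inplace(y, x)        # option 1: explicit destination
  echo y.data               # @[0.0, 2.0, 0.0]
  echo relu(x).data         # option 2: chainable expression
  echo minmax(x).hi         # 2.0, accessed by field name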
sherjilozair commented 6 years ago

Would both be feasible? Not having to allocate extra memory is a huge plus. Memory issues are the biggest bottleneck in model design, and TensorFlow and others are considering in-place operations. If Arraymancer can support memory reuse from the get-go, that would be great.

mratsim commented 6 years ago

I'm planning for the low-level API for layers (i.e. convolution, maxpooling, etc.) to only use var params and avoid memory allocation. For activations like relu or sigmoid, I already provide in-place versions: mrelu and msigmoid.

Future considerations

Sink / Destructors (#150)

Ideally, with the future sink/destructors, I can provide an optimization for when only a single reference is used, with a signature like this:

proc sigmoid[TT](v: sink Variable[TT]): Variable[TT] {.inline.} =
  # `sink` transfers ownership of `v`, so its storage can be updated by the
  # in-place msigmoid and handed back without an extra allocation.
  msigmoid(v)
  result = v

Object pool / memory regions (#112)

I have started on an object pool due to the cost of allocating memory on Cuda. Custom allocators are planned with Nim destructors, which would ease the implementation of memory regions/object pools that reuse already-allocated memory. See the stub implementation here.
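
As a rough illustration of the idea (a hypothetical sketch, not the linked stub): a pool keyed by buffer size that hands previously released buffers back to later requests instead of reallocating.

import tables

type BufferPool = object
  free: Table[int, seq[seq[float]]]   # buffer size -> stack of reusable buffers

proc request(pool: var BufferPool, size: int): seq[float] =
  # Reuse a previously released buffer of the same size if one is available,
  # otherwise allocate a new one.
  if size in pool.free and pool.free[size].len > 0:
    result = pool.free[size].pop()
  else:
    result = newSeq[float](size)

proc release(pool: var BufferPool, buf: sink seq[float]) =
  # Hand the buffer back to the pool for later reuse.
  pool.free.mgetOrPut(buf.len, newSeq[seq[float]]()).add(buf)

when isMainModule:
  var pool = BufferPool(free: initTable[int, seq[seq[float]]]())
  var a = pool.request(1024)      # allocates
  pool.release(a)
  let b = pool.request(1024)      # reuses the buffer released just above
  echo b.len                      # 1024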

mratsim commented 6 years ago

Removing the easy tag.

It seems like I would need 3 levels of API:

  1. A very low-level API that doesn't allocate at all. Workspace buffers have to be passed as inputs, Fortran-style. In-place activation functions would live there (naming to be defined).
  2. A functional API, similar to PyTorch's. It should allow chaining like x.function(weight, bias). This can be built at the Tensor level, with an overload for Variables.
  3. A Keras-like high-level API, which was started with the domain-specific language in #219. It should be syntactic sugar for initializing weights and biases, inferring their shapes, and passing them as parameters, i.e. x.function instead of x.function(weight, bias).

Level 1 is key for efficient RNNs, as we can reuse the buffers across timesteps instead of allocating a new one at every step (see the sketch below). It is also required for higher-level OpenMP or MPI parallelism (model parallelism), as allocations within an OpenMP context will crash.
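
A hypothetical sketch of that level-1 style (toy procs, not the actual Arraymancer API): the caller allocates a workspace once and passes it into every timestep, so the timestep loop itself performs no allocation.

type Workspace = object
  hidden: seq[float]       # reused hidden-state buffer
  scratch: seq[float]      # reused temporary buffer

proc initWorkspace(hiddenSize: int): Workspace =
  Workspace(hidden: newSeq[float](hiddenSize),
            scratch: newSeq[float](hiddenSize))

# Toy RNN step: updates the hidden state in place using the caller's buffers.
proc rnnStep(ws: var Workspace, input: openArray[float]) =
  for i in 0 ..< ws.hidden.len:
    ws.scratch[i] = 0.5 * ws.hidden[i] + input[i mod input.len]
  for i in 0 ..< ws.hidden.len:
    ws.hidden[i] = max(ws.scratch[i], 0.0)   # in-place activation

when isMainModule:
  var ws = initWorkspace(4)
  for _ in 0 ..< 10:                 # 10 timesteps, no allocation inside the loop
    rnnStep(ws, [0.1, 0.2, 0.3])
  echo ws.hidden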