originrose / cortex

Selu Activation + High Level Tensor Operations #247

Closed gigasquid closed 6 years ago

gigasquid commented 7 years ago

This PR adds support for the SELU activation (https://github.com/thinktopic/cortex/issues/181).

The implementation takes a high-level, tensor-based approach and introduces an attempt at tensor functions that wrap the unary and binary tensor operations to make them friendlier to use.

Note: this is very open to feedback. If preferred, the implementation can instead be added simply as a unary tensor operation :selu, without any of the other high-level tensor function additions.

The rationale is that it would be nice to implement things like new activations in one place using tensor abstractions, with a clean syntax so that the logic of the activation is clear.
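
To make the idea concrete, here is a plain-Clojure illustration of the calling convention those wrappers follow (this is not the cortex tensor API, just a sketch): the destination comes first and is returned, so wrapped operations thread naturally with `->`, as in the `selu` example further down.

```clojure
;; Illustration only (plain Clojure on double arrays, not cortex tensors):
;; a destination-first op that returns its destination so it composes with ->.
(defn exp-into
  "Fills the double-array output with e^x of each element of input and returns output."
  [^doubles output input]
  (dotimes [i (alength output)]
    (aset output i (Math/exp (double (nth input i)))))
  output)

(comment
  (seq (exp-into (double-array 3) [0.0 1.0 2.0]))
  ;; => (1.0 2.718281828459045 7.38905609893065)
  )
```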

At first glance, I thought the SELU activation would be pretty easy to implement. The definition is:

```clojure
(def SELU_ALPHA 1.6732632423543772848170429916717)
(def SELU_LAMBDA 1.0507009873554804934193349852946)
```

- forward: `lambda * x` for `x > 0`, and `lambda * ((alpha * exp(x)) - alpha)` for `x <= 0`
- gradient: `lambda` for `x > 0`, and `lambda * alpha * exp(x)` for `x <= 0`

It would indeed be straightforward as a new unary operation, but it was more challenging to do from a higher-level tensor standpoint.
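
For reference (not part of this PR's code), the two formulas translate directly into plain Clojure scalar functions, which are handy for sanity-checking values:

```clojure
;; Scalar reference versions of the SELU forward pass and its gradient,
;; taken directly from the formulas above; useful only for checking values.
(defn selu-scalar [x]
  (if (pos? x)
    (* SELU_LAMBDA x)
    (* SELU_LAMBDA (- (* SELU_ALPHA (Math/exp x)) SELU_ALPHA))))

(defn selu-gradient-scalar [x]
  (if (pos? x)
    SELU_LAMBDA
    (* SELU_LAMBDA SELU_ALPHA (Math/exp x))))
```

For example, `(selu-scalar 1.0)` is just `SELU_LAMBDA` and `(selu-scalar 0.0)` is `0.0`, so the two branches meet at zero.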

I thought it would be good to see what it would take to support it at that level. With the changes in this PR, the SELU activation looks like:

```clojure
(def SELU_ALPHA 1.6732632423543772848170429916717)
(def SELU_LAMBDA 1.0507009873554804934193349852946)

(defn selu
  "lambda * x for x > 0 and lambda * ((alpha * exp(x)) - alpha) for x <= 0"
  [input output]
  (where output
         (> (new-tensor input) input 0)
         ;; lambda * x for x > 0
         (* (new-tensor input) input SELU_LAMBDA)
         ;; lambda * ((alpha * exp(x)) - alpha) for x <= 0
         (-> (exp (new-tensor input) input)
             (* SELU_ALPHA)
             (- SELU_ALPHA)
             (* SELU_LAMBDA))))
```
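
For completeness (this is not code from the PR), the gradient could probably be written in the same style. The sketch below reuses only the `where`, `>`, `*`, `exp`, and `new-tensor` helpers with the calling conventions shown above, and it assumes `new-tensor` returns a zero-filled tensor so that `exp` of it gives a tensor of ones for the constant branch:

```clojure
;; Sketch only -- SELU gradient in the same high-level style as selu above.
;; Assumes new-tensor zero-fills, so (exp fresh fresh) produces a tensor of ones.
(defn selu-gradient
  "lambda for x > 0 and lambda * alpha * exp(x) for x <= 0"
  [input output]
  (where output
         (> (new-tensor input) input 0)
         ;; constant lambda where x > 0 (ones tensor scaled by lambda)
         (-> (exp (new-tensor input) (new-tensor input))
             (* SELU_LAMBDA))
         ;; lambda * alpha * exp(x) where x <= 0
         (-> (exp (new-tensor input) input)
             (* SELU_ALPHA)
             (* SELU_LAMBDA))))
```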

I also tested it against ReLU on the MNIST example network (without augmentation):

```clojure
(defn initial-description
  [input-w input-h num-classes]
  [(layers/input input-w input-h 1 :id :data)
   (layers/convolutional 5 0 1 20)
   (layers/max-pooling 2 0 2)
   (layers/selu)
   (layers/convolutional 5 0 1 50)
   (layers/max-pooling 2 0 2)
   (layers/selu)
   (layers/linear 1000)
   (layers/dropout 0.4)
   (layers/linear num-classes)
   (layers/softmax :id :labels)])
```

| Activation | Epochs | Accuracy |
|------------|--------|----------|
| SELU       | 100    | 0.977    |
| ReLU       | 100    | 0.979    |

From the paper, it seems SELU would be more effective than ReLU for a deeper network. It might also be more effective with the SELU AlphaDropout implemented, which would be a future PR.

Again, feedback is most welcome. This is just an approach I thought would be interesting, but I'm not sure it fits with your vision or with other trade-offs you may have in mind.