mila-iqia / blocks

A Theano framework for building and training neural networks

Rectifier shouldn't use tensor.switch #960

Closed janchorowski closed 8 years ago

janchorowski commented 8 years ago

The Rectifier brick uses tensor.switch (https://github.com/mila-udem/blocks/blob/master/blocks/bricks/simple.py#L302); a simple benchmark shows this is considerably slower than tensor.maximum. What is the rationale for this? Maybe we should switch to the faster tensor.maximum?

Microbenchmark (TitanX):

    # Run in IPython (%timeit is an IPython magic).
    import numpy as np
    import theano
    import theano.tensor as TT
    from theano import printing

    # Random inputs in [-1, 1) so roughly half the entries are negative.
    Xval = np.random.rand(2000, 1000).astype('float32') * 2 - 1

    def getfun(variant):
        X = theano.shared(Xval, name='X')
        # Three equivalent ways of writing the rectifier.
        Y = {
            0: TT.switch(X > 0, X, 0),
            1: TT.maximum(X, 0),
            2: (X > 0) * X,
        }[variant]
        r = Y.sum()
        gs = theano.grad(r, X).sum()
        return theano.function([], [r, gs])

    for variant in [0, 1, 2]:
        print("Testing method %s"
              % {0: 'switch', 1: 'maximum', 2: 'multiplication'}[variant])
        f = getfun(variant)
        printing.debugprint(f)
        %timeit f()
        print("")

Gives:

Testing method switch
HostFromGpu [id A] ''   8
 |GpuCAReduce{add}{1,1} [id B] ''   7
   |GpuElemwise{Switch}[(0, 0)] [id C] ''   6
     |GpuFromHost [id D] ''   5
     | |Elemwise{Cast{float32}} [id E] ''   3
     |   |Elemwise{gt,no_inplace} [id F] ''   1
     |     |HostFromGpu [id G] ''   0
     |     | |X [id H]
     |     |TensorConstant{(1, 1) of 0} [id I]
     |X [id H]
     |CudaNdarrayConstant{Shape: (1, 1)
[[ 0.]]} [id J]
Sum{acc_dtype=float64} [id K] ''   4
 |Elemwise{Switch} [id L] ''   2
   |Elemwise{gt,no_inplace} [id F] ''   1
   |TensorConstant{(1, 1) of 1.0} [id M]
   |TensorConstant{(1, 1) of 0.0} [id N]

10 loops, best of 3: 18.7 ms per loop

Testing method maximum
HostFromGpu [id A] ''   3
 |GpuCAReduce{add}{1,1} [id B] ''   1
   |GpuElemwise{maximum,no_inplace} [id C] ''   0
     |X [id D]
     |CudaNdarrayConstant{Shape: (1, 1)
[[ 0.]]} [id E]
HostFromGpu [id F] ''   5
 |GpuCAReduce{add}{1,1} [id G] ''   4
   |GpuElemwise{Composite{Identity(EQ(i0, i1))}}[(0, 0)] [id H] ''   2
     |GpuElemwise{maximum,no_inplace} [id C] ''   0
     |X [id D]

100 loops, best of 3: 7.1 ms per loop

Testing method multiplication
HostFromGpu [id A] ''   8
 |GpuCAReduce{add}{1,1} [id B] ''   7
   |GpuElemwise{Mul}[(0, 0)] [id C] ''   6
     |GpuFromHost [id D] ''   5
     | |Elemwise{Cast{float32}} [id E] ''   3
     |   |Elemwise{gt,no_inplace} [id F] ''   1
     |     |HostFromGpu [id G] ''   0
     |     | |X [id H]
     |     |TensorConstant{(1, 1) of 0} [id I]
     |X [id H]
Sum{acc_dtype=float64} [id J] ''   4
 |Elemwise{Identity} [id K] ''   2
   |Elemwise{gt,no_inplace} [id F] ''   1

100 loops, best of 3: 10.9 ms per loop

dwf commented 8 years ago

@nouiz Thoughts?

This is such a common function that maybe Theano should encode the best-practice way of implementing a rectifier in tensor.nnet.

nouiz commented 8 years ago

This is pretty complicated. We added theano.tensor.nnet.relu() to Theano to encode what we recommend.

The benchmark should not include host-GPU transfers; with them the timings are not as meaningful. The fastest variant was not the same on the CPU and on the GPU, and the forward pass and the gradient did not have the same fastest implementation either... The one in tensor.nnet.relu() is the best compromise we found.

The best choice also depends on what surrounds the relu in the graph. If there are other elemwise operations nearby, such as the addition of a bias, that triggers even more cases that can change which variant is fastest.

In any case, why do you want to spend more time optimizing this? What percentage of the runtime is actually spent in the relu? I expect it is very small, so I don't think it is worth optimizing further.
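
For context, theano.tensor.nnet.relu avoids the boolean mask that tensor.switch builds; for the default alpha=0 it reduces to an abs-based formulation roughly like the sketch below (paraphrased from memory, not a verbatim copy of Theano's source):

    import theano.tensor as TT

    def relu_sketch(x, alpha=0):
        # alpha=0 gives the plain rectifier max(x, 0), written with abs()
        # so both the forward pass and the gradient lower to single
        # elemwise ops, with no boolean mask or host-GPU transfer.
        if alpha == 0:
            return 0.5 * (x + abs(x))
        # Leaky variant: f1*x + f2*|x| equals x for x > 0, alpha*x otherwise.
        f1 = 0.5 * (1 + alpha)
        f2 = 0.5 * (1 - alpha)
        return f1 * x + f2 * abs(x)

    x = TT.matrix('x')
    y = relu_sketch(x)  # same values as TT.maximum(x, 0)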

nouiz commented 8 years ago

Also, cuDNN has a relu operation. If you really want to optimize this on the GPU, someone should wrap it. But I don't think it will make a big difference.

dmitriy-serdyuk commented 8 years ago

In this case we should use tensor.nnet.relu in our brick. It should be an easy task for CCW.
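
For concreteness, the one-line change would look roughly like this (a sketch: the brick is paraphrased from blocks/bricks/simple.py rather than copied verbatim):

    from theano import tensor
    from blocks.bricks import Activation
    from blocks.bricks.base import application

    class Rectifier(Activation):
        @application(inputs=['input_'], outputs=['output'])
        def apply(self, input_):
            # was: return tensor.switch(input_ > 0, input_, 0)
            return tensor.nnet.relu(input_)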

rizar commented 8 years ago

Isn't it enough to change one single line? That's too much for a CCW ticket!


rizar commented 8 years ago

Oops, I meant too little.


dmitriy-serdyuk commented 8 years ago

Well, someone still needs to benchmark it, so there is some work involved.

janchorowski commented 8 years ago

I can do it; I actually had a pathological net with ACDC layers that spent 10% of its time in memory copies induced by tensor.switch.

rizar commented 8 years ago

That would be cool, but in that case please don't let it become another long-standing ticket that we have to keep in mind for three months :)

rizar commented 8 years ago

@janchorowski , ping.

dwf commented 8 years ago

Okay, this ticket really needs to be addressed soon; this is getting a little ridiculous. We should switch to tensor.nnet.relu and push any discussion of speed upstream. This is what both Lasagne and Keras do (together they have roughly 10x our users, using GitHub stars as a proxy), so it's probably a pretty reasonable thing to do.