pluskid / Mocha.jl

Deep Learning framework for Julia

Optimizers not landing where they should? #210

Closed: qfn closed this issue 8 years ago

qfn commented 8 years ago

I have constructed the following simple network with two inner product layers, one of which has a ReLU neuron:

using Mocha
srand(12345678)
############################################################
# Prepare Random Data
############################################################
N = 10000
M = 4
X = rand(M, N)
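# Target: 10*max(x1, 0.6); the additive Gaussian noise term has zero amplitude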
Y = 10*max(X[1,:],0.6) + 0.0*randn(size(X[1,:]))
##########
backend = DefaultBackend()
init(backend)
data_layer = MemoryDataLayer(batch_size=10000, data=Array[X[1,:],Y])
weight_layer = InnerProductLayer(name="ip", output_dim=1,  neuron=Neurons.LReLU(), tops=[:data1], bottoms=[:data])
weight_layer1 = InnerProductLayer(name="ip1", output_dim=1, tops=[:data2], bottoms=[:data1])
loss_layer = SquareLossLayer(name="loss", bottoms=[:data2, :label])
mem_layer = MemoryOutputLayer(name="output", bottoms=[:data2])
net = Net("TEST", backend, [loss_layer, mem_layer, weight_layer, weight_layer1, data_layer])
println(net)

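# Staged learning rate: 0.0001 for the first 6000 iterations, then 0.001 for the next 4000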
lr_policy = LRPolicy.Staged(
  (6000, LRPolicy.Fixed(0.0001)),
  (4000, LRPolicy.Fixed(0.001)),
)
method = SGD()
params = make_solver_parameters(method, regu_coef=0.0005, mom_policy=MomPolicy.Fixed(0.9), max_iter=10000, lr_policy=lr_policy)
solver = Solver(method, params)
add_coffee_break(solver, TrainingSummary(), every_n_iter=100)

solve(solver, net)

Since max(x, 0.6) = 0.6 + max(x - 0.6, 0), I was hoping the solver would be able to find an exactly matching solution. The part max(x - 0.6, 0) is the output of a ReLU unit with weight 1 and bias -0.6, so the whole network output should be 6 + 10*ReLU_output. However, regardless of the optimizer I use, the solver converges to a point that does not fit the data well. In fact, the network output it produces is a constant.
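For reference, here is a minimal standalone check of that decomposition with hand-picked weights (w1 = 1, b1 = -0.6 for the hidden unit, w2 = 10, b2 = 6 for the output layer; these are illustrative values, not taken from a trained model):

# Exact parameterization of y = 10*max(x, 0.6) with a single ReLU unit:
#   hidden: h = max(w1*x + b1, 0)   with w1 = 1.0,  b1 = -0.6
#   output: y = w2*h + b2           with w2 = 10.0, b2 = 6.0
relu(z) = max(z, 0)
target(x) = 10 * max(x, 0.6)
net_out(x) = 10.0 * relu(1.0 * x - 0.6) + 6.0
for x in 0:0.01:1
    @assert abs(target(x) - net_out(x)) < 1e-12
end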

Is there something I've missed here (perhaps in the setup of the network, or anything else), or are the solvers simply unable to find the true parameters in this case? Is there anything to say about general solver behavior for piecewise linear functions?

Thank you.

pluskid commented 8 years ago

@qfn First of all, neural network learning is a highly nonlinear, nonconvex problem. Currently there is no guarantee, or even evidence, that SGD will recover the exact underlying function, even when that function can be exactly parameterized by the network structure.

If you just want to learn to predict the max(x, 0.6) function reasonably well, you can try fitting with more hidden units instead of 1, and maybe even more hidden layers. NNs seem to be more powerful when over-parameterized.
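A minimal sketch of that suggestion, reusing the layer definitions from the original script; the hidden width of 16 is an arbitrary illustrative choice, not a specific recommendation:

# Wider hidden layer: 16 (leaky) ReLU units instead of 1 (width chosen arbitrarily)
weight_layer = InnerProductLayer(name="ip", output_dim=16, neuron=Neurons.LReLU(),
                                 tops=[:data1], bottoms=[:data])
# The second inner product layer still maps the hidden units down to a single output
weight_layer1 = InnerProductLayer(name="ip1", output_dim=1, tops=[:data2], bottoms=[:data1])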

qfn commented 8 years ago

@pluskid Thank you very much for your answer. The object I'm dealing with is an (unknown) function of several variables that is definitely non-linear and probably, to some extent, also non-convex. What I'm trying to do now is get an understanding of the possibilities and limitations of neural networks when applied to approximating such objects, and of how to structure the NNs to handle such functions.