pluskid / Mocha.jl

Deep Learning framework for Julia

InexactError when training "LeNet" on 1d image data #190

Open cinvro opened 8 years ago

cinvro commented 8 years ago

I am new to Mocha, and I am trying to adapt the LeNet tutorial to my 1d image dataset. Basically, I only changed the kernel and stride sizes slightly, as follows:


data_layer  = AsyncHDF5DataLayer(name="data", source="data/train.txt", batch_size=64, shuffle=true)
conv_layer  = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,1), bottoms=[:data], tops=[:conv])
pool_layer  = PoolingLayer(name="pool1", kernel=(2,1), stride=(2,1), bottoms=[:conv], tops=[:pool])
conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,1), bottoms=[:pool], tops=[:conv2])
pool2_layer = PoolingLayer(name="pool2", kernel=(2,1), stride=(2,1), bottoms=[:conv2], tops=[:pool2])
fc1_layer   = InnerProductLayer(name="ip1", output_dim=500, neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1])
fc2_layer   = InnerProductLayer(name="ip2", output_dim=2, bottoms=[:ip1], tops=[:ip2])
loss_layer  = SoftmaxLossLayer(name="loss", bottoms=[:ip2,:label])

After the network is constructed, I get the following error message:

04-Apr 23:17:53:INFO:root:## Performance on Validation Set after 0 iterations
04-Apr 23:17:53:INFO:root:---------------------------------------------------------
04-Apr 23:17:53:INFO:root:  Accuracy (avg over 15300) = 93.8627%
04-Apr 23:17:53:INFO:root:---------------------------------------------------------
04-Apr 23:17:53:INFO:root:
04-Apr 23:17:54:DEBUG:root:#DEBUG Entering solver loop
ERROR: LoadError: InexactError()
 in max_pooling_forward at /Users/cinvro/.julia/v0.4/Mocha/src/layers/pooling/julia-impl.jl:34
 in forward at /Users/cinvro/.julia/v0.4/Mocha/src/layers/pooling.jl:93
 in forward at /Users/cinvro/.julia/v0.4/Mocha/src/layers/pooling.jl:84
 in forward at /Users/cinvro/.julia/v0.4/Mocha/src/net.jl:148
 in onestep_solve at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:222
 in do_solve_loop at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:242
 in solve at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:235
 in include at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in include_from_node1 at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in process_options at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in _start at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib

Any idea why this happens?


My net looks like this:

[image: net]

pluskid commented 8 years ago

The line of code reporting InexactError is this line: https://github.com/pluskid/Mocha.jl/blob/master/src/layers/pooling/julia-impl.jl#L34

It is trying to assign a value to the mask, which is unsigned. If you try to assign an invalid value (e.g. a negative value), an InexactError will occur. My guess was that the pooling range somehow goes out of bounds, producing a negative value there. But looking at the visualization you pasted above, the network seems perfectly valid. Can you try inserting a print statement

println((maxh-1) * width + maxw-1)

right before that line to see what value we got that caused the error?
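For illustration, the failure mode described here can be reproduced outside Mocha. This is a minimal sketch, assuming the mask behaves like an array with an unsigned integer element type; the name `mask`, the `UInt` element type and the value `-1` are stand-ins, not taken from Mocha's source.

```julia
# Sketch only: `mask` and `bad_index` are hypothetical stand-ins.
mask = zeros(UInt, 4)      # assumed: unsigned element type, as described above
bad_index = -1             # any negative value triggers the same failure

# Storing into the array converts the value to UInt, which cannot
# represent a negative number, so Julia throws InexactError.
err = try
    mask[1] = bad_index
    nothing
catch e
    e
end

println(err isa InexactError)
```

Since the store fails before anything is written, `mask[1]` keeps its initial value of zero.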

cinvro commented 8 years ago

@pluskid you are right, I got -180, where maxh=0, maxw=0 and width=179. What does that mean? Is that a problem with my data or a bug?

pluskid commented 8 years ago

It seems like some pooling region is empty. Just as a sanity check, can you change the kernel for the pooling layer from (2,1) to larger values like (3,1) to see if it runs? Thanks!

cinvro commented 8 years ago

Thank you for the reply. Yes, I tried that. I got the following error after changing the kernel size of the pooling layer from (2,1) to (3,1).

ERROR: LoadError: AssertionError: is_similar_shape(params[j],net.states[i].parameters[j].blob)
 in load_network at /Users/cinvro/.julia/v0.4/Mocha/src/utils/io.jl:102
 in anonymous at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:158
 in jldopen at /Users/cinvro/.julia/v0.4/JLD/src/JLD.jl:245
 in load_snapshot at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:157
 in init_solve at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:184
 in solve at /Users/cinvro/.julia/v0.4/Mocha/src/solvers.jl:234
 in include at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in include_from_node1 at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in process_options at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib
 in _start at /Applications/Julia-0.4.3.app/Contents/Resources/julia/lib/julia/sys.dylib

pluskid commented 8 years ago

@cinvro That is due to previously saved snapshots. Can you remove the saved snapshot files and try again? Thanks!

cinvro commented 8 years ago

@pluskid oh, I didn't realize that. Now I get -179, where maxh=0, maxw=0 and width=178.

pluskid commented 8 years ago

@cinvro I checked the code and did not find the bug. It seems the body of the pooling loop never updates the maximum (otherwise maxh and maxw would not be zero). Can you print, at the same place, the values of hstart, hend, wstart, wend as well as val and maxval? One potential problem is that your matrix contains NaN. In this case, NaN > -Inf is false, so the pooling never finds a maximum.
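One way to test the NaN hypothesis is to scan the array for non-finite entries. This is a minimal sketch in plain Julia; `count_nonfinite` is a hypothetical helper and `sample` is made-up data, neither comes from Mocha or from the dataset in this issue.

```julia
# Hypothetical helper: count NaN and infinite entries in a numeric array.
function count_nonfinite(x)
    n_nan = 0
    n_inf = 0
    for v in x
        if isnan(v)
            n_nan += 1
        elseif isinf(v)
            n_inf += 1
        end
    end
    return n_nan, n_inf
end

sample = [0.5 NaN; -Inf 1.0]        # made-up data for illustration
println(count_nonfinite(sample))    # (number of NaN, number of +/-Inf)

# The comparison the pooling relies on fails for both kinds of value:
println(NaN > -Inf)
println(-Inf > -Inf)
```

Both comparisons print `false`, so an entry that is NaN or -Inf can never replace an initial maximum of -Inf.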

cinvro commented 8 years ago

@pluskid I got hstart=1,hend=1,wstart=89,wend=90 and maxval=-Inf. I cannot print out val because it says val is undefined, which is very strange.

cinvro commented 8 years ago

However, I can print out val inside the for loop, which gives me val = -Inf in this case.
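The reported values are consistent with a window in which no entry is strictly greater than the initial maximum. The sketch below is a simplified window scan written for illustration, not Mocha's actual code; the variable names follow the ones quoted in this thread, and the input is a made-up 1 x 178 array filled with -Inf (178 being the last width reported). It also shows why `val` is undefined after the loop: in Julia, a variable first assigned inside a `for` body is local to that loop.

```julia
# Simplified window scan for illustration; NOT Mocha's implementation.
function window_argmax(input, hstart, hend, wstart, wend)
    maxval = -Inf
    maxh, maxw = 0, 0
    for w in wstart:wend, h in hstart:hend
        val = input[h, w]        # `val` exists only inside the loop body
        if val > maxval          # false when val is NaN or -Inf
            maxval = val
            maxh, maxw = h, w
        end
    end
    return maxval, maxh, maxw
end

width = 178
input = fill(-Inf, 1, width)     # made-up input: every entry is -Inf
maxval, maxh, maxw = window_argmax(input, 1, 1, 89, 90)

# Same index expression as the println suggested earlier in the thread.
index = (maxh - 1) * width + maxw - 1
println((maxval, maxh, maxw, index))
```

With maxh and maxw left at zero, the index expression evaluates to -179, the value reported for width=178.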

davidparks21 commented 8 years ago

I can reproduce this error when I do not set the neuron property on the convolutional layer. It took me a while to narrow it down, but once I set neuron=Neurons.ReLU() on the convolutional layer the InexactError (NaN value for maxval in function max_pooling_forward) went away.

I see that the code posted here also doesn't have a neuron defined on the convolutional layer, so I suspect the same is the case here.
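Applied to the first convolution layer from the original post, the workaround described above might look like the fragment below. This is a sketch that assumes Mocha is installed and only adds the `neuron` property to the layer definition quoted earlier; whether it resolves the original poster's case is not confirmed in this thread.

```julia
using Mocha

# Same conv1 definition as in the original post, with a ReLU neuron added,
# following the workaround reported in the comment above.
conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,1),
                              neuron=Neurons.ReLU(),
                              bottoms=[:data], tops=[:conv])
```

The same property would presumably be added to conv2 as well, since it also feeds a pooling layer.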