pluskid / Mocha.jl

Deep Learning framework for Julia

Out of memory error #95

Open nikolaypavlov opened 9 years ago

nikolaypavlov commented 9 years ago

I'm constantly getting an "out of memory" error while doing a grid search. It seems like Mocha maps a huge amount of memory for the GPU backend. With the CPU backend everything is fine and memory usage is low. [screenshot: memory usage with the GPU backend]

Is this a bug? Is there any workaround?

julia> Pkg.status()
12 required packages:
 - CUBLAS                        0.0.1
 - CUDA                          0.1.0
 - CUDArt                        0.1.3
 - CUFFT                         0.0.3
 - Cairo                         0.2.29
 - Colors                        0.5.2
 - DataFrames                    0.6.9
 - HttpParser                    0.0.13
 - IJulia                        0.2.5
 - Images                        0.4.46
 - MLBase                        0.5.1
 - Mocha                         0.0.9+             master
30 additional packages:
 - ArrayViews                    0.6.3
 - BinDeps                       0.3.15
 - Blosc                         0.1.4
 - ColorTypes                    0.1.3
 - ColorVectorSpace              0.0.2
 - Compat                        0.6.0
 - DataArrays                    0.2.18
 - Dates                         0.3.2
 - Docile                        0.5.16
 - FixedPointNumbers             0.0.10
 - GZip                          0.2.17
 - Graphics                      0.1.0
 - HDF5                          0.5.5
 - HttpCommon                    0.1.2
 - Iterators                     0.1.8
 - JLD                           0.5.4
 - JSON                          0.4.5
 - Logging                       0.1.1
 - Nettle                        0.1.10
 - REPLCompletions               0.0.3
 - Reexport                      0.0.2
 - SHA                           0.1.1
 - SIUnits                       0.0.5
 - SortingAlgorithms             0.0.5
 - StatsBase                     0.7.1
 - StatsFuns                     0.1.2
 - TexExtensions                 0.0.2
 - URIParser                     0.0.7
 - ZMQ                           0.2.0
 - Zlib                          0.1.9

julia> versioninfo()
Julia Version 0.3.11
Commit 483dbf5 (2015-07-27 06:18 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
pluskid commented 9 years ago

Is the out-of-memory error from CPU memory or GPU memory? GPU memory is separate from CPU memory and is typically much smaller.

nikolaypavlov commented 9 years ago

Exact error message is the following:

31-Aug 02:31:47:INFO:root:Constructing net SVHN-train on GPUBackend...
31-Aug 02:31:47:INFO:root:Topological sorting 14 layers...
31-Aug 02:31:47:INFO:root:Setup layers...
31-Aug 02:31:49:INFO:root:Network constructed!
31-Aug 02:31:49:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
31-Aug 02:31:49:INFO:root:Topological sorting 9 layers...
31-Aug 02:31:49:INFO:root:Setup layers...
Out of memory
while loading In[28], in expression starting on line 1

 in cualloc at /home/quetzal/.julia/v0.3/Mocha/src/cuda/cuda.jl:80
 in make_blob at /home/quetzal/.julia/v0.3/Mocha/src/cuda/blob.jl:42
 in make_blob at /home/quetzal/.julia/v0.3/Mocha/src/blob.jl:102
 in ConvolutionLayerState at /home/quetzal/.julia/v0.3/Mocha/src/layers/convolution.jl:96
 in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers/convolution.jl:158
 in Net at /home/quetzal/.julia/v0.3/Mocha/src/net.jl:227
 in configure_training at In[14]:33
 in estfun at In[23]:6
 in gridtune at /home/quetzal/.julia/v0.3/MLBase/src/modeltune.jl:21

BTW, it only happens when I'm running a grid search, so the network needs to run at least once before it fails.

Here is the htop output: [screenshot: htop output with the GPU backend]

At the same time, the CPU backend's memory usage for the same network is very low: [screenshot: htop output with the CPU backend]

nikolaypavlov commented 9 years ago

Here is my video card info:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          7.0 / 6.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12288 MBytes (12884574208 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Clock rate:                                1216 MHz (1.22 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GTX TITAN X
Result = PASS
nikolaypavlov commented 9 years ago

OK, it seems I have found a workaround: I have to init and shut down the GPU backend between each new set of hyperparameters. The actual implementation looks awkward, though, because of how gridtune() works:

using MLBase

function evalfun(netInfo)
    pred = predict(netInfo[:valid_net], netInfo[:base_dir])
    model_performance = mean(pred .== netInfo[:validLabels])

    # Shut down the backend after evaluation to release GPU memory
    backend = netInfo[:net].backend
    shutdown(backend)
    return model_performance
end

function estfun(nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt, base_mom, base_lr, regu_coef)
    # Initialize a fresh backend for every hyperparameter set
    backend = use_gpu ? GPUBackend() : CPUBackend()
    init(backend)

    snapshot_dir = "snapshot_drop_conv_$(conv1_nfilt)_$(conv2_nfilt)_$(nunits_fc1)_$(nunits_fc2)_$(base_mom)_$(base_lr)_$(regu_coef)"
    net, train_net, valid_net, common_layers = configure_training(backend, nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt)
    solver = configure_solver(MAXITER, base_mom, base_lr, EPOCH, snapshot_dir, regu_coef)
    configure_coffebreaks(solver, train_net, valid_net, snapshot_dir)
    solve(solver, net)

    # Julia 0.3 Dict literal syntax
    model = {:net => net,
             :valid_net => valid_net,
             :base_dir => snapshot_dir,
             :validLabels => labelsInfoValid[:Labels],
             :common_layers => common_layers}

    return model
end

best_model, best_cfg, best_score = gridtune(estfun, evalfun, nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt, base_mom, base_lr, regu_coef; verbose=true)
pluskid commented 9 years ago

Can you try calling the function registry_reset(backend::Backend) instead of shutting down and re-initializing the backend, and see if that helps?
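Concretely, that would mean keeping a single backend alive for the whole grid search and resetting its parameter registry between hyperparameter sets. A sketch (assuming registry_reset is callable on the shared backend and that estfun no longer creates its own backend):

```julia
using Mocha

# Create and initialize the backend once, outside gridtune()
backend = GPUBackend()
init(backend)

function estfun(nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt,
                base_mom, base_lr, regu_coef)
    # Drop parameters registered by the previous hyperparameter set
    registry_reset(backend)
    # ... construct the nets and solver against the shared backend,
    # call solve(), and return the model dict as before ...
end
```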

nikolaypavlov commented 9 years ago

It doesn't work. It fails after several iterations.

08-Sep 08:25:09:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:10:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:10:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:10:INFO:root:Setup layers...
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:11:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:11:INFO:root:Setup layers...
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:11:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:11:INFO:root:Setup layers...
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:11:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:12:DEBUG:root:Init network SVHN-train
08-Sep 08:25:13:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:13:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/statistics.jld
08-Sep 08:25:13:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:13:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:13:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0 already exists
08-Sep 08:25:13:INFO:root:ITER = 000000:: TRAIN obj-val = 5.19213486:: LR = 0.00500000:: MOM = 0.95000000
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:  Accuracy (avg over 6220) = 1.5273%
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:  Accuracy (avg over 785) = 1.9108%
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:14:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld...
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:16:DEBUG:root:Entering solver loop
08-Sep 08:25:17:INFO:root:ITER = 000100:: TRAIN obj-val = 1.86780834:: LR = 0.00496283:: MOM = 0.95000000
08-Sep 08:25:18:INFO:root:ITER = 000200:: TRAIN obj-val = 1.40843296:: LR = 0.00492629:: MOM = 0.95000000
08-Sep 08:25:19:INFO:root:ITER = 000300:: TRAIN obj-val = 1.00604618:: LR = 0.00489037:: MOM = 0.95000000
08-Sep 08:25:20:INFO:root:ITER = 000400:: TRAIN obj-val = 0.94527847:: LR = 0.00485506:: MOM = 0.95000000
08-Sep 08:25:21:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:21:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.005, regu_coef=0.0] => 0.01910828025477707
08-Sep 08:25:22:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:22:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:22:DEBUG:root:Init network SVHN-train
08-Sep 08:25:22:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:22:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/statistics.jld
08-Sep 08:25:22:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:22:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:22:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0 already exists
08-Sep 08:25:22:INFO:root:ITER = 000000:: TRAIN obj-val = 5.18731737:: LR = 0.05000000:: MOM = 0.95000000
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:  Accuracy (avg over 6220) = 1.2540%
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:  Accuracy (avg over 785) = 1.5287%
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:22:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld...
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:22:DEBUG:root:Entering solver loop
08-Sep 08:25:23:INFO:root:ITER = 000100:: TRAIN obj-val = 3.63415885:: LR = 0.04962825:: MOM = 0.95000000
08-Sep 08:25:24:INFO:root:ITER = 000200:: TRAIN obj-val = 3.76893902:: LR = 0.04926289:: MOM = 0.95000000
08-Sep 08:25:25:INFO:root:ITER = 000300:: TRAIN obj-val = 3.72220206:: LR = 0.04890374:: MOM = 0.95000000
08-Sep 08:25:26:INFO:root:ITER = 000400:: TRAIN obj-val = 3.63981318:: LR = 0.04855064:: MOM = 0.95000000
08-Sep 08:25:27:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.05, regu_coef=0.0] => 0.015286624203821656
08-Sep 08:25:27:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:27:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Init network SVHN-train
08-Sep 08:25:27:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:27:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/statistics.jld
08-Sep 08:25:27:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:27:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:27:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0 already exists
08-Sep 08:25:27:INFO:root:ITER = 000000:: TRAIN obj-val = 4.78694630:: LR = 0.50000000:: MOM = 0.95000000
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:  Accuracy (avg over 6220) = 3.2797%
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:  Accuracy (avg over 785) = 2.0382%
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:27:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld...
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Entering solver loop
08-Sep 08:25:28:INFO:root:ITER = 000100:: TRAIN obj-val = 46.05174255:: LR = 0.49628251:: MOM = 0.95000000
08-Sep 08:25:29:INFO:root:ITER = 000200:: TRAIN obj-val = 46.05174255:: LR = 0.49262889:: MOM = 0.95000000
08-Sep 08:25:30:INFO:root:ITER = 000300:: TRAIN obj-val = 46.05174255:: LR = 0.48903741:: MOM = 0.95000000
08-Sep 08:25:31:INFO:root:ITER = 000400:: TRAIN obj-val = 46.05174255:: LR = 0.48550645:: MOM = 0.95000000
08-Sep 08:25:32:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:32:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.5, regu_coef=0.0] => 0.02038216560509554
08-Sep 08:25:32:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:32:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:32:DEBUG:root:Init network SVHN-train
08-Sep 08:25:32:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:32:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/statistics.jld
08-Sep 08:25:32:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:32:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:32:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001 already exists
08-Sep 08:25:32:INFO:root:ITER = 000000:: TRAIN obj-val = 4.99891996:: LR = 0.00500000:: MOM = 0.95000000
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:  Accuracy (avg over 6220) = 2.2669%
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:  Accuracy (avg over 785) = 2.2930%
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:33:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld...
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:33:DEBUG:root:Entering solver loop
08-Sep 08:25:34:INFO:root:ITER = 000100:: TRAIN obj-val = 2.00898957:: LR = 0.00496283:: MOM = 0.95000000
08-Sep 08:25:35:INFO:root:ITER = 000200:: TRAIN obj-val = 1.41978502:: LR = 0.00492629:: MOM = 0.95000000
08-Sep 08:25:36:INFO:root:ITER = 000300:: TRAIN obj-val = 0.98986107:: LR = 0.00489037:: MOM = 0.95000000
08-Sep 08:25:37:INFO:root:ITER = 000400:: TRAIN obj-val = 0.85035235:: LR = 0.00485506:: MOM = 0.95000000
08-Sep 08:25:37:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:37:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.005, regu_coef=0.0001] => 0.022929936305732482
08-Sep 08:25:37:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:38:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:38:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:38:INFO:root:Setup layers...
Out of memory
while loading In[14], in expression starting on line 1

 in launch at /home/quetzal/.julia/v0.3/Mocha/src/cuda/cuda.jl:80
 in setup_etc at /home/quetzal/.julia/v0.3/Mocha/src/cuda/layers/dropout.jl:14
 in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers/dropout.jl:42
 in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers.jl:110
 in Net at /home/quetzal/.julia/v0.3/Mocha/src/net.jl:227
 in configure_training at In[8]:25
 in estfun at In[13]:6
 in gridtune at /home/quetzal/.julia/v0.3/MLBase/src/modeltune.jl:21
nikolaypavlov commented 9 years ago

BTW, the shutdown() approach works well. The only issue: if some error occurs during the grid search, for example when the hyperparameters in a set are incompatible, you have to restart the IPython kernel to avoid "out of memory", because the backend was not shut down properly, and then you have to rerun everything. This is really annoying... :(
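One way to make the cleanup robust is to wrap each grid-search step in try/finally, so shutdown() runs even when a hyperparameter set throws. A sketch (train_and_eval is a hypothetical helper standing in for the net/solver setup above; with gridtune's estfun/evalfun split you would shut down after evaluation instead, but the pattern is the same):

```julia
function estfun(params...)
    backend = use_gpu ? GPUBackend() : CPUBackend()
    init(backend)
    try
        # Build the nets/solver and train, as in the snippet above
        return train_and_eval(backend, params...)  # hypothetical helper
    finally
        # Runs on success *and* on error, so the backend never leaks
        shutdown(backend)
    end
end
```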

nikolaypavlov commented 9 years ago

Just curious, could you make the backend initialization implicit? Why not hide init() and shutdown() inside functions like solve() and forward_epoch()?

pluskid commented 9 years ago

@nikolaypavlov I looked at your code snippet again. Did you call destroy on the nets after you finished training? If you did not, the resources (especially GPU memory) will not be released.

I'm sorry this is somewhat inconvenient, but Julia currently does not have good RAII support for managing resources automatically.
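So the evaluation step from the snippet earlier in the thread would release the nets explicitly before tearing down the backend, along these lines (a sketch, assuming destroy(net) is the release call as described above):

```julia
function evalfun(netInfo)
    pred = predict(netInfo[:valid_net], netInfo[:base_dir])
    score = mean(pred .== netInfo[:validLabels])

    # Capture the backend first, then free the GPU blobs held by
    # each net before shutting the backend down
    backend = netInfo[:net].backend
    destroy(netInfo[:net])
    destroy(netInfo[:valid_net])
    shutdown(backend)
    return score
end
```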

droidicus commented 9 years ago

@pluskid I am having a similar out-of-memory problem. Working with a ~500 MB data set, I found that if I set the batch size equal to the data size, the network would train successfully, but I would get out-of-memory errors during the post-training processing I was doing. I ended up having to lower the batch size to avoid the errors. My video card has 6 GB of VRAM, so a dataset of this size shouldn't be a problem.
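For reference, the batch size lives on the data layer, so lowering the per-batch GPU footprint is a one-line change. A sketch following the layout of the Mocha HDF5 tutorials (the layer name and source file are placeholders):

```julia
# Smaller batches mean less GPU memory per forward/backward pass,
# at the cost of more iterations per epoch.
data_layer = HDF5DataLayer(name="train-data", source="train.txt",
                           batch_size=64)  # halve this if OOM persists
```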

ashleylid commented 8 years ago

@droidicus Hi. Could you explain a little more how you chose your batch size relative to the data size?

I have 1000 data points and labels as inputs. I thought a smaller batch size of around 10 would be enough, but I'm still getting the out-of-memory error. My CPU implementation works fine, though (at least it loads the net).

colbec commented 6 years ago

FWIW, I also have this error on the official Julia 0.6 release with Mocha 0.3.1 on openSUSE 42.3. I have 15 GB of memory, and in the GNOME system monitor I can watch the project rapidly eat up all available memory while loading, then fail. The last message from Mocha before the error is "Network constructed!" My project adapts the convolution tutorial for MNIST data to my own set of 900x900 RGB PNG images, with m = 40 training examples and 40 test images. Reducing the batch size to 2 seems to have no effect. The same data runs correctly in Python Keras. Edit: I'm using the CPU-only backend.