nikolaypavlov opened 9 years ago
Is the out-of-memory error from CPU memory or GPU memory? GPU memory is separate from CPU memory and is typically much smaller.
The exact error message is the following:
31-Aug 02:31:47:INFO:root:Constructing net SVHN-train on GPUBackend...
31-Aug 02:31:47:INFO:root:Topological sorting 14 layers...
31-Aug 02:31:47:INFO:root:Setup layers...
31-Aug 02:31:49:INFO:root:Network constructed!
31-Aug 02:31:49:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
31-Aug 02:31:49:INFO:root:Topological sorting 9 layers...
31-Aug 02:31:49:INFO:root:Setup layers...
Out of memory
while loading In[28], in expression starting on line 1
in cualloc at /home/quetzal/.julia/v0.3/Mocha/src/cuda/cuda.jl:80
in make_blob at /home/quetzal/.julia/v0.3/Mocha/src/cuda/blob.jl:42
in make_blob at /home/quetzal/.julia/v0.3/Mocha/src/blob.jl:102
in ConvolutionLayerState at /home/quetzal/.julia/v0.3/Mocha/src/layers/convolution.jl:96
in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers/convolution.jl:158
in Net at /home/quetzal/.julia/v0.3/Mocha/src/net.jl:227
in configure_training at In[14]:33
in estfun at In[23]:6
in gridtune at /home/quetzal/.julia/v0.3/MLBase/src/modeltune.jl:21
BTW, it only happens when I'm running a grid search, so the network needs to be run at least once before it fails.
Here is also htop output:
At the same time the CPU backend memory usage for the same network is very low:
Here is also my video card info:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 7.0 / 6.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12288 MBytes (12884574208 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Clock rate: 1216 MHz (1.22 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GTX TITAN X
Result = PASS
Ok, it seems I have found a workaround: I have to init and shut down the GPU backend between each new set of hyperparameters. Though the actual implementation looks weird because of how gridtune() works:
using MLBase

function evalfun(netInfo)
    pred = predict(netInfo[:valid_net], netInfo[:base_dir])
    model_performance = mean(pred .== netInfo[:validLabels])
    backend = netInfo[:net].backend
    shutdown(backend)
    return model_performance
end

function estfun(nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt, base_mom, base_lr, regu_coef)
    backend = use_gpu ? GPUBackend() : CPUBackend()
    init(backend)
    snapshot_dir = "snapshot_drop_conv_$(conv1_nfilt)_$(conv2_nfilt)_$(nunits_fc1)_$(nunits_fc2)_$(base_mom)_$(base_lr)_$(regu_coef)"
    net, train_net, valid_net, common_layers = configure_training(backend, nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt)
    solver = configure_solver(MAXITER, base_mom, base_lr, EPOCH, snapshot_dir, regu_coef)
    configure_coffebreaks(solver, train_net, valid_net, snapshot_dir)
    solve(solver, net)
    model = {:net => net,
             :valid_net => valid_net,
             :base_dir => snapshot_dir,
             :validLabels => labelsInfoValid[:Labels],
             :common_layers => common_layers}
    return model
end

best_model, best_cfg, best_score = gridtune(estfun, evalfun, nunits_fc1, nunits_fc2, conv1_nfilt, conv2_nfilt, base_mom, base_lr, regu_coef; verbose=true)
Can you try calling the function registry_reset(backend::Backend) instead of shutdown and init, to see if that helps?
It doesn't work. It fails after several iterations.
08-Sep 08:25:09:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:10:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:10:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:10:INFO:root:Setup layers...
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:11:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:11:INFO:root:Setup layers...
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:11:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:11:INFO:root:Setup layers...
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:11:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:11:INFO:root:Network constructed!
08-Sep 08:25:11:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:11:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:12:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:12:DEBUG:root:Init network SVHN-train
08-Sep 08:25:13:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:13:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/statistics.jld
08-Sep 08:25:13:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:13:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:13:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0 already exists
08-Sep 08:25:13:INFO:root:ITER = 000000:: TRAIN obj-val = 5.19213486:: LR = 0.00500000:: MOM = 0.95000000
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root: Accuracy (avg over 6220) = 1.5273%
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root: Accuracy (avg over 785) = 1.9108%
08-Sep 08:25:14:INFO:root:---------------------------------------------------------
08-Sep 08:25:14:INFO:root:
08-Sep 08:25:14:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:14:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld...
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:14:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:16:DEBUG:root:Entering solver loop
08-Sep 08:25:17:INFO:root:ITER = 000100:: TRAIN obj-val = 1.86780834:: LR = 0.00496283:: MOM = 0.95000000
08-Sep 08:25:18:INFO:root:ITER = 000200:: TRAIN obj-val = 1.40843296:: LR = 0.00492629:: MOM = 0.95000000
08-Sep 08:25:19:INFO:root:ITER = 000300:: TRAIN obj-val = 1.00604618:: LR = 0.00489037:: MOM = 0.95000000
08-Sep 08:25:20:INFO:root:ITER = 000400:: TRAIN obj-val = 0.94527847:: LR = 0.00485506:: MOM = 0.95000000
08-Sep 08:25:21:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0/snapshot-000000.jld
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:21:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:21:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.005, regu_coef=0.0] => 0.01910828025477707
08-Sep 08:25:22:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:22:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:22:INFO:root:Setup layers...
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:22:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:22:INFO:root:Network constructed!
08-Sep 08:25:22:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:22:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:22:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:22:DEBUG:root:Init network SVHN-train
08-Sep 08:25:22:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:22:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/statistics.jld
08-Sep 08:25:22:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:22:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:22:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0 already exists
08-Sep 08:25:22:INFO:root:ITER = 000000:: TRAIN obj-val = 5.18731737:: LR = 0.05000000:: MOM = 0.95000000
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root: Accuracy (avg over 6220) = 1.2540%
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root: Accuracy (avg over 785) = 1.5287%
08-Sep 08:25:22:INFO:root:---------------------------------------------------------
08-Sep 08:25:22:INFO:root:
08-Sep 08:25:22:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:22:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld...
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:22:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:22:DEBUG:root:Entering solver loop
08-Sep 08:25:23:INFO:root:ITER = 000100:: TRAIN obj-val = 3.63415885:: LR = 0.04962825:: MOM = 0.95000000
08-Sep 08:25:24:INFO:root:ITER = 000200:: TRAIN obj-val = 3.76893902:: LR = 0.04926289:: MOM = 0.95000000
08-Sep 08:25:25:INFO:root:ITER = 000300:: TRAIN obj-val = 3.72220206:: LR = 0.04890374:: MOM = 0.95000000
08-Sep 08:25:26:INFO:root:ITER = 000400:: TRAIN obj-val = 3.63981318:: LR = 0.04855064:: MOM = 0.95000000
08-Sep 08:25:27:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.05_0.0/snapshot-000000.jld
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.05, regu_coef=0.0] => 0.015286624203821656
08-Sep 08:25:27:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:27:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:27:INFO:root:Setup layers...
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:27:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:27:INFO:root:Network constructed!
08-Sep 08:25:27:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:27:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Init network SVHN-train
08-Sep 08:25:27:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:27:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/statistics.jld
08-Sep 08:25:27:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:27:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:27:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0 already exists
08-Sep 08:25:27:INFO:root:ITER = 000000:: TRAIN obj-val = 4.78694630:: LR = 0.50000000:: MOM = 0.95000000
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root: Accuracy (avg over 6220) = 3.2797%
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root: Accuracy (avg over 785) = 2.0382%
08-Sep 08:25:27:INFO:root:---------------------------------------------------------
08-Sep 08:25:27:INFO:root:
08-Sep 08:25:27:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:27:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld...
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:27:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:27:DEBUG:root:Entering solver loop
08-Sep 08:25:28:INFO:root:ITER = 000100:: TRAIN obj-val = 46.05174255:: LR = 0.49628251:: MOM = 0.95000000
08-Sep 08:25:29:INFO:root:ITER = 000200:: TRAIN obj-val = 46.05174255:: LR = 0.49262889:: MOM = 0.95000000
08-Sep 08:25:30:INFO:root:ITER = 000300:: TRAIN obj-val = 46.05174255:: LR = 0.48903741:: MOM = 0.95000000
08-Sep 08:25:31:INFO:root:ITER = 000400:: TRAIN obj-val = 46.05174255:: LR = 0.48550645:: MOM = 0.95000000
08-Sep 08:25:32:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.5_0.0/snapshot-000000.jld
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:32:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.5, regu_coef=0.0] => 0.02038216560509554
08-Sep 08:25:32:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-train-prediction on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:INFO:root:Constructing net SVHN-validation-prediction on GPUBackend...
08-Sep 08:25:32:INFO:root:Topological sorting 9 layers...
08-Sep 08:25:32:INFO:root:Setup layers...
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv1): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:ConvolutionLayer(conv2): sharing filters and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip1): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip2): sharing weights and bias
08-Sep 08:25:32:DEBUG:root:InnerProductLayer(ip3): sharing weights and bias
08-Sep 08:25:32:INFO:root:Network constructed!
08-Sep 08:25:32:DEBUG:root:Checking network topology for back-propagation
08-Sep 08:25:32:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:32:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:32:DEBUG:root:Init network SVHN-train
08-Sep 08:25:32:DEBUG:root:Initializing coffee breaks
08-Sep 08:25:32:INFO:root:Merging existing coffee lounge statistics in snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/statistics.jld
08-Sep 08:25:32:DEBUG:root:Init network SVHN-train-prediction
08-Sep 08:25:32:DEBUG:root:Init network SVHN-validation-prediction
08-Sep 08:25:32:INFO:root:Snapshot directory snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001 already exists
08-Sep 08:25:32:INFO:root:ITER = 000000:: TRAIN obj-val = 4.99891996:: LR = 0.00500000:: MOM = 0.95000000
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root: Accuracy (avg over 6220) = 2.2669%
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:## Performance on Validation Set after 0 iterations
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root: Accuracy (avg over 785) = 2.2930%
08-Sep 08:25:33:INFO:root:---------------------------------------------------------
08-Sep 08:25:33:INFO:root:
08-Sep 08:25:33:INFO:root:Saving snapshot to snapshot-000000.jld...
08-Sep 08:25:33:WARNING:root:Overwriting snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld...
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer conv1
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer conv2
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip1
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip2
08-Sep 08:25:33:DEBUG:root:Saving parameters for layer ip3
08-Sep 08:25:33:DEBUG:root:Entering solver loop
08-Sep 08:25:34:INFO:root:ITER = 000100:: TRAIN obj-val = 2.00898957:: LR = 0.00496283:: MOM = 0.95000000
08-Sep 08:25:35:INFO:root:ITER = 000200:: TRAIN obj-val = 1.41978502:: LR = 0.00492629:: MOM = 0.95000000
08-Sep 08:25:36:INFO:root:ITER = 000300:: TRAIN obj-val = 0.98986107:: LR = 0.00489037:: MOM = 0.95000000
08-Sep 08:25:37:INFO:root:ITER = 000400:: TRAIN obj-val = 0.85035235:: LR = 0.00485506:: MOM = 0.95000000
08-Sep 08:25:37:INFO:root:Loading existing model from snapshot_drop_conv_48_128_2400_1200_0.95_0.005_0.0001/snapshot-000000.jld
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer conv1
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer conv2
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip1
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip2
08-Sep 08:25:37:DEBUG:root:Loading parameters for layer ip3
08-Sep 08:25:37:DEBUG:root:Init network SVHN-validation-prediction
[nunits_fc1=2400, nunits_fc2=1200, conv1_nfilt=48, conv2_nfilt=128, base_mom=0.95, base_lr=0.005, regu_coef=0.0001] => 0.022929936305732482
08-Sep 08:25:37:INFO:root:Initializing CuDNN backend...
INFO: CuDNN backend initialized!
08-Sep 08:25:38:INFO:root:Constructing net SVHN-train on GPUBackend...
08-Sep 08:25:38:INFO:root:Topological sorting 12 layers...
08-Sep 08:25:38:INFO:root:Setup layers...
Out of memory
while loading In[14], in expression starting on line 1
in launch at /home/quetzal/.julia/v0.3/Mocha/src/cuda/cuda.jl:80
in setup_etc at /home/quetzal/.julia/v0.3/Mocha/src/cuda/layers/dropout.jl:14
in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers/dropout.jl:42
in setup at /home/quetzal/.julia/v0.3/Mocha/src/layers.jl:110
in Net at /home/quetzal/.julia/v0.3/Mocha/src/net.jl:227
in configure_training at In[8]:25
in estfun at In[13]:6
in gridtune at /home/quetzal/.julia/v0.3/MLBase/src/modeltune.jl:21
BTW, the shutdown() approach works well. The only problem: if you hit an error during the grid search, for example when some hyperparameters in a set are incompatible, you have to restart the IPython kernel to avoid "out of memory", because the backend was not shut down properly, and then you have to rerun everything. This is really annoying... :(
Just curious, can you make the backend initialization implicit? Why not hide init() and shutdown() inside functions like solve() and forward_epoch()?
@nikolaypavlov I looked at your code snippet again. Did you call destroy on the net after you finish training? If you did not, the resources will not get released (esp. GPU memory).
I'm sorry this is kind of inconvenient, but currently Julia does not have very good RAII for managing resources automatically.
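Since the cleanup has to be explicit, one way to make it reliable is a try/finally block, so the release runs even when a bad hyperparameter set throws mid-trial. A minimal sketch with stand-in types and functions (FakeBackend, init_backend, train!, release! are hypothetical illustrations, not Mocha API):

```julia
# Sketch: explicit resource cleanup without RAII, via try/finally.
# FakeBackend stands in for a real GPU backend handle.
mutable struct FakeBackend
    alive::Bool
end

released = FakeBackend[]  # record of every backend we cleaned up

init_backend() = FakeBackend(true)
release!(b::FakeBackend) = (b.alive = false; push!(released, b))

function train!(b::FakeBackend, lr)
    lr > 1.0 && error("incompatible hyperparameters")  # simulate a bad config
    return 1.0 / lr                                    # pretend score
end

function run_trial(lr)
    backend = init_backend()
    try
        return train!(backend, lr)
    finally
        release!(backend)  # always runs, even if train! throws
    end
end

score = run_trial(0.5)  # normal case: backend is released
try
    run_trial(2.0)      # error case: backend is still released
catch
end
```

With the real API, the finally block would be where destroy(net) and shutdown(backend) go, so a failed trial can't leak GPU memory into the next one.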
@pluskid I am having a similar Out of Memory problem. Working with a ~500 MB data set, I found that if I set the batch size to be the same as the data size, the network would train successfully, but I would get Out of Memory errors in the post-training processing I was doing. I ended up needing to lower the batch size to avoid the errors. My video card has 6 GB of VRAM, so a dataset of this size shouldn't be a big problem.
@droidicus Hi. Could you explain a little more how you chose your batch size relative to the data?
I have 1000 data points and labels as inputs. I was thinking I needed a smaller batch size, like 10, but I'm still getting the Out of Memory error. My CPU implementation works fine though (at least it loads the net).
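One reason batch size matters so much here: every layer keeps its activations (and, during training, gradients too) for each example in the batch, so memory scales with batch size times the sum of layer output sizes, not with the raw dataset size. A back-of-envelope estimate (the layer shapes below are made up for illustration, not taken from the networks in this thread):

```julia
# Back-of-envelope GPU memory estimate for one batch (Float32 = 4 bytes).
# Layer output shapes (width, height, channels) are illustrative only.
layer_shapes = [(32, 32, 3),    # input image
                (28, 28, 48),   # conv1 output
                (14, 14, 128)]  # conv2 output after pooling

bytes_per_example = sum(w * h * c * 4 for (w, h, c) in layer_shapes)

# ×2 as a crude allowance for gradients held alongside activations
batch_mb(batch_size) = batch_size * bytes_per_example * 2 / 1024^2

println(batch_mb(10))    # small batch: a few MB
println(batch_mb(1000))  # whole dataset as one batch: hundreds of MB
```

A realistic network has many more (and larger) layers plus parameter buffers, so the true footprint is much bigger; but the scaling is the point: going from a batch of 10 to a batch of 1000 multiplies the activation memory by 100.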
FWIW, I also have this error on the Julia 0.6 official release, Mocha 0.3.1, on openSUSE 42.3. I have 15 GB of memory, and in the GNOME system utility I can watch the project, while loading, rapidly eat up all available memory and fail. The last message from Mocha before the error is "Network constructed!" My project adapts the convolution tutorial exercise with MNIST data to my own set of 900x900 RGB PNG images, with m = 40 training examples and 40 test images. Reducing the batch size to 2 seems to have no effect. The same data runs correctly in Python Keras. Edit: using the CPU-only backend.
I'm constantly getting an "out of memory" error while doing a grid search. It seems like Mocha allocates tons of memory for the GPU backend. If I'm using the CPU backend, everything is fine and memory usage is low.
Is this a bug? Is there any workaround?