szagoruyko / cifar.torch

92.45% on CIFAR-10 in Torch
http://torch.ch/blog/2015/07/30/cifar.html
MIT License

Out of Memory Issues when Training #14

Open ghost opened 8 years ago

ghost commented 8 years ago

I'm following the steps described in this blog entry in order to run the CIFAR classification. Preprocessing with Provider.lua works fine, but training fails due to memory problems.

When running the regular command CUDA_VISIBLE_DEVICES=0 th train.lua, I get the following output:

{
  learningRate : 1
  momentum : 0.9
  epoch_step : 25
  learningRateDecay : 1e-07
  batchSize : 128
  model : "vgg_bn_drop"
  save : "logs"
  weightDecay : 0.0005
  backend : "nn"
  max_epoch : 300
}
==> configuring model   
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu line=41 error=2 : out of memory
/Users/artcfa/torch/install/bin/luajit: /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu:41
stack traceback:
    [C]: in function 'resize'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'type'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'cuda'
    train.lua:47: in main chunk
    [C]: in function 'dofile'
    ...edja/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0107712bd0

So apparently CUDA reports that it is out of memory. I compiled the NVIDIA CUDA samples and ran the deviceQuery sample to get some stats on the device:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS

So I thought maybe 1 GB of total available CUDA memory simply wasn't enough. I modified the sample code so it could run on the CPU without using CUDA (which you can find here) and now it starts up and begins training.
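For reference, the change boils down to keeping the model in float tensors instead of casting it to CUDA; a minimal sketch of the relevant part of train.lua (the exact layout may differ from the repo, and nn.BatchFlip and opt come from the stock script):

-- CPU-only variant of the model setup: keep everything as torch.FloatTensor
-- instead of calling :cuda()
local model = nn.Sequential()
model:add(nn.BatchFlip():float())
model:add(nn.Copy('torch.FloatTensor', 'torch.FloatTensor'))  -- identity copy, no CudaTensor
model:add(dofile('models/'..opt.model..'.lua'):float())

-- criterion and mini-batches likewise stay on the CPU
local criterion = nn.CrossEntropyCriterion():float()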

However, after around 9 / 390 batches of the first epoch, it crashes with the message luajit: not enough memory, so that doesn't appear to work either.

Am I doing something wrong? What can I do to run this?

szagoruyko commented 8 years ago

try this branch: https://github.com/szagoruyko/cifar.torch/tree/cpu

ghost commented 8 years ago

Tried it, same result as with my custom CPU code linked above. I started it with the following command line:

CUDA_VISIBLE_DEVICES=0 th train.lua --type float

Resulting in the following output:

{
  learningRate : 1
  type : "float"
  momentum : 0.9
  epoch_step : 25
  learningRateDecay : 1e-07
  batchSize : 128
  model : "vgg_bn_drop"
  save : "logs"
  backend : "nn"
  weightDecay : 0.0005
  max_epoch : 300
}
==> configuring model   
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> output]
  (1): nn.BatchFlip
  (2): nn.Copy
  (3): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> (47) -> (48) -> (49) -> (50) -> (51) -> (52) -> (53) -> (54) -> output]
    (1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
    (2): nn.SpatialBatchNormalization
    (3): nn.ReLU
    (4): nn.Dropout(0.300000)
    (5): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
    (6): nn.SpatialBatchNormalization
    (7): nn.ReLU
    (8): nn.SpatialMaxPooling(2x2, 2,2)
    (9): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
    (10): nn.SpatialBatchNormalization
    (11): nn.ReLU
    (12): nn.Dropout(0.400000)
    (13): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
    (14): nn.SpatialBatchNormalization
    (15): nn.ReLU
    (16): nn.SpatialMaxPooling(2x2, 2,2)
    (17): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (18): nn.SpatialBatchNormalization
    (19): nn.ReLU
    (20): nn.Dropout(0.400000)
    (21): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (22): nn.SpatialBatchNormalization
    (23): nn.ReLU
    (24): nn.Dropout(0.400000)
    (25): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (26): nn.SpatialBatchNormalization
    (27): nn.ReLU
    (28): nn.SpatialMaxPooling(2x2, 2,2)
    (29): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (30): nn.SpatialBatchNormalization
    (31): nn.ReLU
    (32): nn.Dropout(0.400000)
    (33): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (34): nn.SpatialBatchNormalization
    (35): nn.ReLU
    (36): nn.Dropout(0.400000)
    (37): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (38): nn.SpatialBatchNormalization
    (39): nn.ReLU
    (40): nn.SpatialMaxPooling(2x2, 2,2)
    (41): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (42): nn.SpatialBatchNormalization
    (43): nn.ReLU
    (44): nn.Dropout(0.400000)
    (45): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (46): nn.SpatialBatchNormalization
    (47): nn.ReLU
    (48): nn.Dropout(0.400000)
    (49): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (50): nn.SpatialBatchNormalization
    (51): nn.ReLU
    (52): nn.SpatialMaxPooling(2x2, 2,2)
    (53): nn.View(512)
    (54): nn.Sequential {
      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
      (1): nn.Dropout(0.500000)
      (2): nn.Linear(512 -> 512)
      (3): nn.BatchNormalization
      (4): nn.ReLU
      (5): nn.Dropout(0.500000)
      (6): nn.Linear(512 -> 10)
    }
  }
}
==> loading data    
Will save at logs   
==> setting criterion   
==> configuring optimizer   
==> online epoch # 1 [batchSize = 128]  
/Users/artcfa/torch/install/bin/luajit: not enough memory..................] ETA: 30m18s | Step: 5s52ms

The error happens around batch 10 of 390, long before the first epoch has finished training.

fredowski commented 8 years ago

Hi, I had the same problem on a MacBook Pro when I tried

th train.lua --type=float

The problem is related to LuaJIT, which cannot allocate that much memory (its garbage-collected heap is limited to roughly 1-2 GB). So I installed Torch with the following option:

TORCH_LUA_VERSION=LUA51 ./install.sh

which solved the problem. This results in Torch using plain Lua 5.1 instead of LuaJIT, so the interpreter's heap limit no longer applies.
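After reinstalling, a quick way to confirm which interpreter th is actually running on (a small sketch; the global jit table only exists under LuaJIT, and collectgarbage('count') reports the Lua heap in KB):

-- prints the interpreter name and the current Lua heap usage in MB
print(jit and jit.version or _VERSION)
print(string.format('Lua heap: %.1f MB', collectgarbage('count') / 1024))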

panfengli commented 7 years ago

Check nvidia-smi. Sometimes another process occupies too much GPU memory, and you can try killing that process. I have seen /usr/lib/xorg/Xorg occupy around 1800 MB of GPU memory.
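On the Torch side, a small sketch for checking how much GPU memory is actually free before starting training (assumes cutorch is installed; cutorch.getMemoryUsage returns values in bytes):

require 'cutorch'

-- report free vs. total memory on the current CUDA device
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU memory free: %.0f / %.0f MB',
                    free / 1024^2, total / 1024^2))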