ghost opened this issue 8 years ago
Try szagoruyko's cpu branch: https://github.com/szagoruyko/cifar.torch/tree/cpu
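For reference, checking that branch out would look roughly like this (a sketch; only the branch name comes from the URL above):
git clone -b cpu https://github.com/szagoruyko/cifar.torch.git
cd cifar.torch
# or, inside an existing clone:
git fetch origin && git checkout cpu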
Tried it, same result as with my custom CPU code linked above. I started it with the following command line:
CUDA_VISIBLE_DEVICES=0 th train.lua --type float
Resulting in the following output:
{
learningRate : 1
type : "float"
momentum : 0.9
epoch_step : 25
learningRateDecay : 1e-07
batchSize : 128
model : "vgg_bn_drop"
save : "logs"
backend : "nn"
weightDecay : 0.0005
max_epoch : 300
}
==> configuring model
nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.BatchFlip
(2): nn.Copy
(3): nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> (47) -> (48) -> (49) -> (50) -> (51) -> (52) -> (53) -> (54) -> output]
(1): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.Dropout(0.300000)
(5): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
(6): nn.SpatialBatchNormalization
(7): nn.ReLU
(8): nn.SpatialMaxPooling(2x2, 2,2)
(9): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
(10): nn.SpatialBatchNormalization
(11): nn.ReLU
(12): nn.Dropout(0.400000)
(13): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
(14): nn.SpatialBatchNormalization
(15): nn.ReLU
(16): nn.SpatialMaxPooling(2x2, 2,2)
(17): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
(18): nn.SpatialBatchNormalization
(19): nn.ReLU
(20): nn.Dropout(0.400000)
(21): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(22): nn.SpatialBatchNormalization
(23): nn.ReLU
(24): nn.Dropout(0.400000)
(25): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
(26): nn.SpatialBatchNormalization
(27): nn.ReLU
(28): nn.SpatialMaxPooling(2x2, 2,2)
(29): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
(30): nn.SpatialBatchNormalization
(31): nn.ReLU
(32): nn.Dropout(0.400000)
(33): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(34): nn.SpatialBatchNormalization
(35): nn.ReLU
(36): nn.Dropout(0.400000)
(37): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(38): nn.SpatialBatchNormalization
(39): nn.ReLU
(40): nn.SpatialMaxPooling(2x2, 2,2)
(41): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(42): nn.SpatialBatchNormalization
(43): nn.ReLU
(44): nn.Dropout(0.400000)
(45): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(46): nn.SpatialBatchNormalization
(47): nn.ReLU
(48): nn.Dropout(0.400000)
(49): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
(50): nn.SpatialBatchNormalization
(51): nn.ReLU
(52): nn.SpatialMaxPooling(2x2, 2,2)
(53): nn.View(512)
(54): nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
(1): nn.Dropout(0.500000)
(2): nn.Linear(512 -> 512)
(3): nn.BatchNormalization
(4): nn.ReLU
(5): nn.Dropout(0.500000)
(6): nn.Linear(512 -> 10)
}
}
}
==> loading data
Will save at logs
==> setting criterion
==> configuring optimizer
==> online epoch # 1 [batchSize = 128]
/Users/artcfa/torch/install/bin/luajit: not enough memory..................] ETA: 30m18s | Step: 5s52ms
The error happens around batch 10/390, long before the first epoch finishes.
Hi, I had the same problem on a MacBook Pro when I tried
th train.lua --type=float
The problem is LuaJIT, which has a hard limit (on the order of 1-2 GB) on the memory it can allocate itself, no matter how much RAM the machine has. So I installed Torch with the following option:
TORCH_LUA_VERSION=LUA51 ./install.sh
which solved the problem. This makes Torch use plain Lua instead of LuaJIT.
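To make that concrete, here is a rough sketch of the reinstall, assuming the standard torch/distro checkout in ~/torch (the path, the clean.sh step, and the final th -e check are assumptions about that layout):
cd ~/torch
./clean.sh                               # remove the existing LuaJIT-based build
TORCH_LUA_VERSION=LUA51 ./install.sh     # rebuild Torch against plain Lua 5.1
th -e "print(jit and jit.version or 'plain Lua')"    # should now print 'plain Lua'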
Run nvidia-smi and check the process list. Sometimes another process occupies too much GPU memory, and you can try killing it. I have seen /usr/lib/xorg/Xorg occupy around 1800 MB of GPU memory.
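A minimal sketch of that check (standard nvidia-smi usage; note that --query-compute-apps lists only CUDA processes, while graphics clients such as Xorg show up in the plain nvidia-smi table):
nvidia-smi                               # full table, including Xorg and its memory use
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
sudo kill <PID>                          # replace <PID> with the offending process id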
I'm following the steps described in this blog entry in order to run the CIFAR classification. Preprocessing with Provider.lua works fine, but training won't seem to work due to memory problems. When running the regular command line
CUDA_VISIBLE_DEVICES=0 th train.lua
I get output in which CUDA reports being out of memory. I compiled the NVIDIA CUDA samples and ran the deviceQuery sample to get some stats on the card, and I thought maybe 1 GB of total available CUDA memory simply wasn't enough. I modified the sample code so it could run on the CPU without using CUDA (which you can find here), and now it starts up and begins training.
However, after around 9/390 training batches it crashes with the message
luajit: not enough memory
so that doesn't appear to work either. Am I doing something wrong? What can I do to run this?
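One more knob visible in the opt dump at the top is batchSize : 128; a smaller batch lowers the per-step memory footprint on both the GPU and the CPU path. Treat the exact flag name as an assumption about train.lua's option parser, but the invocation would look like:
CUDA_VISIBLE_DEVICES=0 th train.lua --batchSize 32
th train.lua --type float --batchSize 32     # combined with the CPU route from the replies above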