maloletnik opened 7 years ago
I have a very similar issue and need help.
I used this repo's script to label the data and generate the LMDB for VOC (20 classes), and used this caffe branch with the #2016 PR changes and the CRF layers in the prototxt updated according to @bittnt's comments here.
Then I downloaded the pre-trained FCN-8s model and, with NCCL installed, tried to train on 3 GPUs (GeForce GTX 980 Ti, 6 GB memory each, 18 GB in total).
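For reference, the run was launched with roughly the standard Caffe multi-GPU command sketched below; the solver and weights filenames are placeholders, not the exact files from this setup:

```
# Sketch of a 3-GPU training launch with the stock caffe tool (requires an NCCL build).
# Replace the solver/weights paths with your own files.
./build/tools/caffe train \
    --solver=TVG_CRFRNN_solver.prototxt \
    --weights=fcn-8s-pascal.caffemodel \
    --gpu=0,1,2
```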
It runs well at the beginning, but right after iteration 0 it throws this error:
```
I0517 00:29:03.303841 16941 solver.cpp:397] Test net output #5499998: pred = 0.629962
I0517 00:29:03.303849 16941 solver.cpp:397] Test net output #5499999: pred = 0.62284
0
I0517 00:29:24.034217 16941 solver.cpp:218] Iteration 0 (0 iter/s, 4916.08s/500 iters), loss = 211435
I0517 00:29:24.034253 16941 solver.cpp:237] Train net output #0: loss = 211435 (* 1 = 211435 loss)
I0517 00:29:24.034265 16941 sgd_solver.cpp:105] Iteration 0, lr = 1e-13
0
0
0
F0517 00:29:28.590405 16941 modified_permutohedral.cu:437] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x2b4ea8a70daa (unknown)
@ 0x2b4ea8a70ce4 (unknown)
@ 0x2b4ea8a706e6 (unknown)
@ 0x2b4ea8a73687 (unknown)
F0517 00:29:28.597764 16950 modified_permutohedral.cu:437] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x2b4ea74f108e caffe::gpu_compute<>()
@ 0x2b4ea8a70daa (unknown)
@ 0x2b4ea8a70ce4 (unknown)
@ 0x2b4ea74ef308 caffe::ModifiedPermutohedral::compute_gpu()
@ 0x2b4ea8a706e6 (unknown)
@ 0x2b4ea8a73687 (unknown)
@ 0x2b4ea74d8893 caffe::MeanfieldIteration<>::Forward_gpu()
@ 0x2b4ea74f108e caffe::gpu_compute<>()
@ 0x2b4ea74dbd61 caffe::MultiStageMeanfieldLayer<>::Forward_gpu()
@ 0x2b4ea748ff43 caffe::Net<>::ForwardFromTo()
@ 0x2b4ea74ef308 caffe::ModifiedPermutohedral::compute_gpu()
@ 0x2b4ea7490307 caffe::Net<>::Forward()
@ 0x2b4ea74d8893 caffe::MeanfieldIteration<>::Forward_gpu()
@ 0x2b4ea74a66f8 caffe::Solver<>::Step()
@ 0x2b4ea74dbd61 caffe::MultiStageMeanfieldLayer<>::Forward_gpu()
@ 0x2b4ea74a70aa caffe::Solver<>::Solve()
@ 0x2b4ea748ff43 caffe::Net<>::ForwardFromTo()
@ 0x2b4ea7332024 caffe::NCCL<>::Run()
@ 0x40a89f train()
@ 0x40812c main
@ 0x2b4ea7490307 caffe::Net<>::Forward()
@ 0x2b4ea95f5f45 (unknown)
@ 0x408a01 (unknown)
@ 0x2b4ea74a66f8 caffe::Solver<>::Step()
@ (nil) (unknown)
make: *** [train] Aborted (core dumped)
```
This confused me a lot.
The initial training steps consume a steady 11.7 GB of memory, and I kept an eye on `nvidia-smi -l` the whole time without catching any spikes in memory consumption.
Are the GPUs too weak, so that I need better ones (say, 3x GTX 1080 Ti instead)? 18 GB of memory feels like a lot to me. Or did I get some step wrong? I would really appreciate any suggestions. @bittnt
Once I get the training done, I will be happy to open a repo to share all those details.
Reduce the number of layers of CRFs from 10/5 to 2/3.
@bittnt I switched to a Tesla K80 GPU and finally got the training running! Thanks for your advice; I will also try out reducing the number of CRF layers.
Also, I put together an updated document for the training process from scratch, including all those tweaks/details, in this repo: train-crfasrnn. Hope it helps.
@bittnt Could you please elaborate on what you mean by reducing the number of CRF layers, and if possible show us which part of the code we have to change? Thanks!
Change the number of mean-field iterations (`num_iterations`) in the MultiStageMeanfield layer of your prototxt:
name: "inference1"#if you set name "inference1", code will load parameters from caffemodel.
type: "MultiStageMeanfield"
bottom: "unary"
bottom: "Q0"
bottom: "data"
top: "pred"
param {
lr_mult: 10000#learning rate for W_G
}
param {
lr_mult: 10000#learning rate for W_B
}
param {
lr_mult: 1000 #learning rate for compatiblity transform matrix
}
multi_stage_meanfield_param {
num_iterations: 3 # Change this to reduce number of iterations.
compatibility_mode: POTTS#Initialize the compatilibity transform matrix with a matrix whose diagonal is -1.
threshold: 2
theta_alpha: 160
theta_beta: 3
theta_gamma: 3
spatial_filter_weight: 3
bilateral_filter_weight: 5
}
}```
Hello @bittnt, I followed your suggestion and changed `num_iterations` to 3, but I still get the weird output: it prints a lot of '0's. When I first ran `make runtest` on the latest caffe with crfrnn, it printed a lot of '0's too, during the multi_stage_meanfield.cu test.
Thanks. It would be good to just comment out that line; I have done so. You might need to update the caffe code (probably `make clean` and rebuild).
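For what it's worth, a typical clean rebuild with the standard Caffe Makefile targets looks something like this (adjust -j to your machine):

```
# Rebuild caffe from scratch after updating the code, then rerun the tests.
make clean
make all -j8
make test -j8
make runtest
```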
Hi @bittnt, I have updated to the new caffe code, but `runtest` still fails and gets stuck during test 2. From the console output, it appears the MultiStageMeanfield layer does not pass the gradient check. Am I missing something when running runtest?
@bittnt Thanks! I reduced the number of iterations as you suggested and everything is working so far. :)
@damiVongola Have you passed caffe's `runtest`? I failed at the gradient check for the MultiStageMeanfield layer.
Chiming in here to say I also fail `runtest` at that layer. It would print a ton of 0's and then apparently hang, so I ended up removing the test.
Training models seems to work: I get improved segmentation over my reference methods. Maybe it's a bug in the test itself?
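Instead of deleting the test source, another option is to run the test binary directly and exclude that suite with a gtest filter; this is only a sketch, and the binary path and test name pattern may differ in your build:

```
# Run the caffe test binary but skip anything matching the MultiStageMeanfield tests.
# The path assumes a Makefile build; adjust it (and the filter) for your setup.
./build/test/test_all.testbin --gtest_filter='-*MultiStageMeanfield*'
```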
Hi,
I'm using the GPU version (https://github.com/bittnt/caffe.git) to train on 2 classes with examples/segmentationcrfasrnn.
I'm getting weird results before running out of memory (using a K80 on an AWS p2 instance).
When the training starts:
And then I get many (MANY) test iterations (5499999 iterations) before running out of memory:
The LMDB was made with the script from https://github.com/martinkersner/train-CRF-RNN.git, on my own images and labels (only 74 images, 64 for training and 7 for testing).
Please advise.
Thanks