torrvision / crfasrnn

This repository contains the source code for the semantic image segmentation method described in the ICCV 2015 paper: Conditional Random Fields as Recurrent Neural Networks. http://crfasrnn.torr.vision/

train with GPU version #98

Open maloletnik opened 7 years ago

maloletnik commented 7 years ago

Hi,

I'm using the GPU version (https://github.com/bittnt/caffe.git) to train on 2 classes with examples/segmentationcrfasrnn

I'm getting weird results before running out of memory (using a K80 on an AWS p2 instance).

When the train starts:

[screenshot: training log at startup]

And then I get many (MANY) test net outputs (up to #5499999) before running out of memory:

[screenshots: test net output log and the out-of-memory error]

The LMDB was made with the script from https://github.com/martinkersner/train-CRF-RNN.git, using my own images and labels (only 74 images: 64 for training and 7 for testing).
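For context, here is a minimal sketch (not the train-CRF-RNN script itself) of how an image or label LMDB for Caffe is typically written with pycaffe and the `lmdb` package; the function name, key format, and map size below are illustrative assumptions.

```python
# Hypothetical helper, not the train-CRF-RNN script: writes a list of numpy
# arrays (HxWxC images or HxW label maps) into a Caffe-style LMDB.
import lmdb
import numpy as np
import caffe


def write_caffe_lmdb(db_path, arrays):
    env = lmdb.open(db_path, map_size=1 << 40)  # generous map size (assumption)
    with env.begin(write=True) as txn:
        for i, arr in enumerate(arrays):
            # Caffe's Datum expects channels x height x width
            if arr.ndim == 2:
                arr = arr[np.newaxis, ...]
            else:
                arr = arr.transpose(2, 0, 1)
            datum = caffe.io.array_to_datum(arr.astype(np.uint8))
            txn.put('{:08d}'.format(i).encode('ascii'),
                    datum.SerializeToString())
    env.close()
```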

Please advise.

Thanks

KleinYuan commented 7 years ago

I have a very similar issue and need help.

I used this repo's script to label and generate the LMDB for VOC (20 classes), and used this caffe branch with the PR #2016 changes, updating the CRF layers in the prototxt file per @bittnt's comments here.

Then I downloaded the pre-trained FCN-8s model and, with NCCL installed, tried to train with 3 GPUs (GeForce GTX 980 Ti, each with 6 GB of memory, 18 GB in total).
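For reference, a minimal single-GPU pycaffe sketch of this fine-tuning setup; the solver and weight file names are placeholders, and the actual multi-GPU/NCCL run described above goes through the `caffe train --gpu 0,1,2` command line rather than this Python path.

```python
# Sketch only: file names below are assumptions, not files shipped with crfasrnn.
import caffe

caffe.set_device(0)   # single GPU; the NCCL multi-GPU run uses the caffe CLI instead
caffe.set_mode_gpu()

solver = caffe.SGDSolver('solver.prototxt')        # assumed solver definition
solver.net.copy_from('fcn-8s-pascal.caffemodel')   # initialize from pre-trained FCN-8s weights
solver.step(1)                                     # one forward/backward pass as a smoke test
```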

It runs well at the beginning, but after iteration 0 it throws this error:

I0517 00:29:03.303841 16941 solver.cpp:397]     Test net output #5499998: pred = 0.629962
I0517 00:29:03.303849 16941 solver.cpp:397]     Test net output #5499999: pred = 0.62284
0
I0517 00:29:24.034217 16941 solver.cpp:218] Iteration 0 (0 iter/s, 4916.08s/500 iters), loss = 211435
I0517 00:29:24.034253 16941 solver.cpp:237]     Train net output #0: loss = 211435 (* 1 = 211435 loss)
I0517 00:29:24.034265 16941 sgd_solver.cpp:105] Iteration 0, lr = 1e-13
0
0
0
F0517 00:29:28.590405 16941 modified_permutohedral.cu:437] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x2b4ea8a70daa  (unknown)
    @     0x2b4ea8a70ce4  (unknown)
    @     0x2b4ea8a706e6  (unknown)
    @     0x2b4ea8a73687  (unknown)
F0517 00:29:28.597764 16950 modified_permutohedral.cu:437] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x2b4ea74f108e  caffe::gpu_compute<>()
    @     0x2b4ea8a70daa  (unknown)
    @     0x2b4ea8a70ce4  (unknown)
    @     0x2b4ea74ef308  caffe::ModifiedPermutohedral::compute_gpu()
    @     0x2b4ea8a706e6  (unknown)
    @     0x2b4ea8a73687  (unknown)
    @     0x2b4ea74d8893  caffe::MeanfieldIteration<>::Forward_gpu()
    @     0x2b4ea74f108e  caffe::gpu_compute<>()
    @     0x2b4ea74dbd61  caffe::MultiStageMeanfieldLayer<>::Forward_gpu()
    @     0x2b4ea748ff43  caffe::Net<>::ForwardFromTo()
    @     0x2b4ea74ef308  caffe::ModifiedPermutohedral::compute_gpu()
    @     0x2b4ea7490307  caffe::Net<>::Forward()
    @     0x2b4ea74d8893  caffe::MeanfieldIteration<>::Forward_gpu()
    @     0x2b4ea74a66f8  caffe::Solver<>::Step()
    @     0x2b4ea74dbd61  caffe::MultiStageMeanfieldLayer<>::Forward_gpu()
    @     0x2b4ea74a70aa  caffe::Solver<>::Solve()
    @     0x2b4ea748ff43  caffe::Net<>::ForwardFromTo()
    @     0x2b4ea7332024  caffe::NCCL<>::Run()
    @           0x40a89f  train()
    @           0x40812c  main
    @     0x2b4ea7490307  caffe::Net<>::Forward()
    @     0x2b4ea95f5f45  (unknown)
    @           0x408a01  (unknown)
    @     0x2b4ea74a66f8  caffe::Solver<>::Step()
    @              (nil)  (unknown)
make: *** [train] Aborted (core dumped)

This confused me a lot.

The initial training steps consume a steady 11.7 GB of memory, and I kept an eye on `nvidia-smi -l` the whole time without catching any spikes in memory consumption.
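For reference, a small monitoring sketch (illustrative, not from any repo mentioned here) that polls per-GPU memory once a second via `nvidia-smi --query-gpu`, similar in spirit to watching `nvidia-smi -l`:

```python
# Illustrative monitor: polls per-GPU memory via nvidia-smi every second.
import subprocess
import time


def poll_gpu_memory(interval=1.0):
    while True:
        out = subprocess.check_output(
            ['nvidia-smi',
             '--query-gpu=index,memory.used,memory.total',
             '--format=csv,noheader,nounits'])
        print(out.decode().strip())
        time.sleep(interval)


if __name__ == '__main__':
    poll_gpu_memory()
```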

Is the GPU too weak, so that I need better ones (like 3 GTX 1080 Ti instead)? 18 GB of memory feels like a lot to me. Or did I get some step wrong? I would really appreciate any suggestions. @bittnt

Once I got my training done, I would be happy to open a repo to share all those details.

bittnt commented 7 years ago

Reduce the number of CRF mean-field iterations from 10/5 to 2/3.

KleinYuan commented 7 years ago

@bittnt I switched to a Tesla K80 GPU and finally got the training running! Thanks for your advice; I will also try reducing the number of CRF iterations.

Also, I organized an updated document covering the training process from scratch, including all those tweaks/details, in this repo: train-crfasrnn. Hope it helps.

damiVongola commented 7 years ago

@bittnt Could you please elaborate on what you mean by reducing the number of CRF iterations and, if possible, show us which part of the code we have to change? Thanks!

bittnt commented 7 years ago

Change the number of iterations in the MultiStageMeanfield layer of your prototxt:

```
layer {
  name: "inference1" # if you set the name to "inference1", the code will load parameters from the caffemodel.
  type: "MultiStageMeanfield"
  bottom: "unary"
  bottom: "Q0"
  bottom: "data"
  top: "pred"
  param {
    lr_mult: 10000 # learning rate for W_G
  }
  param {
    lr_mult: 10000 # learning rate for W_B
  }
  param {
    lr_mult: 1000 # learning rate for the compatibility transform matrix
  }
  multi_stage_meanfield_param {
    num_iterations: 3 # change this to reduce the number of iterations.
    compatibility_mode: POTTS # initialize the compatibility transform matrix with a matrix whose diagonal is -1.
    threshold: 2
    theta_alpha: 160
    theta_beta: 3
    theta_gamma: 3
    spatial_filter_weight: 3
    bilateral_filter_weight: 5
  }
}
```
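If you prefer to apply this change programmatically, here is a hedged sketch that rewrites `num_iterations` for every MultiStageMeanfield layer in a prototxt. It assumes the crfasrnn Caffe fork is on PYTHONPATH (so `caffe_pb2` knows about `multi_stage_meanfield_param`); the file names are illustrative.

```python
# Sketch: lower num_iterations for all MultiStageMeanfield layers in a prototxt.
# File names are placeholders; multi_stage_meanfield_param comes from the
# crfasrnn Caffe fork's caffe.proto.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

net = caffe_pb2.NetParameter()
with open('train_val.prototxt') as f:              # assumed input prototxt
    text_format.Merge(f.read(), net)

for layer in net.layer:
    if layer.type == 'MultiStageMeanfield':
        layer.multi_stage_meanfield_param.num_iterations = 3

with open('train_val_3iter.prototxt', 'w') as f:   # assumed output name
    f.write(text_format.MessageToString(net))
```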
DavidQiuChao commented 7 years ago

Hello @bittnt. I followed your suggestion and changed `num_iterations` to 3, but I still get the weird output: it prints a lot of '0's. When I first ran `make runtest` on the latest caffe with crfasrnn, it also printed a lot of '0's during the `multi_stage_meanfield.cu` tests.

bittnt commented 7 years ago

Thanks. It would be good to just comment out that line. I have done so. You might need to update the caffe code (probably `make clean` and rebuild it).

DavidQiuChao commented 7 years ago

Hi @bittnt. I have updated to the new caffe code, but `runtest` still fails and gets stuck during test 2. From the console output, it appears the MultiStageMeanfield layer does not pass the gradient check. Am I missing something when running `runtest`?

[screenshots: runtest console output showing the gradient-check failure]

damiVongola commented 7 years ago

@bittnt thanks! i reduced the number of iterations as you suggested and everything is working so far. :)

DavidQiuChao commented 7 years ago

@damiVongola have you passed caffe's `runtest`? I fail at the gradient check for the MultiStageMeanfield layer.

nathanin commented 7 years ago

Chiming in here to say I also fail `runtest` at that layer. It prints a ton of 0's and then apparently hangs; I ended up removing the test.

Training models seems to work - I get improved segmentation over my reference methods. Maybe it's a bug in the test itself?