shihenw / convolutional-pose-machines-release

Code repository for Convolutional Pose Machines, CVPR'16

Error during training #6

Closed umariqb closed 8 years ago

umariqb commented 8 years ago

Hi,

I have created a lmdb file for MPI dataset without validation data. I followed everything from the readme. However, when I try to train the network I get the following error:

Check failed: shape[i] <= 2147483647 / count_ (960 vs. 383) blob size exceeds INT_MAX

Any idea?

Thanks, Umar
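For context (this explanation is not from the thread): the failed check comes from Caffe's `Blob::Reshape`, which guards against integer overflow when computing the total element count of a blob, one dimension at a time. A minimal Python sketch of that guard, with hypothetical names:

```python
INT_MAX = 2**31 - 1  # 2147483647, the limit in the error message

def checked_count(shape):
    """Mimic Caffe's Blob::Reshape overflow guard: each dimension must
    satisfy dim <= INT_MAX // running_count before being multiplied in."""
    count = 1
    for dim in shape:
        if dim > INT_MAX // count:
            # Matches the message in the error above:
            # "Check failed: shape[i] <= 2147483647 / count_ (dim vs. limit)"
            raise ValueError(
                f"blob size exceeds INT_MAX ({dim} vs. {INT_MAX // count})")
        count *= dim
    return count

# A normal 368x368 RGB data blob at batch size 1 passes easily:
print(checked_count([1, 3, 368, 368]))  # 406272 elements
```

The "960 vs. 383" in the error means that by the time a dimension of 960 was reshaped in, the running count was already so large that only 383 more would fit, which points at a corrupted or wildly wrong blob shape rather than a genuinely huge network.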

shihenw commented 8 years ago

(1) Can you check if the input image size is 368 * 368? (2) What is your batch size?

umariqb commented 8 years ago

This is how the data layer is defined:

layer {
  name: "data"
  type: "CPMData"
  top: "data"
  top: "label"
  transform_param {
    stride: 8
    max_rotate_degree: 40
    crop_size_x: 368
    crop_size_y: 368
    scale_prob: 1
    scale_min: 0.7
    scale_max: 1.3
    target_dist: 1.171
    center_perterb_max: 0
    do_clahe: false
  }
  cpmdata_param {
    source: "lmdb/MPI_train_split"
    batch_size: 1
    backend: LMDB
  }
}

I changed my batch_size to 1, otherwise it was giving me a memory overflow error. I am using Nvidia Titan with 12 GB.

shihenw commented 8 years ago

Hmm, on my 12 GB card I can use batch size 8. The prototxt seems right, though. Can you post the blob sizes (in 4 dimensions) of the layer that complained (or of all layers)?
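For reference (an aside, not from the thread): the memory footprint of a single 4-D float32 blob is just the product of its dimensions times 4 bytes, so the input data blob itself is tiny; the bulk of the 12 GB goes to intermediate feature maps and gradients. A quick sketch, with hypothetical names:

```python
def blob_bytes(n, c, h, w, dtype_bytes=4):
    """Bytes occupied by an N x C x H x W blob of float32 values."""
    return n * c * h * w * dtype_bytes

# Input data blob for 368x368 RGB images at batch size 8:
# 8 * 3 * 368 * 368 * 4 bytes, roughly 12.4 MB.
print(blob_bytes(8, 3, 368, 368))
```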

umariqb commented 8 years ago

The network is created successfully. The error occurs when the optimization starts. I have attached the output in error.txt.

shihenw commented 8 years ago

Then maybe it's related to solver.prototxt. Are you using a testing phase?

umariqb commented 8 years ago

No, I am trying to train a network without the validation images of MPII. Here is the solver.prototxt:

pose_solver.txt

umariqb commented 8 years ago

Test phase works perfectly fine. The error appears only during the training.

shihenw commented 8 years ago

Now I guess the reason is that the code has a bug when batch size = 1. Can you try a larger batch size like 2 or 4? It should fit into your 12 GB card.
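For anyone hitting the same error, the suggested workaround amounts to a one-line change in the cpmdata_param block of the training prototxt posted earlier (the value 4 is just the suggestion above, not a verified setting):

```
cpmdata_param {
  source: "lmdb/MPI_train_split"
  batch_size: 4   # batch size 1 appears to trigger the bug; 2 or 4 should fit in 12 GB
  backend: LMDB
}
```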

umariqb commented 8 years ago

Worked!!! Thanks a lot. I can use a batch size of 10 with 12 GB. Earlier I took a bit too extreme a step by using batch size 1, just to make sure the training would start at all.

Thanks again!

schelian commented 8 years ago

Hi @iqbalu or @shihenw, could you tell me which GPU card and CUDA version you are using? I could not train the networks with a Titan Black (GK110B) with 6 GB of RAM or with a Quadro M5000 with 8 GB of RAM. Now I have a Titan X (1b00) with 12 GB of RAM and still can't do it. (I am using CUDA 7.5 on Ubuntu 14.04.5; CUDA 8.0 didn't help either.)

I am using the FLIC dataset and have tried batch_sizes of 2, 4, 8, and 16. I get the error message below, which others have fixed by decreasing the batch size (link, see ibmua's comment). Are there any other parameters I can try?

I0916 12:09:14.589750 19086 net.cpp:283] Network initialization done.
I0916 12:09:14.590034 19086 solver.cpp:59] Solver scaffolding done.
I0916 12:09:14.592664 19086 caffe.cpp:212] Starting Optimization
I0916 12:09:14.592681 19086 solver.cpp:287] Solving
I0916 12:09:14.592689 19086 solver.cpp:288] Learning Rate Policy: step
F0916 12:09:14.684196 19086 math_functions.cu:420] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
    @ 0x7f640c456daa (unknown)
    @ 0x7f640c456ce4 (unknown)
    @ 0x7f640c4566e6 (unknown)
    @ 0x7f640c459687 (unknown)
    @ 0x7f640cbd74b8 caffe::caffe_gpu_rng_uniform()
    @ 0x7f640cbaf692 caffe::DropoutLayer<>::Forward_gpu()
    @ 0x7f640cb532f1 caffe::Net<>::ForwardFromTo()
    @ 0x7f640cb53667 caffe::Net<>::ForwardPrefilled()
    @ 0x7f640cba3079 caffe::Solver<>::Step()
    @ 0x7f640cba3a85 caffe::Solver<>::Solve()
    @ 0x407e53 train()
    @ 0x405891 main
    @ 0x7f640b764f45 (unknown)
    @ 0x405fa1 (unknown)
    @ (nil) (unknown)
Aborted (core dumped)