leovandriel / caffe2_cpp_tutorial

C++ transcripts of the Caffe2 Python tutorials and other C++ example code
BSD 2-Clause "Simplified" License

Fast Retrain Error #42

Open hoebd opened 6 years ago

hoebd commented 6 years ago

Hi, I tried to retrain GoogleNet and tested it with the default images in res/images. When I execute "./bin/train --model googlenet --folder res/images --layer pool5/7x7_s1" I get the following error:

## CNN Training Example ##

E0125 17:12:00.842572 18837 common_gpu.cc:70] Found an unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. I will set the available devices to be zero.
optimizer: adam
device: cudnn
using cuda: true
dump-model: false
model: googlenet
layer: pool5/7x7_s1
image-dir: res/images
db-type: leveldb
size: 224
iters: 1000
test-runs: 50
batch: 64
lr: 0.0001
display: false
reshape: false
matrix: false

2 labels found:
  0: cat #2
  1: dog #2
4 files found
split model.. (at pool5/7x7_s1)
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at common_gpu.cc:132] error == cudaSuccess. 30 vs 0. Error at: /home/daniel/caffe2/caffe2/core/common_gpu.cc:132: unknown error
*** Aborted at 1516896720 (unix time) try "date -d @1516896720" if you are using GNU date ***
PC: @     0x7ff6526ad428 gsignal
*** SIGABRT (@0x3e800004995) received by PID 18837 (TID 0x7ff6626eafc0) from PID 18837; stack trace: ***
    @     0x7ff65bef5390 (unknown)
    @     0x7ff6526ad428 gsignal
    @     0x7ff6526af02a abort
    @     0x7ff652ff084d __gnu_cxx::__verbose_terminate_handler()
    @     0x7ff652fee6b6 (unknown)
    @     0x7ff652fee701 std::terminate()
    @     0x7ff652fee969 __cxa_rethrow
    @     0x7ff661c0a835 caffe2::CreateOperator()
    @     0x7ff661c5f080 caffe2::SimpleNet::SimpleNet()
    @     0x7ff661c434d6 caffe2::CreateNet()
    @     0x7ff661c43c9d caffe2::CreateNet()
    @     0x7ff661be11e2 caffe2::Workspace::RunNetOnce()
    @           0x5550bf caffe2::preprocess()
    @           0x557d8f caffe2::run()
    @           0x559f66 main
    @     0x7ff652698830 __libc_start_main
    @           0x551229 _start
    @                0x0 (unknown)
Aborted (core dumped)

Could someone help me with this error please?

leovandriel commented 6 years ago

Hi, a pretrained GoogleNet model needs to be downloaded first. This requires the cURL library to be installed.

On Thu, Jan 25, 2018 at 7:49 AM hoebd notifications@github.com wrote:

Hi, I tried to retrain GoogleNet and tested it with the default images in res/images. When I execute "./bin/train --model googlenet --folder res/images --layer pool5/7x7_s1" I get the following error:

## CNN Training Example ##

E0125 16:45:49.228271 18294 common_gpu.cc:70] Found an unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. I will set the available devices to be zero.
optimizer: adam
device: cudnn
using cuda: true
dump-model: false
model: googlenet
layer: pool5/7x7_s1
image-dir: res/images
db-type: leveldb
size: 224
iters: 1000
test-runs: 50
batch: 64
lr: 0.0001
display: false
reshape: false
matrix: false

3 labels found:
  0: stapler #49
  1: cat #42
  2: dog #32
123 files found
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at keeper.h:123] . model download not supported, install cURL
*** Aborted at 1516895149 (unix time) try "date -d @1516895149" if you are using GNU date ***
PC: @     0x7f4a2b2a8428 gsignal
*** SIGABRT (@0x3e800004776) received by PID 18294 (TID 0x7f4a3b2e5fc0) from PID 18294; stack trace: ***
    @     0x7f4a34af0390 (unknown)
    @     0x7f4a2b2a8428 gsignal
    @     0x7f4a2b2aa02a abort
    @     0x7f4a2bbeb84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7f4a2bbe96b6 (unknown)
    @     0x7f4a2bbe9701 std::terminate()
    @     0x7f4a2bbe9919 __cxa_throw
    @           0x5722fc caffe2::Keeper::download()
    @           0x5723c3 caffe2::Keeper::ensureFile()
    @           0x5724e4 caffe2::Keeper::ensureModel()
    @           0x57254a caffe2::Keeper::addTrainedModel()
    @           0x572f22 caffe2::Keeper::AddModel()
    @           0x557815 caffe2::run()
    @           0x559f66 main
    @     0x7f4a2b293830 __libc_start_main
    @           0x551229 _start
    @                0x0 (unknown)
Aborted (core dumped)

Could someone help me with this error please?


hoebd commented 6 years ago

Hi, thank you very much for your fast reply. I have already downloaded the GoogleNet model and curl is installed. Sorry for the confusion caused by my first error message about the missing model.

leovandriel commented 6 years ago

This second issue seems to be related to the warning displayed at the top, which ends in "I will set the available devices to be zero." When running the model, Caffe2 hits an error from within the CUDA runtime (see common_gpu.cc). According to the docs, error 30 indicates that an unknown internal error has occurred, which is arguably not very helpful. I can't say what the underlying problem is, but I'm fairly sure it's a general issue with your setup and not related to this repo. Did you get any Caffe2 or CUDA demos to run?

hoebd commented 6 years ago

Yes, you are right, it seems to have been an NVIDIA driver issue. I reinstalled the NVIDIA driver and successfully rebuilt Caffe2, but now I get another error.

## CNN Training Example ##

optimizer: adam
device: cudnn
using cuda: true
dump-model: false
model: googlenet
layer: pool5/7x7_s1
image-dir: res/images
db-type: leveldb
size: 224
iters: 1000
test-runs: 50
batch: 64
lr: 0.0001
display: false
reshape: false
matrix: false

2 labels found:       
  0: cat #2
  1: dog #2
4 files found 
split model.. (at pool5/7x7_s1)
4 images cached                                   

training..
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at tensor.h:671] i < dims_.size(). 0 vs 0. Exceeding ndim limit Error from operator: 
output: "loss3/classifier_w" type: "XavierFill" device_option { device_type: 1 }
** while accessing output: loss3/classifier_w
*** Aborted at 1516970976 (unix time) try "date -d @1516970976" if you are using GNU date ***
PC: @     0x7f05ca754428 gsignal
*** SIGABRT (@0x3e800005ca9) received by PID 23721 (TID 0x7f05da7ecec0) from PID 23721; stack trace: ***
    @     0x7f05d3f9c390 (unknown)
    @     0x7f05ca754428 gsignal
    @     0x7f05ca75602a abort
    @     0x7f05cb09784d __gnu_cxx::__verbose_terminate_handler()
    @     0x7f05cb0956b6 (unknown)
    @     0x7f05cb095701 std::terminate()
    @     0x7f05cb095969 __cxa_rethrow
    @           0x5fb54f caffe2::Operator<>::Run()
    @     0x7f05d9d17378 caffe2::SimpleNet::Run()
    @     0x7f05d9c9c75a caffe2::Workspace::RunNetOnce()
    @           0x551755 caffe2::run_trainer()
    @           0x558cc1 caffe2::run()
    @           0x559f66 main
    @     0x7f05ca73f830 __libc_start_main
    @           0x551229 _start
    @                0x0 (unknown)

leovandriel commented 6 years ago

Thanks for persisting here. This is indeed a bug. I'll take a look today.
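
For anyone reading the trace above: the enforce message `i < dims_.size(). 0 vs 0` reads as index 0 being checked against a dimension count of 0, which suggests the fill operator for `loss3/classifier_w` was asked for the first dimension of a tensor that had no shape. The following is only a minimal sketch of that failure mode in plain C++, not Caffe2's code and not necessarily what the eventual fix addresses. `ToyTensor`, `xavier_fill` and `try_fill` are hypothetical names made up for illustration, and the fill rule (uniform in ±sqrt(3 / fan_in), with fan_in taken as total size divided by the first dimension) is an assumption about how a Xavier-style fill typically works.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a tensor: just a shape and flat data.
// This is NOT Caffe2's Tensor class.
struct ToyTensor {
  std::vector<std::size_t> dims;
  std::vector<float> data;

  // Bounds-checked dimension lookup, modelled on the check in the log:
  // asking for dimension i of a tensor with fewer than i + 1 dims fails.
  std::size_t dim(std::size_t i) const {
    if (i >= dims.size()) {
      throw std::out_of_range("i < dims.size(). " + std::to_string(i) +
                              " vs " + std::to_string(dims.size()) +
                              ". Exceeding ndim limit");
    }
    return dims[i];
  }
};

// Xavier-style uniform fill (assumed rule): values in [-scale, scale)
// with scale = sqrt(3 / fan_in) and fan_in = total size / first dimension.
void xavier_fill(ToyTensor& t, const std::vector<std::size_t>& shape,
                 std::mt19937& rng) {
  t.dims = shape;
  const std::size_t first = t.dim(0);  // throws when the shape is empty
  std::size_t size = 1;
  for (std::size_t d : shape) size *= d;
  t.data.assign(size, 0.0f);
  if (size == 0) return;  // nothing to fill; also avoids dividing by zero
  const std::size_t fan_in = size / first;
  const float scale = std::sqrt(3.0f / static_cast<float>(fan_in));
  std::uniform_real_distribution<float> dist(-scale, scale);
  for (float& v : t.data) v = dist(rng);
}

// Returns "ok" on success, or the error text when the shape is unusable.
std::string try_fill(const std::vector<std::size_t>& shape) {
  ToyTensor t;
  std::mt19937 rng(42);
  try {
    xavier_fill(t, shape, rng);
    return "ok";
  } catch (const std::out_of_range& e) {
    return e.what();
  }
}
```

Calling `try_fill({2, 1024})` succeeds, while `try_fill({})` returns the same kind of "0 vs 0" message seen in the log. In this toy model the error disappears as soon as the fill is given an explicit shape, for example the number of labels by the feature size.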

leovandriel commented 6 years ago

I pushed a fix in commit 94882795. Let me know if that works.