val-iisc / expresso


When I run tutorial-3, it reaches almost 100% immediately; something seems wrong #8

Open andyyuan78 opened 9 years ago

andyyuan78 commented 9 years ago

And when I click the details button, the graph shows up, but I can see the 'declined curve'.

crazymuse commented 9 years ago

The first notification is for the beginning of the training process. I have a video link too. You can send me screenshots of the net configuration, etc. You can also do one thing: go to the $EXPRESSO_ROOT/net/train/nameofnewnet/ folder. There you will find a shell script that you can debug very easily. This is the script that is run out of the box. Try running it; everything the script requires is in the same folder. It will help. If you find any bug or have any issue, please let me know as soon as possible. I will also look into the issue soon.

crazymuse commented 9 years ago

One more thing: be sure you are using the latest version of Caffe along with the net, as its prototxt format changed a few months ago.

andyyuan78 commented 9 years ago

First, compared with the video, I see lots of different values in the cifar10_quick_train_test.prototxt that I loaded.

Second, the result of the script is:

ubgpu@ubgpu:~/source_for_caffe/1/expresso/net/train$ cd untitled/
ubgpu@ubgpu:~/source_for_caffe/1/expresso/net/train/untitled$ sh untitled_trainscript.sh
I0615 06:20:26.907968 8381 caffe.cpp:113] Use GPU with device ID 0
I0615 06:20:29.169258 8381 caffe.cpp:121] Starting Optimization
I0615 06:20:29.169505 8381 solver.cpp:32] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "/home/ubgpu/github/expresso/net/train/untitled/untitled"
solver_mode: GPU
net: "/home/ubgpu/github/expresso/net/train/untitled/untitled_train.prototxt"
I0615 06:20:29.169667 8381 solver.cpp:70] Creating training net from net file: /home/ubgpu/github/expresso/net/train/untitled/untitled_train.prototxt
I0615 06:20:29.170207 8381 net.cpp:287] The NetState phase (0) differed from the phase (1) specified by a rule in layer untitled
I0615 06:20:29.170372 8381 net.cpp:42] Initializing net from parameters:
name: "CIFAR10_quick_test"
input: "data"
input_dim: 1
input_dim: 3
input_dim: 32
input_dim: 32
state { phase: TRAIN }
layer { name: "untitled" type: "HDF5Data" top: "data" top: "label" include { phase: TRAIN } hdf5_data_param { source: "/home/ubgpu/github/expresso/net/train/untitled/untitled_train.txt" batch_size: 50 } }
layer { name: "relu1" type: "ReLU" bottom: "pool1" top: "pool1" }
layer { name: "conv2" type: "Convolution" bottom: "pool1" top: "conv2" param { lr_mult: 1 } param { lr_mult: 2 } convolution_param { num_output: 32 pad: 2 kernel_size: 5 stride: 1 } }
layer { name: "relu2" type: "ReLU" bottom: "conv2" top: "conv2" }
layer { name: "pool2" type: "Pooling" bottom: "conv2" top: "pool2" pooling_param { pool: AVE kernel_size: 3 stride: 2 } }
layer { name: "conv3" type: "Convolution" bottom: "pool2" top: "conv3" param { lr_mult: 1 } param { lr_mult: 2 } convolution_param { num_output: 64 pad: 2 kernel_size: 5 stride: 1 } }
layer { name: "relu3" type: "ReLU" bottom: "conv3" top: "conv3" }
layer { name: "pool3" type: "Pooling" bottom: "conv3" top: "pool3" pooling_param { pool: AVE kernel_size: 3 stride: 2 } }
layer { name: "ip1" type: "InnerProduct" bottom: "pool3" top: "ip1" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 64 } }
layer { name: "ip2" type: "InnerProduct" bottom: "ip1" top: "ip2" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 10 } }
layer { name: "prob" type: "Softmax" bottom: "ip2" top: "prob" }
F0615 06:20:29.170923 8381 insert_splits.cpp:35] Unknown blob input pool1 to layer 0
*** Check failure stack trace: ***
    @ 0x7f45741fb9fd google::LogMessage::Fail()
    @ 0x7f45741fd89d google::LogMessage::SendToLog()
    @ 0x7f45741fb5ec google::LogMessage::Flush()
    @ 0x7f45741fe1be google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f457462bbae caffe::InsertSplits()
    @ 0x7f457455f821 caffe::Net<>::Init()
    @ 0x7f4574561c72 caffe::Net<>::Net()
    @ 0x7f4574640000 caffe::Solver<>::InitTrainNet()
    @ 0x7f4574640fd3 caffe::Solver<>::Init()
    @ 0x7f45746411a6 caffe::Solver<>::Solver()
    @ 0x40c690 caffe::GetSolver<>()
    @ 0x406641 train()
    @ 0x404be1 main
    @ 0x7f4573712ec5 (unknown)
    @ 0x40518d (unknown)
Aborted (core dumped)
Train Completed
ubgpu@ubgpu:~/source_for_caffe/1/expresso/net/train/untitled$
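
A note on the log above: the script prints "Train Completed" even though Caffe aborted, which may be why the GUI reported training as almost finished right away (this is an inference from the log, not something confirmed in the thread). glog prefixes each line with a severity letter (I, W, E or F), so a small Python sketch like the one below can pull the fatal lines out of a saved log. The function name `fatal_lines` and the sample text are illustrative assumptions, not part of Expresso or Caffe.

```python
import re

# glog line layout: severity letter, mmdd, time, thread id, file:line], message.
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^\]\s]+)\] (.*)$"
)


def fatal_lines(log_text):
    """Return (source, message) pairs for every fatal (F) glog line."""
    hits = []
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line.strip())
        if match and match.group(1) == "F":
            hits.append((match.group(5), match.group(6)))
    return hits


# Hypothetical excerpt, shortened from the log in this comment.
sample_log = """\
I0615 06:20:29.169258 8381 caffe.cpp:121] Starting Optimization
F0615 06:20:29.170923 8381 insert_splits.cpp:35] Unknown blob input pool1 to layer 0
Aborted (core dumped)
Train Completed
"""

for source, message in fatal_lines(sample_log):
    print(f"{source}: {message}")
```

If this prints anything, the run failed before training started, regardless of what the final status line says.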

crazymuse commented 9 years ago

When I made the video, the calculation was taking the lower bound instead of the upper bound, so yes, you will see 16 instead of 15, 8 instead of 7, and so on.
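
To make the 16-versus-15 difference concrete, here is a minimal Python sketch of the size arithmetic. It assumes the numbers refer to the pooling layers of the CIFAR-10 quick net (kernel 3, stride 2, as in the log above), a 32x32 input, and convolutions that keep the spatial size (pad 2, kernel 5, stride 1). `pooled_size` and `pooling_chain` are hypothetical helper names; to my knowledge Caffe rounds the pooling output size up, but treat the formula as an assumption and check it against your Caffe version.

```python
import math


def pooled_size(size, kernel, stride, pad=0, round_up=True):
    """Output length of one pooling dimension, rounding up or down."""
    span = size + 2 * pad - kernel
    steps = math.ceil(span / stride) if round_up else span // stride
    return steps + 1


def pooling_chain(size, n_layers, round_up):
    """Sizes after n_layers pooling layers (kernel 3, stride 2, no padding)."""
    sizes = []
    for _ in range(n_layers):
        size = pooled_size(size, kernel=3, stride=2, round_up=round_up)
        sizes.append(size)
    return sizes


print(pooling_chain(32, 3, round_up=True))   # upper bound: [16, 8, 4]
print(pooling_chain(32, 3, round_up=False))  # lower bound: [15, 7, 3]
```

The sketch ignores the extra clipping step that applies when padding is non-zero, since these pooling layers use none.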

Coming back to the issue: if you look, there is no layer with bottom: "data", which is normally present in "conv1". In short, "conv1" has been trimmed, hence the "Unknown blob input pool1" error. Make sure you have two layers before conv1, one for training and one for validation (even if you are not doing validation). Can you send me the _train.prototxt from the same folder, as well as the CIFAR train prototxt you are using? The issue is with the prototxt file alone, not with the GUI.
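
The check described here can be sketched in Python: walk the layers in order and flag any bottom blob that no earlier layer (or net-level input) has produced. This is a rough text-level scan with regular expressions, not Caffe's own parser, and `layer_blocks` and `unknown_bottoms` are hypothetical names; the sample net is a shortened copy of the one in the log.

```python
import re


def layer_blocks(prototxt):
    """Yield the text inside each `layer { ... }` block, matching braces."""
    for start in re.finditer(r"\blayer\s*\{", prototxt):
        depth, i = 1, start.end()
        while depth and i < len(prototxt):
            depth += {"{": 1, "}": -1}.get(prototxt[i], 0)
            i += 1
        yield prototxt[start.end():i - 1]


def unknown_bottoms(prototxt):
    """Return (layer name, blob) pairs whose bottom blob was never produced."""
    known = set(re.findall(r'\binput:\s*"([^"]+)"', prototxt))
    problems = []
    for block in layer_blocks(prototxt):
        name = re.search(r'\bname:\s*"([^"]+)"', block)
        for bottom in re.findall(r'\bbottom:\s*"([^"]+)"', block):
            if bottom not in known:
                problems.append((name.group(1) if name else "?", bottom))
        # Tops become available to every later layer.
        known.update(re.findall(r'\btop:\s*"([^"]+)"', block))
    return problems


# Shortened copy of the trimmed net from the log: conv1 and pool1 are missing.
trimmed_net = """
input: "data"
layer { name: "untitled" type: "HDF5Data" top: "data" top: "label" include { phase: TRAIN } }
layer { name: "relu1" type: "ReLU" bottom: "pool1" top: "pool1" }
layer { name: "conv2" type: "Convolution" bottom: "pool1" top: "conv2" }
"""

print(unknown_bottoms(trimmed_net))  # [('relu1', 'pool1')]
```

Only relu1 is reported because it re-emits pool1 as its top; Caffe itself stops at the first such blob.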

(Replying by email to Andy Yuan's comment of Mon, Jun 15, 2015, 10:53 PM.)

crazymuse commented 9 years ago

By the way, thanks. I will update the documentation on the prototxt file (as input). Also, try using the cifar_quick_train prototxt without modifying or removing the data layers; I think that should work.

crazymuse commented 9 years ago

I tried it on my system and it works correctly. Just let me know one thing: have you manually removed any data layer from the CIFAR prototxt?

andyyuan78 commented 9 years ago

No, I have not changed any files.