senlinuc / caffe_ocr

An experimental project for research on mainstream OCR algorithms; currently implements a CNN+BLSTM+CTC architecture.
1.26k stars · 535 forks

Is training data in leveldb or lmdb format supported? #18

Closed prfans closed 6 years ago

prfans commented 6 years ago

I generated training data in leveldb format. During training, the loss is fine on the first iteration, but after that it is always an invalid number, i.e. a NaN as if something had been divided by zero. Thanks.

prfans commented 6 years ago

The output is as follows:

```
I1115 07:27:27.556200  3620 net.cpp:283] Network initialization done.
I1115 07:27:27.556200  3620 solver.cpp:60] Solver scaffolding done.
I1115 07:27:27.560199  3620 caffe.cpp:266] Starting Optimization
I1115 07:27:27.560199  3620 solver.cpp:284] Solving ResNet-18-train
I1115 07:27:27.560199  3620 solver.cpp:285] Learning Rate Policy: step
I1115 07:27:27.770200  3620 solver.cpp:231] Iteration 0, loss = 50.6965
I1115 07:27:27.771199  3620 solver.cpp:249]     Train net output #0: ctcloss = 50.6965 (* 1 = 50.6965 loss)
I1115 07:27:27.771199  3620 sgd_solver.cpp:106] Iteration 0, lr = 0.0001
I1115 07:27:46.132200  3620 solver.cpp:231] Iteration 200, loss = 1.#QNAN
I1115 07:27:46.133199  3620 solver.cpp:249]     Train net output #0: ctcloss = -1.#IND (* 1 = -1.#IND loss)
I1115 07:27:46.133199  3620 sgd_solver.cpp:106] Iteration 200, lr = 0.0001
I1115 07:28:04.477200  3620 solver.cpp:231] Iteration 400, loss = 1.#QNAN
I1115 07:28:04.477200  3620 solver.cpp:249]     Train net output #0: ctcloss = -1.#IND (* 1 = -1.#IND loss)
I1115 07:28:04.478199  3620 sgd_solver.cpp:106] Iteration 400, lr = 0.0001
I1115 07:28:22.871199  3620 solver.cpp:231] Iteration 600, loss = 1.#QNAN
I1115 07:28:22.871199  3620 solver.cpp:249]     Train net output #0: ctcloss = -1.#IND (* 1 = -1.#IND loss)
I1115 07:28:22.871199  3620 sgd_solver.cpp:106] Iteration 600, lr = 0.0001
I1115 07:28:23.712199  3620 solver.cpp:459] Snapshotting to binary proto file ResNet-train_val_iter_610.caffemodel
I1115 07:28:23.764199  3620 sgd_solver.cpp:273] Snapshotting solver state to binary proto file ResNet-train_val_iter_610.solverstate
I1115 07:28:23.775199  3620 solver.cpp:306] Optimization stopped early.
I1115 07:28:23.775199  3620 caffe.cpp:269] Optimization Done.
```

senlinuc commented 6 years ago

I use leveldb as well. A loss like this usually means the network or the data is at fault. Check your data first; on the network side, you can enable `debug_info: true` and see at which layer invalid values first appear.
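For reference, `debug_info` is a field of Caffe's `SolverParameter`; a minimal solver sketch with it enabled (the net path and hyperparameter values below are illustrative placeholders, not taken from this repo):

```
# solver.prototxt sketch -- illustrative values only
net: "examples/ocr/train_val.prototxt"   # placeholder path
base_lr: 0.0001
lr_policy: "step"
debug_info: true   # print per-layer [Forward]/[Backward] blob statistics each iteration
```

With this flag on, the training log includes the `[Forward] Layer ..., top blob ... data:` lines shown below, which is how you can locate the first layer producing invalid values.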

prfans commented 6 years ago

caffe.zip

The network is the one from the ocr directory, unmodified, and the data should be fine too. Below is the log around the first failure; the zip file contains the full debug log:

```
I1115 07:55:39.865200  3496 net.cpp:610]     [Forward] Layer lstm2, top blob lstm2 data: 0.902337
I1115 07:55:39.865200  3496 net.cpp:622]     [Forward] Layer lstm2, param blob 0 data: 0.0913533
I1115 07:55:39.865200  3496 net.cpp:622]     [Forward] Layer lstm2, param blob 1 data: 0.0894198
I1115 07:55:39.866199  3496 net.cpp:622]     [Forward] Layer lstm2, param blob 2 data: 0.0458391
I1115 07:55:39.867199  3496 net.cpp:610]     [Forward] Layer lstm2-reverse1, top blob rlstm2_input data: 0.894673
I1115 07:55:39.871199  3496 net.cpp:610]     [Forward] Layer rlstm2, top blob rlstm2-output data: 0.915383
I1115 07:55:39.871199  3496 net.cpp:622]     [Forward] Layer rlstm2, param blob 0 data: 0.0931013
I1115 07:55:39.871199  3496 net.cpp:622]     [Forward] Layer rlstm2, param blob 1 data: 0.0917775
I1115 07:55:39.871199  3496 net.cpp:622]     [Forward] Layer rlstm2, param blob 2 data: 0.0503878
I1115 07:55:39.872200  3496 net.cpp:610]     [Forward] Layer lstm2-reverse2, top blob rlstm2 data: 0.915383
I1115 07:55:39.873199  3496 net.cpp:610]     [Forward] Layer blstm2, top blob blstm2 data: 0.90886
I1115 07:55:39.873199  3496 net.cpp:610]     [Forward] Layer blstm_sum, top blob blstm_sum data: 1.80353
I1115 07:55:39.874199  3496 net.cpp:610]     [Forward] Layer fc1x, top blob fc1x data: 26555.6
I1115 07:55:39.874199  3496 net.cpp:622]     [Forward] Layer fc1x, param blob 0 data: 27.1416
I1115 07:55:39.874199  3496 net.cpp:622]     [Forward] Layer fc1x, param blob 1 data: 32.0825
I1115 07:55:39.876199  3496 net.cpp:610]     [Forward] Layer ctcloss, top blob ctcloss data: 1.#QNAN
I1115 07:55:39.876199  3496 net.cpp:638]     [Backward] Layer ctcloss, bottom blob fc1x diff: 0.1
I1115 07:55:39.876199  3496 net.cpp:638]     [Backward] Layer fc1x, bottom blob blstm_sum diff: 26.8008
I1115 07:55:39.876199  3496 net.cpp:649]     [Backward] Layer fc1x, param blob 0 diff: 201.995
I1115 07:55:39.877199  3496 net.cpp:649]     [Backward] Layer fc1x, param blob 1 diff: 112
```

senlinuc commented 6 years ago

```
layer {
  name: "fc1x"
  type: "InnerProduct"
  bottom: "blstm_sum"
  top: "fc1x"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
    axis: 2
  }
}
```

Is your `num_output` correct? Also, is your blank label 0?

prfans commented 6 years ago

Thanks a lot, that solved it. Each of my image samples has a fixed label of 10 characters. So should this be set to the total number of classes, e.g. 36 for A-Z plus 0-9?

senlinuc commented 6 years ago

Yes, the output should be set to the total number of classes + 1 (to include the blank), not the number of characters per sample.
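For example, for an A-Z plus 0-9 alphabet with the blank mapped to label 0, `num_output` in the fc1x layer above would be 37; a sketch of just the changed parameter block (other fields as in the prototxt quoted earlier):

```
inner_product_param {
  num_output: 37  # 36 character classes + 1 CTC blank
  weight_filler { type: "xavier" }
  bias_filler { type: "constant" value: 0 }
  axis: 2
}
```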

prfans commented 6 years ago

ok.

shenjackson commented 5 years ago

```
F1214 16:20:31.946266  8726 warp_ctc_losslayer.cpp:47] Check failed: N == label_seq->num() (320 vs. 32)
*** Check failure stack trace: ***
    @     0x7fd20dbe3e6d  (unknown)
    @     0x7fd20dbe5ced  (unknown)
    @     0x7fd20dbe3a5c  (unknown)
    @     0x7fd20dbe663e  (unknown)
    @     0x7fd22559fdbc  caffe::WarpCTCLossLayer<>::LayerSetUp()
    @     0x7fd2255c189f  caffe::Net<>::Init()
    @     0x7fd2255c36d2  caffe::Net<>::Net()
    @     0x7fd2255cbabd  caffe::Solver<>::InitTrainNet()
    @     0x7fd2255cbfc3  caffe::Solver<>::Init()
    @     0x7fd2255cc27f  caffe::Solver<>::Solver()
    @     0x7fd2255d87e1  caffe::Creator_NesterovSolver<>()
    @           0x40eeb8  caffe::SolverRegistry<>::CreateSolver()
    @           0x40a5df  train()
    @           0x407c6c  main
    @     0x7fd1f7f7c445  __libc_start_main
    @           0x4084a3  (unknown)
```

Hi, how can I fix this error?