What output do you get when running chiron_rcnn_train.py? It usually takes about a day to train a usable model. How many steps have you trained, and what is the error rate?
Hello,
The final checkpoint file looks like this:
model_checkpoint_path: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/final.ckpt-20000"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19961"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19971"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19981"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19991"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/final.ckpt-20000"
It took ~18 hours to run, and I’ve trained 20000 steps. I’m not sure exactly what you mean by the error rate, but I’ve attached the stdout file.
Thanks for the quick response!
From the output, I think the training succeeded. Can you change the checkpoint file to the following and try basecalling again?
model_checkpoint_path: "final.ckpt-20000"
Also, which TensorFlow version are you using?
Teng
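The suggestion above swaps an absolute checkpoint path for a relative one. A minimal sketch of applying that change to every entry in a `checkpoint` file, assuming the `key: "path"` layout quoted earlier in the thread. The helper name `relativize_checkpoint` is hypothetical and is not part of Chiron or TensorFlow, and rewriting the `all_model_checkpoint_paths` entries as well is an assumption beyond what was suggested.

```python
import posixpath
import re

# Matches entries such as: model_checkpoint_path: "/abs/dir/final.ckpt-20000"
_ENTRY = re.compile(r'^(\s*\w+:\s*)"(.*)"\s*$')


def relativize_checkpoint(text):
    """Return checkpoint-file text with each quoted path cut down to its file name.

    Hypothetical helper: it only rewrites text and never touches the model files.
    """
    out = []
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            # Keep the key and spacing, replace the path with its basename.
            line = '%s"%s"' % (match.group(1), posixpath.basename(match.group(2)))
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = (
        'model_checkpoint_path: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/final.ckpt-20000"\n'
        'all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19991"\n'
    )
    print(relativize_checkpoint(example))
```

Relative entries only resolve if the `checkpoint` file sits in the same folder as the model files, so this is worth trying on a copy of the model folder first.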
chiron_output.txt https://github.com/haotianteng/Chiron/files/1735722/chiron_output.txt
Hello,
I've been using tensorflow-gpu version 1.0.1. I haven't had any luck changing the checkpoint file; I'm still getting the same sort of error. The traceback is below:
`W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/bias_class not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/bias not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/weights_class not found in checkpoint
Traceback (most recent call last):
File "/home/Chiron-0.3/chiron/entry.py", line 66, in
File "/home/Chiron-0.3/chiron/entry.py", line 63, in main
args.func(args)
File "/home/Chiron-0.3/chiron/entry.py", line 19, in evaluation
chiron_eval.run(args)
File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 188, in run
time_dict=unix_time(evaluation)
File "/stornext/HPCScratch/home/Chiron-0.3/chiron/utils/unix_time.py", line 23, in unix_time
function(*args, **kwargs)
File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 119, in evaluation
saver.restore(sess,tf.train.latest_checkpoint(FLAGS.model))
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
[[Node: save/RestoreV2_11 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_11/tensor_names, save/RestoreV2_11/shape_and_slices)]]
Caused by op u'save/RestoreV2_11', defined at:
File "/home/Chiron-0.3/chiron/entry.py", line 66, in
NotFoundError (see above for traceback): Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint [[Node: save/RestoreV2_11 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_11/tensor_names, save/RestoreV2_11/shape_and_slices)]]`
Thanks for your help!
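Each warning in the traceback above names one variable that the basecalling graph expects but the checkpoint lacks. A rough sketch of collecting those names from a saved log and grouping them by top-level scope, which makes it easier to see which part of the network is absent. The function names `missing_keys` and `top_level_scopes` are hypothetical, and the pattern assumes the warning wording shown above.

```python
import re

# Wording taken from the warnings above; other TensorFlow versions may phrase it differently.
_MISSING = re.compile(r"Not found: Key (\S+) not found in checkpoint")


def missing_keys(log_text):
    """Return the distinct variable names reported missing, in first-seen order."""
    seen = []
    for key in _MISSING.findall(log_text):
        if key not in seen:
            seen.append(key)
    return seen


def top_level_scopes(keys):
    """Return the sorted set of leading scope names, e.g. 'BDLSTM_rnn'."""
    return sorted({key.split("/")[0] for key in keys})


if __name__ == "__main__":
    log = (
        "W tensorflow/core/framework/op_kernel.cc:993] Not found: Key "
        "BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint\n"
        "W tensorflow/core/framework/op_kernel.cc:993] Not found: Key "
        "rnn_fnn_layer/bias_class not found in checkpoint\n"
    )
    print(top_level_scopes(missing_keys(log)))
```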
What command did you use to basecall with the customized model? And would you mind attaching your saved model for me to check? There should be three files for each step in your model folder:
final.ckpt-100000.data-00000-of-00001
final.ckpt-100000.index
final.ckpt-100000.meta
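A quick way to check for those three files is sketched below. It assumes a single-shard data file named as in the comment above; the helper name `missing_checkpoint_files` is hypothetical and not part of Chiron.

```python
import os

# Suffixes named in the comment above. A saver that writes several shards
# (e.g. 00000-of-00002) would need a different list; this sketch ignores that case.
EXPECTED_SUFFIXES = (".data-00000-of-00001", ".index", ".meta")


def missing_checkpoint_files(model_dir, prefix):
    """Return the expected file names for `prefix` that are absent from `model_dir`."""
    missing = []
    for suffix in EXPECTED_SUFFIXES:
        name = prefix + suffix
        if not os.path.isfile(os.path.join(model_dir, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder folder and prefix; substitute your own model folder.
    print(missing_checkpoint_files(".", "final.ckpt-20000"))
```

An empty list means all three files are present for that step.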
Problem found: this is due to an inconsistency between chiron_eval.py and chiron_rcnn_train.py. In chiron_eval.py, comment out Line#23 and uncomment Line#25. This will make Chiron basecall using only the CNN with your current model.
Alternatively, in chiron_rcnn_train.py, comment out Line#25 and uncomment Line#23, then train your model again. Basecalling should then work.
I will release a fix for this bug, but the manual fix above should work.
Yes, that works if 'from cnn import getcnnlogit' is also added to chiron_eval.py.
Thanks so much for your help!
getcnnlogit gets the logits directly from the CNN, so the RNN is not used. For better performance, I suggest using the second solution.
Hello,
I’ve been trying to train Chiron (v0.3, GPU) with a custom dataset. I’ve created a model using chiron_rcnn_train.py (without any apparent issues), but basecalling against this model has been failing with the following error:
NotFoundError (see above for traceback): Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint [[Node: save/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_2/tensor_names, save/RestoreV2_2/shape_and_slices)]] [[Node: save/RestoreV2_35/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_195_save/RestoreV2_35", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I’ve examined the TF checkpoint file for the created model (with inspect_checkpoint.py) and it’s at odds with the provided DNA_default model. It’s missing the following fields:
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
rnn_fnn_layer/bias (DT_FLOAT) [100]
rnn_fnn_layer/bias/Adam (DT_FLOAT) [100]
rnn_fnn_layer/bias/Adam_1 (DT_FLOAT) [100]
rnn_fnn_layer/bias_class (DT_FLOAT) [5]
rnn_fnn_layer/bias_class/Adam (DT_FLOAT) [5]
rnn_fnn_layer/bias_class/Adam_1 (DT_FLOAT) [5]
rnn_fnn_layer/weights (DT_FLOAT) [2,100]
rnn_fnn_layer/weights/Adam (DT_FLOAT) [2,100]
rnn_fnn_layer/weights/Adam_1 (DT_FLOAT) [2,100]
rnn_fnn_layer/weights_class (DT_FLOAT) [100,5]
rnn_fnn_layer/weights_class/Adam (DT_FLOAT) [100,5]
rnn_fnn_layer/weights_class/Adam_1 (DT_FLOAT) [100,5]
Instead, it contains:
global_step (DT_INT32) []
logit_bias (DT_FLOAT) [5]
logit_bias/Adam (DT_FLOAT) [5]
logit_bias/Adam_1 (DT_FLOAT) [5]
logit_weights (DT_FLOAT) [256,5]
logit_weights/Adam (DT_FLOAT) [256,5]
logit_weights/Adam_1 (DT_FLOAT) [256,5]
Any advice would be appreciated. Thanks!
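The comparison described in this post can be automated once both listings are saved as text. A minimal sketch, assuming each variable appears as `name (DTYPE) [shape]` as in the inspect_checkpoint.py output quoted above. The helper names `parse_listing` and `diff_listings` are hypothetical, and the two listings in the example are toy data built from names in this thread, not the real contents of either model.

```python
import re

# One variable per entry, e.g. "logit_bias/Adam (DT_FLOAT) [5]".
_VAR = re.compile(r"(\S+)\s*\((DT_\w+)\)\s*\[([\d,]*)\]")


def parse_listing(text):
    """Map variable name -> (dtype, shape tuple) from an inspect_checkpoint-style listing."""
    variables = {}
    for name, dtype, dims in _VAR.findall(text):
        shape = tuple(int(d) for d in dims.split(",") if d)
        variables[name] = (dtype, shape)
    return variables


def diff_listings(reference, candidate):
    """Return (names only in reference, names only in candidate), both sorted."""
    ref = parse_listing(reference)
    cand = parse_listing(candidate)
    return sorted(set(ref) - set(cand)), sorted(set(cand) - set(ref))


if __name__ == "__main__":
    # Toy listings for illustration only.
    reference = "rnn_fnn_layer/bias (DT_FLOAT) [100]\nglobal_step (DT_INT32) []\n"
    candidate = "logit_bias (DT_FLOAT) [5]\nglobal_step (DT_INT32) []\n"
    only_ref, only_cand = diff_listings(reference, candidate)
    print("missing from candidate:", only_ref)
    print("extra in candidate:", only_cand)
```

Names that appear on only one side point to layers that were built in one graph but not the other.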