haotianteng / Chiron

A basecaller for Oxford Nanopore Technologies' sequencers

Unable to basecall with non-default model #19

Closed paru16 closed 6 years ago

paru16 commented 6 years ago

Hello,

I’ve been trying to train Chiron (v0.3, GPU) with a custom dataset. I’ve created a model using chiron_rcnn_train.py (without any apparent issues), but basecalling against this model has been failing with the following error:

NotFoundError (see above for traceback): Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint [[Node: save/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_2/tensor_names, save/RestoreV2_2/shape_and_slices)]] [[Node: save/RestoreV2_35/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_195_save/RestoreV2_35", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

I’ve examined the TF checkpoint file for the created model (with inspect_checkpoint.py) and it’s at odds with the provided DNA_default model. It’s missing the following fields:

BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [356,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases/Adam (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases/Adam_1 (DT_FLOAT) [400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights/Adam (DT_FLOAT) [300,400]
BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights/Adam_1 (DT_FLOAT) [300,400]
rnn_fnn_layer/bias (DT_FLOAT) [100]
rnn_fnn_layer/bias/Adam (DT_FLOAT) [100]
rnn_fnn_layer/bias/Adam_1 (DT_FLOAT) [100]
rnn_fnn_layer/bias_class (DT_FLOAT) [5]
rnn_fnn_layer/bias_class/Adam (DT_FLOAT) [5]
rnn_fnn_layer/bias_class/Adam_1 (DT_FLOAT) [5]
rnn_fnn_layer/weights (DT_FLOAT) [2,100]
rnn_fnn_layer/weights/Adam (DT_FLOAT) [2,100]
rnn_fnn_layer/weights/Adam_1 (DT_FLOAT) [2,100]
rnn_fnn_layer/weights_class (DT_FLOAT) [100,5]
rnn_fnn_layer/weights_class/Adam (DT_FLOAT) [100,5]
rnn_fnn_layer/weights_class/Adam_1 (DT_FLOAT) [100,5]

Instead containing:

global_step (DT_INT32) []
logit_bias (DT_FLOAT) [5]
logit_bias/Adam (DT_FLOAT) [5]
logit_bias/Adam_1 (DT_FLOAT) [5]
logit_weights (DT_FLOAT) [256,5]
logit_weights/Adam (DT_FLOAT) [256,5]
logit_weights/Adam_1 (DT_FLOAT) [256,5]

Any advice would be appreciated.

Thanks!
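
For anyone repeating this comparison, the two listings can be diffed mechanically rather than by eye. The following is a minimal sketch and not part of Chiron: it assumes the listings are plain text in the `name (DTYPE) [shape]` form shown above, and the helper names (`parse_listing`, `model_variables`, `missing_keys`) are made up for illustration. Optimizer slots ending in `/Adam` or `/Adam_1` are skipped, because the restore error only names model variables.

```python
import re

# Matches one entry of the listing shown above, e.g.
# "rnn_fnn_layer/bias (DT_FLOAT) [100]" (the space before "[" is optional).
ENTRY = re.compile(r"(\S+) \((DT_[A-Z0-9]+)\)\s*\[([\d,]*)\]")


def parse_listing(text):
    """Return {variable_name: shape_tuple} for every entry found in text."""
    variables = {}
    for name, _dtype, shape in ENTRY.findall(text):
        variables[name] = tuple(int(d) for d in shape.split(",") if d)
    return variables


def model_variables(variables):
    """Drop optimizer slots (names ending in /Adam or /Adam_1)."""
    return {name: shape for name, shape in variables.items()
            if not name.endswith(("/Adam", "/Adam_1"))}


def missing_keys(expected, actual):
    """Names present in the expected listing but absent from the actual one."""
    return sorted(set(model_variables(expected)) - set(model_variables(actual)))


# Shortened, illustrative excerpts of the two listings.
default_listing = """
rnn_fnn_layer/bias (DT_FLOAT) [100]
rnn_fnn_layer/bias/Adam (DT_FLOAT) [100]
rnn_fnn_layer/weights_class (DT_FLOAT) [100,5]
"""
custom_listing = """
global_step (DT_INT32) []
logit_bias (DT_FLOAT) [5]
logit_weights (DT_FLOAT) [256,5]
"""

print(missing_keys(parse_listing(default_listing), parse_listing(custom_listing)))
```

Run on the full listings, the result should match the keys named in the `NotFoundError`, which suggests the two checkpoints were written by different graphs rather than by a corrupted save.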

haotianteng commented 6 years ago

What was the output when you ran chiron_rcnn_train.py? It usually takes about one day to train a usable model. How many steps did you train for? What is the error rate?

paru16 commented 6 years ago

Hello,

The final checkpoint file looks like this:

model_checkpoint_path: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/final.ckpt-20000"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19961"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19971"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19981"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/model.ckpt-19991"
all_model_checkpoint_paths: "/stornext/HPCScratch/home/chiron_bb12_cnp_model/bb12_cnp/final.ckpt-20000"

It took ~18 hours to run, and I’ve trained 20000 steps. I’m not sure exactly what you mean by the error rate, but I’ve attached the stdout file.

Thanks for the quick response!

chiron_output.txt https://github.com/haotianteng/Chiron/files/1735722/chiron_output.txt

haotianteng commented 6 years ago

From the output, I think the training succeeded. Can you change the checkpoint file to the following and try basecalling again?

model_checkpoint_path: "final.ckpt-20000"
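
If editing the file by hand is inconvenient, the path change can be scripted. This is a minimal sketch under stated assumptions, not something Chiron provides: the function names `relativize` and `rewrite_checkpoint_file` are hypothetical, the index file is assumed to be named `checkpoint` inside the model folder, and unlike the one-line file above it keeps every entry and only strips the directory part of each quoted path.

```python
import os
import re

# Lines look like: model_checkpoint_path: "/abs/path/final.ckpt-20000"
PATH_LINE = re.compile(r'^(\s*\w+:\s*)"(.*)"\s*$')


def relativize(text):
    """Replace every quoted path in the checkpoint index text by its basename."""
    out = []
    for line in text.splitlines():
        match = PATH_LINE.match(line)
        if match:
            line = '%s"%s"' % (match.group(1), os.path.basename(match.group(2)))
        out.append(line)
    return "\n".join(out) + "\n"


def rewrite_checkpoint_file(model_dir):
    """Rewrite <model_dir>/checkpoint in place, keeping a .bak copy first."""
    path = os.path.join(model_dir, "checkpoint")
    with open(path) as handle:
        original = handle.read()
    with open(path + ".bak", "w") as handle:
        handle.write(original)
    with open(path, "w") as handle:
        handle.write(relativize(original))
```

Calling `rewrite_checkpoint_file` on the model folder leaves the original index in `checkpoint.bak`, so the change is easy to undo.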

Also, which TensorFlow version are you using?

Teng


paru16 commented 6 years ago

Hello,

I've been using tensorflow-gpu version 1.0.1. I haven't had any luck changing the checkpoint file; I'm still getting the same sort of error. The traceback is below:

`W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_2/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_1/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/fw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key BDLSTM_rnn/cell_0/bidirectional_rnn/bw/lstm_cell/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/bias_class not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/weights not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/bias not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key rnn_fnn_layer/weights_class not found in checkpoint
Traceback (most recent call last):
  File "/home/Chiron-0.3/chiron/entry.py", line 66, in <module>
    main()
  File "/home/Chiron-0.3/chiron/entry.py", line 63, in main
    args.func(args)
  File "/home/Chiron-0.3/chiron/entry.py", line 19, in evaluation
    chiron_eval.run(args)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 188, in run
    time_dict=unix_time(evaluation)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/utils/unix_time.py", line 23, in unix_time
    function(*args, **kwargs)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 119, in evaluation
    saver.restore(sess,tf.train.latest_checkpoint(FLAGS.model))
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
  [[Node: save/RestoreV2_11 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_11/tensor_names, save/RestoreV2_11/shape_and_slices)]]

Caused by op u'save/RestoreV2_11', defined at:
  File "/home/Chiron-0.3/chiron/entry.py", line 66, in <module>
    main()
  File "/home/Chiron-0.3/chiron/entry.py", line 63, in main
    args.func(args)
  File "/home/Chiron-0.3/chiron/entry.py", line 19, in evaluation
    chiron_eval.run(args)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 188, in run
    time_dict=unix_time(evaluation)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/utils/unix_time.py", line 23, in unix_time
    function(*args, **kwargs)
  File "/stornext/HPCScratch/home/Chiron-0.3/chiron/chiron_eval.py", line 118, in evaluation
    saver = tf.train.Saver()
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key BDLSTM_rnn/cell_2/bidirectional_rnn/fw/lstm_cell/weights not found in checkpoint
  [[Node: save/RestoreV2_11 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_11/tensor_names, save/RestoreV2_11/shape_and_slices)]]`

Thanks for your help!

haotianteng commented 6 years ago

Which command did you use to basecall with the customized model? Would you mind attaching your saved model so I can check it? There should be three files for each saved step in your model folder, for example:

final.ckpt-100000.data-00000-of-00001
final.ckpt-100000.index
final.ckpt-100000.meta
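
A quick way to confirm that all three files are present for a given step is sketched below. It is only an illustration: `missing_checkpoint_files` is a hypothetical helper, and the single-shard suffix `.data-00000-of-00001` is taken from the file names above, so a checkpoint saved in several shards would need a different check.

```python
import os

# Suffixes taken from the three file names listed above (single-shard save).
SUFFIXES = (".data-00000-of-00001", ".index", ".meta")


def missing_checkpoint_files(model_dir, prefix):
    """Return the expected files for `prefix` that are absent from model_dir."""
    expected = [prefix + suffix for suffix in SUFFIXES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]
```

For the model discussed here, the call would be along the lines of `missing_checkpoint_files(model_dir, "final.ckpt-20000")`, where `model_dir` is the training output folder; an empty list means nothing is missing.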

haotianteng commented 6 years ago

Problem found: this is caused by an inconsistency between chiron_eval.py and chiron_rcnn_train.py. As a workaround, go to chiron_eval.py, comment out Line#23 and uncomment Line#25. This will make Chiron basecall using only the CNN with your current model.

Alternatively, go to chiron_rcnn_train.py, comment out Line#25 and uncomment Line#23, then train your model again. After that it should work.

I will release a fix for this bug, but the manual fix above should work.
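
To tell which of the two cases a given checkpoint falls into, the variable names are enough. The sketch below is an assumption based only on the names quoted in this thread (`BDLSTM_rnn/...` for the RNN layers, `logit_weights` and `logit_bias` for the CNN-only output layer); `head_type` is a hypothetical helper and not part of Chiron.

```python
def head_type(variable_names):
    """Guess which graph wrote a checkpoint from its variable names.

    The rules come from the listings earlier in this thread and are
    an illustration, not an official Chiron check.
    """
    names = set(variable_names)
    if any(name.startswith("BDLSTM_rnn/") for name in names):
        return "cnn+rnn"   # matches what the basecaller tried to restore
    if "logit_weights" in names and "logit_bias" in names:
        return "cnn-only"  # matches the custom model in this issue
    return "unknown"


print(head_type(["global_step", "logit_bias", "logit_weights"]))
```

A "cnn-only" result corresponds to the first workaround (edit chiron_eval.py); retraining after the second workaround should produce a checkpoint that reports "cnn+rnn".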

paru16 commented 6 years ago

Yes, that works if 'from cnn import getcnnlogit' is also added to chiron_eval.py.

Thanks so much for your help!

haotianteng commented 6 years ago

getcnnlogits gets the logits directly from the CNN, so the RNN is not used. For better performance, I suggest using the second solution.