google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Issue with TPU Node v3-8 in call variants step #537

Closed shishir-reddy closed 2 years ago

shishir-reddy commented 2 years ago

I am trying to use deepvariant to call variants using a TPU Node v3-8, but I am running into a persistent issue.

Here is the command I am using:

docker run \
    -v `pwd`:`pwd` -w `pwd` \
    google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --call_variants_extra_args  use_tpu=true,tpu_name="variantcaller-node1",tpu_zone="europe-west4-a" \
    --model_type=WGS \
    --ref="input/data/${REF}" \
    --reads="input/data/${BAM}" \
    --output_vcf="output/${OUTPUT_VCF}" \
    --output_gvcf="output/${OUTPUT_GVCF}" \
    --regions chr20 \
    --num_shards=$(nproc) \
    --intermediate_results_dir /output/intermediate_results_dir
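
For readers unfamiliar with --call_variants_extra_args: the logged call_variants command in this report shows the comma-separated key=value pairs arriving as individual flags, with use_tpu=true becoming a bare --use_tpu. A rough Python sketch of that kind of expansion follows. extra_args_to_flags is a hypothetical helper written for illustration, not the code run_deepvariant actually uses; the alphabetical ordering and the --no prefix for false values are assumptions.

```python
from typing import Dict, List


def extra_args_to_flags(extra_args: str) -> List[str]:
    """Expand 'key1=value1,key2=value2' into command-line tokens.

    Hypothetical illustration only; not the run_deepvariant implementation.
    """
    parsed: Dict[str, str] = {}
    for item in extra_args.split(","):
        # Split on the first '=' only, so values may contain '='.
        key, _, value = item.partition("=")
        parsed[key.strip()] = value.strip()

    flags: List[str] = []
    # Assumption: emit flags in alphabetical order of their names.
    for key, value in sorted(parsed.items()):
        if value.lower() == "true":
            flags.append("--" + key)      # boolean flag switched on
        elif value.lower() == "false":
            flags.append("--no" + key)    # assumed absl-style negation
        else:
            flags.extend(["--" + key, value])
    return flags


if __name__ == "__main__":
    # The shell removes the double quotes before the program sees the value.
    print(extra_args_to_flags(
        "use_tpu=true,tpu_name=variantcaller-node1,tpu_zone=europe-west4-a"))
```

Because the shell strips the double quotes around the TPU name and zone, the values reach the program unquoted.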

However, I am seeing the following error in the call variants step.

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/make_examples.tfrecord@96.gz" --checkpoint "/opt/models/wgs/model.ckpt" --openvino_model_dir "/output/intermediate_results_dir" --tpu_name "variantcaller-node1" --tpu_zone "europe-west4-a" --use_tpu

I0524 21:18:26.485428 140032543119168 transport.py:157] Attempting refresh to obtain initial access_token
I0524 21:18:26.576728 140032543119168 call_variants.py:336] Shape of input examples: [100, 221, 6]
I0524 21:18:26.579230 140032543119168 call_variants.py:361] /opt/models/wgs/model.ckpt.input_shape has the correct shape: [100, 221, 6].
2022-05-24 21:18:26.581705: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-24 21:18:26.586196: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-05-24 21:18:26.587127: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000160000 Hz
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp_f348kd0
W0524 21:18:26.619681 140032543119168 estimator.py:1846] Using temporary folder as model directory: /tmp/tmp_f348kd0
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp_f348kd0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 100000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.73.74.226:8470', '_evaluation_master': 'grpc://10.73.74.226:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
I0524 21:18:26.620151 140032543119168 estimator.py:191] Using config: {'_model_dir': '/tmp/tmp_f348kd0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 100000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.73.74.226:8470', '_evaluation_master': 'grpc://10.73.74.226:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0524 21:18:26.620373 140032543119168 tpu_context.py:271] _TPUContext: eval_on_tpu True
I0524 21:18:26.620768 140032543119168 call_variants.py:426] Writing calls to /output/intermediate_results_dir/call_variants_output.tfrecord.gz
INFO:tensorflow:Querying Tensorflow master (grpc://10.73.74.226:8470) for TPU system metadata.
I0524 21:18:26.625535 140032543119168 tpu_system_metadata.py:90] Querying Tensorflow master (grpc://10.73.74.226:8470) for TPU system metadata.
2022-05-24 21:18:26.626490: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Found TPU system:
I0524 21:18:26.631762 140032543119168 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0524 21:18:26.631872 140032543119168 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0524 21:18:26.631940 140032543119168 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0524 21:18:26.631998 140032543119168 tpu_system_metadata.py:162] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 3314463783741359823)
I0524 21:18:26.632062 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 3314463783741359823)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, -1873770143808342957)
I0524 21:18:26.632296 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, -1873770143808342957)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, -3891821674854936774)
I0524 21:18:26.632360 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, -3891821674854936774)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -6041584165456864718)
I0524 21:18:26.632421 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -6041584165456864718)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -4899456949080638211)
I0524 21:18:26.632479 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -4899456949080638211)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 6180324062742322030)
I0524 21:18:26.632545 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 6180324062742322030)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, -2652458924365639691)
I0524 21:18:26.632611 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, -2652458924365639691)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 3158275143315040778)
I0524 21:18:26.632669 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 3158275143315040778)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, -4822366763137283978)
I0524 21:18:26.632792 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, -4822366763137283978)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 2291186206241199287)
I0524 21:18:26.632860 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 2291186206241199287)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 7884439564287565365)
I0524 21:18:26.632941 140032543119168 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 7884439564287565365)
INFO:tensorflow:Calling model_fn.
I0524 21:18:26.633588 140032543119168 estimator.py:1162] Calling model_fn.
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py:1692: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
INFO:tensorflow:Done calling model_fn.
I0524 21:18:32.742463 140032543119168 estimator.py:1164] Done calling model_fn.
INFO:tensorflow:TPU job name tpu_worker
I0524 21:18:33.019782 140032543119168 tpu_estimator.py:514] TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
I0524 21:18:33.525068 140032543119168 monitored_session.py:247] Graph was finalized.
INFO:tensorflow:Restoring parameters from /opt/models/wgs/model.ckpt
I0524 21:18:33.525994 140032543119168 saver.py:1298] Restoring parameters from /opt/models/wgs/model.ckpt
INFO:tensorflow:prediction_loop marked as finished
I0524 21:18:34.251420 140032543119168 error_handling.py:115] prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0524 21:18:34.251592 140032543119168 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on /opt/models/wgs/model.ckpt: UNIMPLEMENTED: File system scheme '[local]' not implemented (file: '/opt/models/wgs/model.ckpt')
         [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1303, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on /opt/models/wgs/model.ckpt: UNIMPLEMENTED: File system scheme '[local]' not implemented (file: '/opt/models/wgs/model.ckpt')
         [[node save_1/RestoreV2 (defined at usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py:623) ]]

Original stack trace for 'save_1/RestoreV2':
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 661, in create_session
    self._scaffold.finalize()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 236, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 607, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 509, in _build_internal
    restore_op = self._AddShardedRestoreOps(filename_tensor, per_device,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 383, in _AddShardedRestoreOps
    self._AddRestoreOps(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3153, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/tmp/Bazel.runfiles_o0nxhusg/runfiles/six_archive/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 662, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 314, in prepare_session
    sess, is_loaded_from_checkpoint = self._restore_checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 233, in _restore_checkpoint
    _restore_checkpoint_and_maybe_run_saved_model_initializers(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 71, in _restore_checkpoint_and_maybe_run_saved_model_initializers
    saver.restore(sess, path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1339, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on /opt/models/wgs/model.ckpt: UNIMPLEMENTED: File system scheme '[local]' not implemented (file: '/opt/models/wgs/model.ckpt')
         [[node save_1/RestoreV2 (defined at usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py:623) ]]

Original stack trace for 'save_1/RestoreV2':
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "tmp/Bazel.runfiles_o0nxhusg/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 661, in create_session
    self._scaffold.finalize()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 236, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 607, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 509, in _build_internal
    restore_op = self._AddShardedRestoreOps(filename_tensor, per_device,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 383, in _AddShardedRestoreOps
    self._AddRestoreOps(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

real    0m10.757s
user    0m13.496s
sys     0m5.144s

This same command works fine without using TPUs on this system, and it looks like the TPU node is being recognized by deepvariant. Is there something I'm missing for call_variants?

akolesnikov commented 2 years ago
Failed to get matching files on /opt/models/wgs/model.ckpt: UNIMPLEMENTED: File system scheme '[local]' not implemented (file: '/opt/models/wgs/model.ckpt')

Could you try saving the model in the cloud (the path should start with gs://)? It looks like the model is not accessible from the TPU host.
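
The error is reported "From /job:tpu_worker", so the checkpoint path is being opened on the TPU host, which does not have the container's local files. As a minimal sketch of the rule of thumb suggested here, the hypothetical helper below only inspects the path scheme. It does not check that the bucket exists or that the TPU has permission to read it, and the name reachable_from_tpu_worker is an assumption made for this example.

```python
from urllib.parse import urlparse


def reachable_from_tpu_worker(checkpoint_path: str) -> bool:
    """Return True if the path uses the gs:// scheme.

    Hypothetical check for illustration: a path without a scheme refers to
    the local file system of whichever machine opens it, and a remote TPU
    worker cannot see files stored inside the Docker container.
    """
    return urlparse(checkpoint_path).scheme == "gs"


if __name__ == "__main__":
    for path in (
        "/opt/models/wgs/model.ckpt",
        "gs://deepvariant/models/DeepVariant/1.3.0/"
        "DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt",
    ):
        print(path, reachable_from_tpu_worker(path))
```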

shishir-reddy commented 2 years ago

Sure, I am trying to use the default WGS model. After hosting the model on the cloud, how do I point deepvariant to it through the Docker solution?

This is what I see in the local docker container's models directory when running the image:

root@8368b35e9c34:/# ls /opt/models/wgs/
model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.input_shape  model.ckpt.meta

I am using the google/deepvariant:1.3.0 docker image. The same error occurs for me with the GPU version. Is there a different model expected for the TPU implementation?

akolesnikov commented 2 years ago

When you run ls /opt/models/wgs/ you see the local contents of the mounted directory, which is probably not accessible from the TPU host. Although we don't officially support running on TPU, there is a case study for an older version that shows how to run training on TPU here

In particular, it links to instructions on how to make a storage bucket accessible from Docker.

shishir-reddy commented 2 years ago

Thanks, this makes perfect sense! I did not realize that hosting the model in Google Cloud Storage was necessary for the TPU Node.

I am still having an issue pointing deepvariant to the model hosted in the cloud.

I have tried using a model in the deepvariant bucket with the following command and model: gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt.data-00000-of-00001

docker run \
    -v `pwd`:`pwd` -w `pwd` \
    google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --call_variants_extra_args  use_tpu=true,tpu_name="variantcaller-node1",tpu_zone="europe-west4-a" \
    --customized_model "gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt.data-00000-of-00001" \
    --model_type=WGS \
    --ref="input/data/${REF}" \
    --reads="input/data/${BAM}" \
    --output_vcf="output/${OUTPUT_VCF}" \
    --output_gvcf="output/${OUTPUT_GVCF}" \
    --regions chr20 \
    --num_shards=$(nproc) \
    --intermediate_results_dir /output/intermediate_results_dir

But I get the following error:

I0527 20:42:08.331003 139757477517120 run_deepvariant.py:341] Creating a directory for intermediate results in /output/intermediate_results_dir
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 493, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 467, in main
    commands_logfiles = create_all_commands_and_logfiles(intermediate_results_dir)
  File "/opt/deepvariant/bin/run_deepvariant.py", line 382, in create_all_commands_and_logfiles
    check_flags()
  File "/opt/deepvariant/bin/run_deepvariant.py", line 357, in check_flags
    raise RuntimeError('The model files {}* do not exist. Potentially '
RuntimeError: The model files gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt.data-00000-of-00001* do not exist. Potentially relevant issue: https://github.com/google/deepvariant/blob/r1.3/docs/FAQ.md#why-cant-it-find-one-of-the-input-files-eg-could-not-open

I also get the same error when hosting the model (renamed model.ckpt) in my personal GS bucket. I have made the storage bucket readable by all users, so the TPU should have access:

docker run \
    -v `pwd`:`pwd` -w `pwd` \
    google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --call_variants_extra_args  use_tpu=true,tpu_name="variantcaller-node1",tpu_zone="europe-west4-a" \
    --customized_model "gs://tpu-bwb/analysis-files/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt" \
    --model_type=WGS \
    --ref="input/data/${REF}" \
    --reads="input/data/${BAM}" \
    --output_vcf="output/${OUTPUT_VCF}" \
    --output_gvcf="output/${OUTPUT_GVCF}" \
    --regions chr20 \
    --num_shards=$(nproc) \
    --intermediate_results_dir /output/intermediate_results_dir

I0527 21:26:03.381308 140127359940416 run_deepvariant.py:341] Creating a directory for intermediate results in /output/intermediate_results_dir
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 493, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 467, in main
    commands_logfiles = create_all_commands_and_logfiles(intermediate_results_dir)
  File "/opt/deepvariant/bin/run_deepvariant.py", line 382, in create_all_commands_and_logfiles
    check_flags()
  File "/opt/deepvariant/bin/run_deepvariant.py", line 357, in check_flags
    raise RuntimeError('The model files {}* do not exist. Potentially '
RuntimeError: The model files gs://tpu-bwb/analysis-files/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt* do not exist. Potentially relevant issue: https://github.com/google/deepvariant/blob/r1.3/docs/FAQ.md#why-cant-it-find-one-of-the-input-files-eg-could-not-open

However, if I pass the shortened model name for the deepvariant bucket (model.ckpt instead of model.ckpt.data-00000-of-00001), the flag check passes and processing continues until it hits the earlier error again, because the checkpoint file does not actually exist under the name model.ckpt in the deepvariant bucket.

docker run \
    -v `pwd`:`pwd` -w `pwd` \
    google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --call_variants_extra_args  use_tpu=true,tpu_name="variantcaller-node1",tpu_zone="europe-west4-a" \
    --customized_model "gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt" \
    --model_type=WGS \
    --ref="input/data/${REF}" \
    --reads="input/data/${BAM}" \
    --output_vcf="output/${OUTPUT_VCF}" \
    --output_gvcf="output/${OUTPUT_GVCF}" \
    --regions chr20 \
    --num_shards=$(nproc) \
    --intermediate_results_dir /output/intermediate_results_dir

INFO:tensorflow:Done calling model_fn.
I0527 21:33:10.817516 139926144051008 estimator.py:1164] Done calling model_fn.
INFO:tensorflow:TPU job name tpu_worker
I0527 21:33:11.115715 139926144051008 tpu_estimator.py:514] TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
I0527 21:33:11.664746 139926144051008 monitored_session.py:247] Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt
I0527 21:33:11.801618 139926144051008 saver.py:1298] Restoring parameters from gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt
INFO:tensorflow:prediction_loop marked as finished
I0527 21:33:13.662127 139926144051008 error_handling.py:115] prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0527 21:33:13.662372 139926144051008 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt
         [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1303, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt
         [[node save_1/RestoreV2 (defined at usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py:623) ]]

Original stack trace for 'save_1/RestoreV2':
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 661, in create_session
    self._scaffold.finalize()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 236, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 607, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 509, in _build_internal
    restore_op = self._AddShardedRestoreOps(filename_tensor, per_device,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 383, in _AddShardedRestoreOps
    self._AddRestoreOps(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 69, in get_tensor
    return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1314, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1632, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 74, in get_tensor
    error_translator(e)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3153, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/tmp/Bazel.runfiles_2gnuyvf0/runfiles/six_archive/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 662, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 314, in prepare_session
    sess, is_loaded_from_checkpoint = self._restore_checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 233, in _restore_checkpoint
    _restore_checkpoint_and_maybe_run_saved_model_initializers(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/session_manager.py", line 71, in _restore_checkpoint_and_maybe_run_saved_model_initializers
    saver.restore(sess, path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 1319, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt
         [[node save_1/RestoreV2 (defined at usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py:623) ]]

Original stack trace for 'save_1/RestoreV2':
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 493, in <module>
    tf.compat.v1.app.run()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 474, in main
    call_variants(
  File "tmp/Bazel.runfiles_2gnuyvf0/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 433, in call_variants
    prediction = next(predictions)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3142, in predict
    for result in super(TPUEstimator, self).predict(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 623, in predict
    with tf.compat.v1.train.MonitoredSession(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1035, in __init__
    super(MonitoredSession, self).__init__(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 661, in create_session
    self._scaffold.finalize()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 236, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 607, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 509, in _build_internal
    restore_op = self._AddShardedRestoreOps(filename_tensor, per_device,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 383, in _AddShardedRestoreOps
    self._AddRestoreOps(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

Is there something simple I am missing here? Thanks for the support.

pichuan commented 2 years ago

@shishir-reddy

Try just:

    --customized_model "gs://deepvariant/models/DeepVariant/1.3.0/DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt" \

pichuan commented 2 years ago

Sorry, @akolesnikov pointed out that you tried both. I don't have an immediate answer to the second error then.

shishir-reddy commented 2 years ago

Hi, I just wanted to check whether there are any updates on this thread. Thanks!

akolesnikov commented 2 years ago

Hi,

Passing the shortened name (model.ckpt.data-00000-of-00001 -> model.ckpt) is the right way to specify the model. May I ask a more general question: why do you want to run inference on TPU? In general it is not advisable, because the TPU processes examples much faster than the infeed can supply them, so the input pipeline becomes the bottleneck.
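
For readers following along, here is a minimal sketch of why the shortened name is the one to pass. It assumes the usual TensorFlow checkpoint layout, where a prefix such as model.ckpt stands for several files (model.ckpt.index, model.ckpt.meta, and one or more model.ckpt.data-NNNNN-of-NNNNN shards). The helper `checkpoint_prefix` is hypothetical and is not part of DeepVariant or TensorFlow.

```python
import re

# Assumption: checkpoint files end in one of these suffixes appended to the prefix.
_CHECKPOINT_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path: str) -> str:
    """Strip a checkpoint file suffix, leaving the prefix to pass as the model path."""
    return _CHECKPOINT_SUFFIX.sub("", path)


# The file name from the bucket maps back to the prefix used in the working command.
print(checkpoint_prefix(
    "gs://deepvariant/models/DeepVariant/1.3.0/"
    "DeepVariant-inception_v3-1.3.0+data-wgs_standard/model.ckpt.data-00000-of-00001"
))
```

A path that is already a prefix is returned unchanged, so the helper is safe to apply to either form.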

shishir-reddy commented 2 years ago

I am just benchmarking TPU usage on DeepVariant to see whether there is a significant speedup compared to GPU. The call_variants step has flags for TPU, so I wanted to test with it. If TPU is not recommended for inference, I will switch over to training and try from there. Thanks!

shishir-reddy commented 2 years ago

Is there a solution to the second error that occurs when renaming (model.ckpt.data-00000-of-00001 -> model.ckpt), or is this not supported for TPU usage?

akolesnikov commented 2 years ago

Unfortunately, we don't officially support running on TPU at the moment. The way you ran it with the short model name looks correct. It could be an access control issue: the TPU host may not have read access to the bucket containing the model.
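
One way to narrow this down is to check which of the files behind a checkpoint prefix are visible to a given set of credentials. The sketch below is only an illustration: `missing_checkpoint_files` is a hypothetical helper, and the three suffixes are an assumption about how the checkpoint is laid out. The existence check is passed in as a function so the logic can be tried without any cloud access.

```python
from typing import Callable, List

# Assumption: the checkpoint at `prefix` is made up of these three files.
EXPECTED_SUFFIXES = (".data-00000-of-00001", ".index", ".meta")


def missing_checkpoint_files(prefix: str, exists: Callable[[str], bool]) -> List[str]:
    """Return the expected checkpoint files that `exists` reports as not found."""
    return [prefix + suffix for suffix in EXPECTED_SUFFIXES if not exists(prefix + suffix)]


# Stand-in for a bucket that only holds a file renamed to exactly "model.ckpt".
fake_bucket = {"gs://my-bucket/model.ckpt"}
print(missing_checkpoint_files("gs://my-bucket/model.ckpt", fake_bucket.__contains__))
```

With TensorFlow installed, `tf.io.gfile.exists` can be passed as `exists` to run the same check against a real gs:// path. Note that this only tests the credentials of the machine running the script. In the traceback above the restore fails on /job:tpu_worker, so the TPU side reads the checkpoint itself and may need its own read permission on the bucket.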