jianqingzheng / res_aligner_net

Code for "Residual Aligner-based Network (RAN): Motion-Aware Structure for Coarse-to-fine Discontinuous Deformable Registration" (Medical Image Analysis)
https://jianqingzheng.github.io/res_aligner_net/
Apache License 2.0

Cannot run main_infer.py in the demo, can anyone help me? #2

Open · mvoofan opened this issue 2 weeks ago

mvoofan commented 2 weeks ago

When I tried to run "python main_infer.py --model_name RAN4 --data_name unpaired_ct_abdomen" on my V100 card, I encountered an "OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key conv3d_33/bias not found in checkpoint" error while restoring the checkpoint. The problem may be caused by the network layers being named differently on different machines.

The full command-line output is listed below:

2024-08-30 15:40:48.346070: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
/opt/anaconda3/envs/ran/lib/python3.8/site-packages/scipy/__init__.py:143: UserWarning: A NumPy version >=1.19.5 and <1.27.0 is required for this version of SciPy (detected version 1.18.5)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
WARNING:tensorflow:From /opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
RAN4
[0 0 0]
/data2/rf/res_aligner_net_copy/data/unpaired_ct_abdomen/dataset
WARNING:tensorflow:From /data2/rf/res_aligner_net_copy/external/neuron/layers.py:170: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
(?, 48, 40, 64, 8) (?, 48, 40, 64, 16)
(?, 24, 20, 32, 8) (?, 48, 40, 64, 16)
(?, 12, 10, 16, 16) (?, 48, 40, 64, 16)
(?, 6, 5, 8, 32) (?, 48, 40, 64, 16)
2024-08-30 15:40:52.147142: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2024-08-30 15:40:52.384947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.74GiB deviceMemoryBandwidth: 836.37GiB/s
2024-08-30 15:40:52.385033: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2024-08-30 15:40:52.387390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2024-08-30 15:40:52.389578: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-08-30 15:40:52.389981: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-08-30 15:40:52.392277: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-08-30 15:40:52.393630: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2024-08-30 15:40:52.398720: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2024-08-30 15:40:52.399351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2024-08-30 15:40:52.400357: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-30 15:40:52.430223: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2500000000 Hz
2024-08-30 15:40:52.437139: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x477e9d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-08-30 15:40:52.437198: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2024-08-30 15:40:52.525244: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x475e990 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-08-30 15:40:52.525322: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-08-30 15:40:52.526402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.74GiB deviceMemoryBandwidth: 836.37GiB/s
2024-08-30 15:40:52.526490: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2024-08-30 15:40:52.526550: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2024-08-30 15:40:52.526590: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-08-30 15:40:52.526627: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-08-30 15:40:52.526665: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-08-30 15:40:52.526702: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2024-08-30 15:40:52.526740: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2024-08-30 15:40:52.527895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2024-08-30 15:40:52.527977: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2024-08-30 15:40:53.195412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-08-30 15:40:53.195455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2024-08-30 15:40:53.195464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2024-08-30 15:40:53.196716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30125 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2024-08-30 15:42:28.893544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.74GiB deviceMemoryBandwidth: 836.37GiB/s
2024-08-30 15:42:28.893625: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2024-08-30 15:42:28.893697: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2024-08-30 15:42:28.893729: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-08-30 15:42:28.893752: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-08-30 15:42:28.893772: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-08-30 15:42:28.893794: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2024-08-30 15:42:28.893819: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2024-08-30 15:42:28.894349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2024-08-30 15:42:28.894396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-08-30 15:42:28.894406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2024-08-30 15:42:28.894415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2024-08-30 15:42:28.894992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30125 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
./models/unpaired_ct_abdomen/unpaired_ct_abdomen-RAN4/model_3.tf
2024-08-30 15:42:30.034069: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key conv3d_33/bias not found in checkpoint
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key conv3d_33/bias not found in checkpoint
     [[{{node save/RestoreV2}}]]
     [[save/RestoreV2/_301]]
  (1) Not found: Key conv3d_33/bias not found in checkpoint
     [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1298, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1180, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key conv3d_33/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /data2/rf/res_aligner_net_copy/infer_tfkeras.py:168) ]]
     [[save/RestoreV2/_301]]
  (1) Not found: Key conv3d_33/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /data2/rf/res_aligner_net_copy/infer_tfkeras.py:168) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "main_infer.py", line 74, in <module>
    infer(net_core=net_core,model_path=model_path,crop_sz=crop_sz,pair_type="unpaired",rescale_factor=rescale_factor,rescale_factor_label=rescale_factor_label,use_lab=use_lab,test_path=test_paths,model_name=model_name,int_range=int_range)
  File "/data2/rf/res_aligner_net_copy/infer_tfkeras.py", line 168, in infer
    saver = tf.train.Saver(max_to_keep=1)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 515, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3477, in _create_op_internal
    ret = Operation(
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 69, in get_tensor
    return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1309, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1627, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 74, in get_tensor
    error_translator(e)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_infer.py", line 74, in <module>
    infer(net_core=net_core,model_path=model_path,crop_sz=crop_sz,pair_type="unpaired",rescale_factor=rescale_factor,rescale_factor_label=rescale_factor_label,use_lab=use_lab,test_path=test_paths,model_name=model_name,int_range=int_range)
  File "/data2/rf/res_aligner_net_copy/infer_tfkeras.py", line 176, in infer
    saver.restore(sess, save_path, )
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 1314, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Not found: Key conv3d_33/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /data2/rf/res_aligner_net_copy/infer_tfkeras.py:168) ]]
     [[save/RestoreV2/_301]]
  (1) Not found: Key conv3d_33/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /data2/rf/res_aligner_net_copy/infer_tfkeras.py:168) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "main_infer.py", line 74, in <module>
    infer(net_core=net_core,model_path=model_path,crop_sz=crop_sz,pair_type="unpaired",rescale_factor=rescale_factor,rescale_factor_label=rescale_factor_label,use_lab=use_lab,test_path=test_paths,model_name=model_name,int_range=int_range)
  File "/data2/rf/res_aligner_net_copy/infer_tfkeras.py", line 168, in infer
    saver = tf.train.Saver(max_to_keep=1)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 515, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3477, in _create_op_internal
    ret = Operation(
  File "/opt/anaconda3/envs/ran/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
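For reference, the layer-naming hypothesis can be checked by listing the keys actually stored in the checkpoint and comparing them with the variable names of the freshly built graph. The following is only a diagnostic sketch: the checkpoint path is the one printed in the log above, and the second half assumes the RAN4 graph has already been constructed (in TF1-style graph mode, as in infer_tfkeras.py).

import tensorflow as tf

ckpt_path = "./models/unpaired_ct_abdomen/unpaired_ct_abdomen-RAN4/model_3.tf"

# Keys/shapes saved in the checkpoint file (this part runs on its own, given the file exists).
ckpt_vars = {name for name, _ in tf.train.list_variables(ckpt_path)}
print(len(ckpt_vars), "variables stored in the checkpoint")

# Names expected by the graph built on this machine (requires the RAN4 model to
# have been constructed first, in graph mode as used by infer_tfkeras.py).
graph_vars = {v.op.name for v in tf.compat.v1.global_variables()}

print("in graph but not in checkpoint:", sorted(graph_vars - ckpt_vars))
print("in checkpoint but not in graph:", sorted(ckpt_vars - graph_vars))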

jianqingzheng commented 2 weeks ago

Hi mvoofan, thanks for your interest in this work. This error looks weird. If you only need the inference function of this model, you could also use the notebook or open it in Colab.

mvoofan commented 1 week ago

Thanks for your help, but the inference code in Colab does not run either. I found that I could train the model with "model_id=1" and run inference with the generated checkpoint on my server. However, the results are not satisfactory because of the simple setting used for "model_id=1", so I wonder whether you could kindly share the training code for "model_id=3".

jianqingzheng commented 1 week ago

Hi mvoofan, you're right, the code in Colab doesn't work anymore either, but the reason seems to be different: the updated version of tensorflow-keras in Colab no longer supports my current code (I'll upgrade it later). However, I retried the code on my own machine and it still works. model_id=3 is the version I trained earlier, using the code from before I reorganized it (the code was reorganized because it had become too messy). I will also check the reorganized code against the original to see whether I made any unexpected changes to the model.

In case you want to train a model yourself: model_id=1 is a model pretrained on synthetic data and is not sufficient for registration on real data. You can complete the training of model_id=2 (starting from model_id=1) to reach the registration performance described in the paper (model_id=3).
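As a rough illustration of that staged setup, the snippet below restores the synthetic-data checkpoint before the real-data stage and saves the result afterwards. It is a sketch only: it assumes the RAN4 graph/variables already exist in the default graph (as in the training script), and the model_1.tf / model_2.tf names are assumed by analogy with the model_3.tf path printed in the log.

import tensorflow as tf

# "Training model_id=2 based on model_id=1": restore the pretrained checkpoint,
# run the real-data training stage, then save the new checkpoint.
model_dir = "./models/unpaired_ct_abdomen/unpaired_ct_abdomen-RAN4"
saver = tf.compat.v1.train.Saver(max_to_keep=1)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    saver.restore(sess, model_dir + "/model_1.tf")  # start from the pretrained model_id=1
    # ... stage-2 training loop on the real unpaired abdominal CT data goes here ...
    saver.save(sess, model_dir + "/model_2.tf")     # resulting checkpoint = model_id=2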

mvoofan commented 1 week ago

Thanks for your help. I am trying to train the model on my local server. I set np_epoches to [500, 1000] in "main_train.py" and changed the data generator for train_stage=2 to "real_data_generator(train_paths, crop_sz=crop_sz, rescale_factor=1.0, batch_size=batch_size, int_range=int_range)". With these changes it seems feasible to train the model and obtain the model_id=3 checkpoint locally. However, each epoch takes about one hour on my V100 card, so the whole training procedure would take about a month. I wonder whether this training time is normal?
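To sanity-check that per-epoch figure before committing to the full [500, 1000] schedule, a small timing callback can be attached to training. This is a sketch that assumes training goes through tf.keras model.fit / fit_generator; the usage line at the end is hypothetical.

import time
import tensorflow as tf

# Minimal per-epoch timer using the standard tf.keras callback hooks; it prints
# the wall-clock time of each epoch so the ~1 h/epoch figure can be verified early.
class EpochTimer(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self._t0 = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: {(time.time() - self._t0) / 60:.1f} min")

# Hypothetical usage inside main_train.py:
# model.fit(train_gen, epochs=np_epoches[1], callbacks=[EpochTimer()])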

jianqingzheng commented 1 week ago

It took me 2 weeks to train each model. You can raise the batch size if your GPU's memory allows it.