google-deepmind / deepmind-research

This repository contains implementations and illustrative code to accompany DeepMind publications
Apache License 2.0

Multi-GPU training for "Learning to Simulate (Complex Physics with Graph Networks)" project #182

Closed JihoeKwon closed 2 years ago

JihoeKwon commented 3 years ago

Hi, deepmind-research team

I am new to TensorFlow and deep learning, and I am very interested in the Learning to Simulate (complex physics with GNNs) project.

I downloaded the code and was able to train it on my own PC with a single GPU.

I want to speed up training by using multiple GPUs, but I am struggling with that.

Target: deepmind-research-master/learning_to_simulate/train.py

When I run the code, the process is loaded onto all 10 of my GPUs, but the actual training seems to run on only one of them (GPU 0).

1) Isn't the source code itself compatible with multiple GPUs? Do I need to add extra code to the original code?
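One thing that can be checked directly from the log in this post: the Estimator config dump reports `'_train_distribute': None` and `'_num_worker_replicas': 1`. That suggests no distribution strategy is configured, which would be consistent with all GPUs being visible while training runs on one. The snippet below is only a minimal sketch of that check. The helper names `config_value` and `uses_distribution_strategy` are hypothetical, and `LOG_EXCERPT` is a shortened copy of the config line from the log, not the full output.

```python
import re

# Shortened excerpt of the "Using config" line from the log in this post.
# Only a few keys are kept; the values are copied from the log.
LOG_EXCERPT = (
    "INFO:tensorflow:Using config: {'_model_dir': '/data1/kjh/tmp/models/Water-3D', "
    "'_train_distribute': None, '_device_fn': None, '_eval_distribute': None, "
    "'_num_ps_replicas': 0, '_num_worker_replicas': 1}"
)


def config_value(log_text, key):
    """Return the raw text logged for `key` in a config dump, or None if absent."""
    # Capture everything after "'key':" up to the next comma or closing brace.
    match = re.search(r"'" + re.escape(key) + r"':\s*([^,}]+)", log_text)
    return match.group(1).strip() if match else None


def uses_distribution_strategy(log_text):
    """True if the logged config shows a non-None `_train_distribute` entry."""
    value = config_value(log_text, "_train_distribute")
    return value is not None and value != "None"


print(config_value(LOG_EXCERPT, "_train_distribute"))
print(config_value(LOG_EXCERPT, "_num_worker_replicas"))
print(uses_distribution_strategy(LOG_EXCERPT))
```

In TensorFlow, this field is normally set by passing `train_distribute` (for example a `tf.distribute.MirroredStrategy` instance) to `tf.estimator.RunConfig`. Whether the model code in this repository works unchanged under such a strategy is an open question here, not something the log shows.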

The log output I got is as follows:

WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term INFO:tensorflow:Using default config. I0308 16:29:09.550897 140286859999040 estimator.py:1800] Using default config. INFO:tensorflow:Using config: {'_model_dir': '/data1/kjh/tmp/models/Water-3D', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f95dbe1cb38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} I0308 16:29:09.551425 140286859999040 estimator.py:212] Using config: {'_model_dir': '/data1/kjh/tmp/models/Water-3D', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': 
None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f95dbe1cb38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0308 16:29:09.565319 140286859999040 deprecation.py:323] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. WARNING:tensorflow:Entity <function _yield_value at 0x7f95eb6fee18> appears to be a generator function. It will not be converted by AutoGraph. W0308 16:29:13.016520 140286859999040 ag_logging.py:146] Entity <function _yield_value at 0x7f95eb6fee18> appears to be a generator function. It will not be converted by AutoGraph. INFO:tensorflow:Calling model_fn. I0308 16:29:13.681622 140286859999040 estimator.py:1148] Calling model_fn. WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. 
Instructions for updating: If using Keras pass _constraint arguments to layers. W0308 16:29:13.746787 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass _constraint arguments to layers. WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:127: calling TruncatedNormal.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0308 16:29:13.797485 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:127: calling TruncatedNormal.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:132: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0308 16:29:13.797735 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:132: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /data1/kjh/sources/ML/deepmind-research-master/learning_to_simulate/train_multi.py:348: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0308 16:29:15.019058 140286859999040 deprecation.py:323] From /data1/kjh/sources/ML/deepmind-research-master/learning_to_simulate/train_multi.py:348: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. 
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
" INFO:tensorflow:Done calling model_fn. I0308 16:29:20.284817 140286859999040 estimator.py:1150] Done calling model_fn. INFO:tensorflow:Create CheckpointSaverHook. I0308 16:29:20.286492 140286859999040 basic_session_run_hooks.py:541] Create CheckpointSaverHook. INFO:tensorflow:Graph was finalized. I0308 16:29:23.225510 140286859999040 monitored_session.py:240] Graph was finalized. 2021-03-08 16:29:23.225995: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2021-03-08 16:29:23.243941: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2199975000 Hz 2021-03-08 16:29:23.252155: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f971e45ef80 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-03-08 16:29:23.252237: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-08 16:29:23.258964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2021-03-08 16:29:27.614881: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9721112390 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices: 2021-03-08 16:29:27.614958: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.614983: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615004: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615024: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615044: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (4): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615064: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (5): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615083: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (6): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615103: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (7): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615123: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (8): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615143: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (9): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.629723: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 0 and 9, status: Internal: failed to enable peer access from 0x7f93bca4fe50 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.637082: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 1 and 9, status: Internal: failed to enable peer access from 0x7f93c4a75d10 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.643495: W 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 2 and 9, status: Internal: failed to enable peer access from 0x7f93cca1af20 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.648897: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 3 and 9, status: Internal: failed to enable peer access from 0x7f93c8a8d360 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.653301: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 4 and 9, status: Internal: failed to enable peer access from 0x7f93d0a66be0 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.656725: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 5 and 9, status: Internal: failed to enable peer access from 0x7f93d8a7e170 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.659129: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 6 and 9, status: Internal: failed to enable peer access from 0x7f93e4a57d80 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.660520: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 7 and 9, status: Internal: failed to enable peer access from 0x7f93eca5f860 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.660915: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 8 and 9, status: Internal: failed to enable peer access from 0x7f93c0a871b0 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer 
mapping resources exhausted 2021-03-08 16:29:27.661068: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 0, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93bca4fe50: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661218: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 1, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c4a75d10: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661364: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 2, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93cca1af20: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661514: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 3, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c8a8d360: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661662: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 4, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93d0a66be0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661811: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 5, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93d8a7e170: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661960: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 6, status: Internal: failed to enable peer access from 0x7f93e0a6e220 
to 0x7f93e4a57d80: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.662097: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 7, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93eca5f860: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.662235: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 8, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c0a871b0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.663876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0 2021-03-08 16:29:27.665444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0 2021-03-08 16:29:27.667014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0 2021-03-08 16:29:27.668594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0 2021-03-08 16:29:27.670180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0 2021-03-08 16:29:27.671761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0 2021-03-08 16:29:27.673334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V 
major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0 2021-03-08 16:29:27.674908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0 2021-03-08 16:29:27.676487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0 2021-03-08 16:29:27.678041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0 2021-03-08 16:29:27.678342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2021-03-08 16:29:27.681012: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2021-03-08 16:29:27.683799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2021-03-08 16:29:27.684131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2021-03-08 16:29:27.686678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2021-03-08 16:29:27.688619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2021-03-08 16:29:27.694115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2021-03-08 16:29:27.724532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 2021-03-08 16:29:27.724580: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2021-03-08 
16:29:27.742222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-03-08 16:29:27.742252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 4 5 6 7 8 9 2021-03-08 16:29:27.742268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y Y Y Y Y Y Y 2021-03-08 16:29:27.742279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y Y Y Y Y Y Y 2021-03-08 16:29:27.742289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y Y Y Y Y Y Y 2021-03-08 16:29:27.742299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N Y Y Y Y Y Y 2021-03-08 16:29:27.742309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 4: Y Y Y Y N Y Y Y Y Y 2021-03-08 16:29:27.742320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 5: Y Y Y Y Y N Y Y Y Y 2021-03-08 16:29:27.742329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 6: Y Y Y Y Y Y N Y Y Y 2021-03-08 16:29:27.742340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 7: Y Y Y Y Y Y Y N Y Y 2021-03-08 16:29:27.742350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 8: Y Y Y Y Y Y Y Y N Y 2021-03-08 16:29:27.742360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 9: Y Y Y Y Y Y Y Y Y N 2021-03-08 16:29:27.759199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11160 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:07:00.0, compute capability: 7.0) 2021-03-08 16:29:27.761781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11160 MB memory) -> physical GPU (device: 1, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0) 2021-03-08 16:29:27.763702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device 
(/job:localhost/replica:0/task:0/device:GPU:2 with 11160 MB memory) -> physical GPU (device: 2, name: TITAN V, pci bus id: 0000:0b:00.0, compute capability: 7.0) 2021-03-08 16:29:27.765556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11160 MB memory) -> physical GPU (device: 3, name: TITAN V, pci bus id: 0000:10:00.0, compute capability: 7.0) 2021-03-08 16:29:27.767968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 11160 MB memory) -> physical GPU (device: 4, name: TITAN V, pci bus id: 0000:11:00.0, compute capability: 7.0) 2021-03-08 16:29:27.770642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 11160 MB memory) -> physical GPU (device: 5, name: TITAN V, pci bus id: 0000:17:00.0, compute capability: 7.0) 2021-03-08 16:29:27.772554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 11160 MB memory) -> physical GPU (device: 6, name: TITAN V, pci bus id: 0000:18:00.0, compute capability: 7.0) 2021-03-08 16:29:27.775099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 11160 MB memory) -> physical GPU (device: 7, name: TITAN V, pci bus id: 0000:1c:00.0, compute capability: 7.0) 2021-03-08 16:29:27.777203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:8 with 11160 MB memory) -> physical GPU (device: 8, name: TITAN V, pci bus id: 0000:1d:00.0, compute capability: 7.0) 2021-03-08 16:29:27.779338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:9 with 11160 MB memory) -> physical GPU 
(device: 9, name: TITAN V, pci bus id: 0000:20:00.0, compute capability: 7.0) INFO:tensorflow:Restoring parameters from /data1/kjh/tmp/models/Water-3D/model.ckpt-0 I0308 16:29:27.791676 140286859999040 saver.py:1284] Restoring parameters from /data1/kjh/tmp/models/Water-3D/model.ckpt-0 WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. W0308 16:29:30.046998 140286859999040 deprecation.py:323] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. INFO:tensorflow:Running local_init_op. I0308 16:29:30.892039 140286859999040 session_manager.py:500] Running local_init_op. INFO:tensorflow:Done running local_init_op. I0308 16:29:31.195132 140286859999040 session_manager.py:502] Done running local_init_op. INFO:tensorflow:Saving checkpoints for 0 into /data1/kjh/tmp/models/Water-3D/model.ckpt. I0308 16:29:37.925086 140286859999040 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data1/kjh/tmp/models/Water-3D/model.ckpt. 
2021-03-08 16:29:44.132731: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2021-03-08 16:29:45.118039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0 2021-03-08 16:29:45.122157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0 2021-03-08 16:29:45.125345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0 2021-03-08 16:29:45.127980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0 2021-03-08 16:29:45.129816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0 2021-03-08 16:29:45.134012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0 2021-03-08 16:29:45.135870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0 2021-03-08 16:29:45.140649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0 2021-03-08 16:29:45.144485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0 2021-03-08 16:29:45.146781: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0
2021-03-08 16:29:45.146883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-08 16:29:45.146920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-08 16:29:45.146954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-08 16:29:45.146992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-08 16:29:45.147026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-08 16:29:45.147063: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-08 16:29:45.147100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-08 16:29:45.224796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
2021-03-08 16:29:45.232023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0
2021-03-08 16:29:45.238257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0
2021-03-08 16:29:45.242457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0
2021-03-08 16:29:45.245995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0
2021-03-08 16:29:45.248256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0
2021-03-08 16:29:45.250634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0
2021-03-08 16:29:45.258209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0
2021-03-08 16:29:45.261620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0
2021-03-08 16:29:45.266373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0
2021-03-08 16:29:45.268894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0
2021-03-08 16:29:45.268977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-08 16:29:45.269017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-08 16:29:45.269052: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-08 16:29:45.269087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-08 16:29:45.269121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-08 16:29:45.269155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-08 16:29:45.269190: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-08 16:29:45.369259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
2021-03-08 16:29:45.370988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-08 16:29:45.371048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 4 5 6 7 8 9
2021-03-08 16:29:45.371092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y Y Y Y Y Y Y
2021-03-08 16:29:45.371130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y Y Y Y Y Y Y
2021-03-08 16:29:45.371169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y Y Y Y Y Y Y
2021-03-08 16:29:45.371207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N Y Y Y Y Y Y
2021-03-08 16:29:45.371245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 4: Y Y Y Y N Y Y Y Y Y
2021-03-08 16:29:45.371283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 5: Y Y Y Y Y N Y Y Y Y
2021-03-08 16:29:45.371321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 6: Y Y Y Y Y Y N Y Y Y
2021-03-08 16:29:45.371358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 7: Y Y Y Y Y Y Y N Y Y
2021-03-08 16:29:45.371395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 8: Y Y Y Y Y Y Y Y N Y
2021-03-08 16:29:45.371431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 9: Y Y Y Y Y Y Y Y Y N
2021-03-08 16:29:45.424755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11160 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:07:00.0, compute capability: 7.0)
2021-03-08 16:29:45.427476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11160 MB memory) -> physical GPU (device: 1, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0)
2021-03-08 16:29:45.429903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11160 MB memory) -> physical GPU (device: 2, name: TITAN V, pci bus id: 0000:0b:00.0, compute capability: 7.0)
2021-03-08 16:29:45.435882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11160 MB memory) -> physical GPU (device: 3, name: TITAN V, pci bus id: 0000:10:00.0, compute capability: 7.0)
2021-03-08 16:29:45.442592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 11160 MB memory) -> physical GPU (device: 4, name: TITAN V, pci bus id: 0000:11:00.0, compute capability: 7.0)
2021-03-08 16:29:45.445266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 11160 MB memory) -> physical GPU (device: 5, name: TITAN V, pci bus id: 0000:17:00.0, compute capability: 7.0)
2021-03-08 16:29:45.447999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 11160 MB memory) -> physical GPU (device: 6, name: TITAN V, pci bus id: 0000:18:00.0, compute capability: 7.0)
2021-03-08 16:29:45.450748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 11160 MB memory) -> physical GPU (device: 7, name: TITAN V, pci bus id: 0000:1c:00.0, compute capability: 7.0)
2021-03-08 16:29:45.454166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:8 with 11160 MB memory) -> physical GPU (device: 8, name: TITAN V, pci bus id: 0000:1d:00.0, compute capability: 7.0)
2021-03-08 16:29:45.464795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:9 with 11160 MB memory) -> physical GPU (device: 9, name: TITAN V, pci bus id: 0000:20:00.0, compute capability: 7.0)

alvarosg commented 3 years ago

Isn't the source code itself compatible with multi-GPUs? Should I put some extra-code into the original code?

Unfortunately, we did not design that specific piece of code to run on multiple GPUs (nor did we ever try running it on multiple GPUs), so it is hard to predict what the issue may be.
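For anyone who wants to experiment anyway, here is a minimal, untested sketch of the two knobs that usually matter in this situation. The log above reports `'_train_distribute': None`, which is consistent with training running on a single device even though TensorFlow creates a device for every visible GPU. The helper names `limit_visible_gpus` and `make_multi_gpu_run_config` are hypothetical and not part of `train.py`; `CUDA_VISIBLE_DEVICES`, `tf.distribute.MirroredStrategy`, and the `train_distribute` argument of `tf.estimator.RunConfig` are standard CUDA/TensorFlow features. Whether the model and input pipeline in this project work under a distribution strategy without further changes is an open assumption.

```python
import os


def limit_visible_gpus(gpu_ids):
    """Restrict which GPUs CUDA exposes to this process.

    Hypothetical helper. It must run before TensorFlow initializes CUDA,
    otherwise TensorFlow still claims memory on every GPU.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def make_multi_gpu_run_config(model_dir):
    """Build a RunConfig that asks the Estimator to mirror training across GPUs.

    Hypothetical helper, untested with this project. The input_fn would need
    to return a tf.data.Dataset for the strategy to split batches.
    """
    import tensorflow as tf  # imported lazily so the helper above works without TF

    strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
    return tf.estimator.RunConfig(model_dir=model_dir, train_distribute=strategy)


# Example: keep a single-GPU run from reserving memory on all ten cards.
limit_visible_gpus([0])
```

The resulting config would be passed as `config=` when constructing the `tf.estimator.Estimator`. Restricting the visible devices does not speed anything up; it only stops a single-GPU run from occupying the other cards.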