Multi-GPU training for the "Learning to Simulate Complex Physics with Graph Networks" project

Hi, deepmind-research team

I am new to TensorFlow and deep learning, and I am very interested in the Learning to Simulate (Complex Physics with Graph Networks) project.

I downloaded the code and was able to train it on my own PC with a single GPU.

I want to speed up training by using multiple GPUs, but I am struggling to get that working.

Target: deepmind-research-master/learning_to_simulate/train.py

When I run the code, the process is loaded onto all 10 of my GPUs, but the actual training seems to run on only one of them (GPU 0).

1) Is the source code itself not compatible with multiple GPUs? Do I need to add extra code to the original?
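
One thing I noticed in the log below is that the Estimator config dump reports `'_train_distribute': None`. As far as I understand, this field comes from the `train_distribute` argument of `tf.estimator.RunConfig`, which accepts a `tf.distribute.Strategy` such as `tf.distribute.MirroredStrategy`, so `None` may mean that no distribution strategy is configured. I have not confirmed that this is the cause. Below is a minimal sketch, using only the Python standard library, for pulling that field out of a log line. The helper names `config_field` and `has_distribution_strategy` are my own and hypothetical, and the excerpt is a fragment copied from the log in this post.

```python
import re

# Fragment copied from the "Using config" line of the log in this post.
LOG_EXCERPT = (
    "'_train_distribute': None, '_device_fn': None, '_protocol': None, "
    "'_eval_distribute': None, '_experimental_distribute': None"
)


def config_field(log_text, key):
    """Return the raw text logged for `key` in the config dump, or None if absent.

    Hypothetical helper: it reads up to the next comma or closing brace, so it
    only suits simple scalar fields, not nested ones like '_session_config'.
    """
    match = re.search(r"'" + re.escape(key) + r"':\s*([^,}]+)", log_text)
    return match.group(1).strip() if match else None


def has_distribution_strategy(log_text):
    """True if the dump shows a non-None value for '_train_distribute'."""
    value = config_field(log_text, "_train_distribute")
    return value is not None and value != "None"


print("train_distribute:", config_field(LOG_EXCERPT, "_train_distribute"))
print("strategy configured:", has_distribution_strategy(LOG_EXCERPT))
```

This only inspects what the log reports. It does not change how training is placed on devices.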

The log output I get is as follows:

WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term INFO:tensorflow:Using default config. I0308 16:29:09.550897 140286859999040 estimator.py:1800] Using default config. INFO:tensorflow:Using config: {'_model_dir': '/data1/kjh/tmp/models/Water-3D', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f95dbe1cb38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} I0308 16:29:09.551425 140286859999040 estimator.py:212] Using config: {'_model_dir': '/data1/kjh/tmp/models/Water-3D', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': 
None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f95dbe1cb38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0308 16:29:09.565319 140286859999040 deprecation.py:323] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. WARNING:tensorflow:Entity <function _yield_value at 0x7f95eb6fee18> appears to be a generator function. It will not be converted by AutoGraph. W0308 16:29:13.016520 140286859999040 ag_logging.py:146] Entity <function _yield_value at 0x7f95eb6fee18> appears to be a generator function. It will not be converted by AutoGraph. INFO:tensorflow:Calling model_fn. I0308 16:29:13.681622 140286859999040 estimator.py:1148] Calling model_fn. WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. 
Instructions for updating: If using Keras pass _constraint arguments to layers. W0308 16:29:13.746787 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass _constraint arguments to layers. WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:127: calling TruncatedNormal.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0308 16:29:13.797485 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:127: calling TruncatedNormal.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:132: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0308 16:29:13.797735 140286859999040 deprecation.py:506] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/sonnet/python/modules/basic.py:132: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /data1/kjh/sources/ML/deepmind-research-master/learning_to_simulate/train_multi.py:348: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0308 16:29:15.019058 140286859999040 deprecation.py:323] From /data1/kjh/sources/ML/deepmind-research-master/learning_to_simulate/train_multi.py:348: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. 
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
" INFO:tensorflow:Done calling model_fn. I0308 16:29:20.284817 140286859999040 estimator.py:1150] Done calling model_fn. INFO:tensorflow:Create CheckpointSaverHook. I0308 16:29:20.286492 140286859999040 basic_session_run_hooks.py:541] Create CheckpointSaverHook. INFO:tensorflow:Graph was finalized. I0308 16:29:23.225510 140286859999040 monitored_session.py:240] Graph was finalized. 2021-03-08 16:29:23.225995: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2021-03-08 16:29:23.243941: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2199975000 Hz 2021-03-08 16:29:23.252155: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f971e45ef80 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-03-08 16:29:23.252237: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-08 16:29:23.258964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2021-03-08 16:29:27.614881: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9721112390 initialized for platform CUDA (this does not guarantee that XLA will be used). 
Devices: 2021-03-08 16:29:27.614958: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.614983: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615004: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615024: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615044: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (4): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615064: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (5): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615083: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (6): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615103: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (7): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615123: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (8): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.615143: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (9): TITAN V, Compute Capability 7.0 2021-03-08 16:29:27.629723: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 0 and 9, status: Internal: failed to enable peer access from 0x7f93bca4fe50 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.637082: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 1 and 9, status: Internal: failed to enable peer access from 0x7f93c4a75d10 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.643495: W 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 2 and 9, status: Internal: failed to enable peer access from 0x7f93cca1af20 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.648897: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 3 and 9, status: Internal: failed to enable peer access from 0x7f93c8a8d360 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.653301: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 4 and 9, status: Internal: failed to enable peer access from 0x7f93d0a66be0 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.656725: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 5 and 9, status: Internal: failed to enable peer access from 0x7f93d8a7e170 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.659129: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 6 and 9, status: Internal: failed to enable peer access from 0x7f93e4a57d80 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.660520: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 7 and 9, status: Internal: failed to enable peer access from 0x7f93eca5f860 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.660915: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 8 and 9, status: Internal: failed to enable peer access from 0x7f93c0a871b0 to 0x7f93e0a6e220: CUDA_ERROR_TOO_MANY_PEERS: peer 
mapping resources exhausted 2021-03-08 16:29:27.661068: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 0, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93bca4fe50: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661218: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 1, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c4a75d10: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661364: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 2, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93cca1af20: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661514: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 3, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c8a8d360: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661662: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 4, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93d0a66be0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661811: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 5, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93d8a7e170: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.661960: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 6, status: Internal: failed to enable peer access from 0x7f93e0a6e220 
to 0x7f93e4a57d80: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.662097: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 7, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93eca5f860: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.662235: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1596] Unable to enable peer access between device ordinals 9 and 8, status: Internal: failed to enable peer access from 0x7f93e0a6e220 to 0x7f93c0a871b0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted 2021-03-08 16:29:27.663876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0 2021-03-08 16:29:27.665444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0 2021-03-08 16:29:27.667014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0 2021-03-08 16:29:27.668594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0 2021-03-08 16:29:27.670180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0 2021-03-08 16:29:27.671761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0 2021-03-08 16:29:27.673334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V 
major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0 2021-03-08 16:29:27.674908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0 2021-03-08 16:29:27.676487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0 2021-03-08 16:29:27.678041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0 2021-03-08 16:29:27.678342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2021-03-08 16:29:27.681012: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2021-03-08 16:29:27.683799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2021-03-08 16:29:27.684131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2021-03-08 16:29:27.686678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2021-03-08 16:29:27.688619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2021-03-08 16:29:27.694115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2021-03-08 16:29:27.724532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 2021-03-08 16:29:27.724580: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2021-03-08 
16:29:27.742222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-03-08 16:29:27.742252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 4 5 6 7 8 9 2021-03-08 16:29:27.742268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y Y Y Y Y Y Y 2021-03-08 16:29:27.742279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y Y Y Y Y Y Y 2021-03-08 16:29:27.742289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y Y Y Y Y Y Y 2021-03-08 16:29:27.742299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N Y Y Y Y Y Y 2021-03-08 16:29:27.742309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 4: Y Y Y Y N Y Y Y Y Y 2021-03-08 16:29:27.742320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 5: Y Y Y Y Y N Y Y Y Y 2021-03-08 16:29:27.742329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 6: Y Y Y Y Y Y N Y Y Y 2021-03-08 16:29:27.742340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 7: Y Y Y Y Y Y Y N Y Y 2021-03-08 16:29:27.742350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 8: Y Y Y Y Y Y Y Y N Y 2021-03-08 16:29:27.742360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 9: Y Y Y Y Y Y Y Y Y N 2021-03-08 16:29:27.759199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11160 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:07:00.0, compute capability: 7.0) 2021-03-08 16:29:27.761781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11160 MB memory) -> physical GPU (device: 1, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0) 2021-03-08 16:29:27.763702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device 
(/job:localhost/replica:0/task:0/device:GPU:2 with 11160 MB memory) -> physical GPU (device: 2, name: TITAN V, pci bus id: 0000:0b:00.0, compute capability: 7.0) 2021-03-08 16:29:27.765556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11160 MB memory) -> physical GPU (device: 3, name: TITAN V, pci bus id: 0000:10:00.0, compute capability: 7.0) 2021-03-08 16:29:27.767968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 11160 MB memory) -> physical GPU (device: 4, name: TITAN V, pci bus id: 0000:11:00.0, compute capability: 7.0) 2021-03-08 16:29:27.770642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 11160 MB memory) -> physical GPU (device: 5, name: TITAN V, pci bus id: 0000:17:00.0, compute capability: 7.0) 2021-03-08 16:29:27.772554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 11160 MB memory) -> physical GPU (device: 6, name: TITAN V, pci bus id: 0000:18:00.0, compute capability: 7.0) 2021-03-08 16:29:27.775099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 11160 MB memory) -> physical GPU (device: 7, name: TITAN V, pci bus id: 0000:1c:00.0, compute capability: 7.0) 2021-03-08 16:29:27.777203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:8 with 11160 MB memory) -> physical GPU (device: 8, name: TITAN V, pci bus id: 0000:1d:00.0, compute capability: 7.0) 2021-03-08 16:29:27.779338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:9 with 11160 MB memory) -> physical GPU 
(device: 9, name: TITAN V, pci bus id: 0000:20:00.0, compute capability: 7.0) INFO:tensorflow:Restoring parameters from /data1/kjh/tmp/models/Water-3D/model.ckpt-0 I0308 16:29:27.791676 140286859999040 saver.py:1284] Restoring parameters from /data1/kjh/tmp/models/Water-3D/model.ckpt-0 WARNING:tensorflow:From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. W0308 16:29:30.046998 140286859999040 deprecation.py:323] From /root/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. INFO:tensorflow:Running local_init_op. I0308 16:29:30.892039 140286859999040 session_manager.py:500] Running local_init_op. INFO:tensorflow:Done running local_init_op. I0308 16:29:31.195132 140286859999040 session_manager.py:502] Done running local_init_op. INFO:tensorflow:Saving checkpoints for 0 into /data1/kjh/tmp/models/Water-3D/model.ckpt. I0308 16:29:37.925086 140286859999040 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data1/kjh/tmp/models/Water-3D/model.ckpt. 
2021-03-08 16:29:44.132731: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2021-03-08 16:29:45.118039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0 2021-03-08 16:29:45.122157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0 2021-03-08 16:29:45.125345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0 2021-03-08 16:29:45.127980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0 2021-03-08 16:29:45.129816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0 2021-03-08 16:29:45.134012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0 2021-03-08 16:29:45.135870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0 2021-03-08 16:29:45.140649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0 2021-03-08 16:29:45.144485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0 2021-03-08 16:29:45.146781: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0
2021-03-08 16:29:45.146883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-08 16:29:45.146920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-08 16:29:45.146954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-08 16:29:45.146992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-08 16:29:45.147026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-08 16:29:45.147063: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-08 16:29:45.147100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-08 16:29:45.224796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
2021-03-08 16:29:45.232023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:07:00.0
2021-03-08 16:29:45.238257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:08:00.0
2021-03-08 16:29:45.242457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:0b:00.0
2021-03-08 16:29:45.245995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:10:00.0
2021-03-08 16:29:45.248256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:11:00.0
2021-03-08 16:29:45.250634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:17:00.0
2021-03-08 16:29:45.258209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:18:00.0
2021-03-08 16:29:45.261620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1c:00.0
2021-03-08 16:29:45.266373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 8 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:1d:00.0
2021-03-08 16:29:45.268894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 9 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:20:00.0
2021-03-08 16:29:45.268977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-08 16:29:45.269017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-08 16:29:45.269052: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-08 16:29:45.269087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-08 16:29:45.269121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-08 16:29:45.269155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-08 16:29:45.269190: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-08 16:29:45.369259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
2021-03-08 16:29:45.370988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-08 16:29:45.371048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 4 5 6 7 8 9
2021-03-08 16:29:45.371092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y Y Y Y Y Y Y
2021-03-08 16:29:45.371130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y Y Y Y Y Y Y
2021-03-08 16:29:45.371169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y Y Y Y Y Y Y
2021-03-08 16:29:45.371207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N Y Y Y Y Y Y
2021-03-08 16:29:45.371245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 4: Y Y Y Y N Y Y Y Y Y
2021-03-08 16:29:45.371283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 5: Y Y Y Y Y N Y Y Y Y
2021-03-08 16:29:45.371321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 6: Y Y Y Y Y Y N Y Y Y
2021-03-08 16:29:45.371358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 7: Y Y Y Y Y Y Y N Y Y
2021-03-08 16:29:45.371395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 8: Y Y Y Y Y Y Y Y N Y
2021-03-08 16:29:45.371431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 9: Y Y Y Y Y Y Y Y Y N
2021-03-08 16:29:45.424755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11160 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:07:00.0, compute capability: 7.0)
2021-03-08 16:29:45.427476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11160 MB memory) -> physical GPU (device: 1, name: TITAN V, pci bus id: 0000:08:00.0, compute capability: 7.0)
2021-03-08 16:29:45.429903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11160 MB memory) -> physical GPU (device: 2, name: TITAN V, pci bus id: 0000:0b:00.0, compute capability: 7.0)
2021-03-08 16:29:45.435882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11160 MB memory) -> physical GPU (device: 3, name: TITAN V, pci bus id: 0000:10:00.0, compute capability: 7.0)
2021-03-08 16:29:45.442592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 11160 MB memory) -> physical GPU (device: 4, name: TITAN V, pci bus id: 0000:11:00.0, compute capability: 7.0)
2021-03-08 16:29:45.445266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 11160 MB memory) -> physical GPU (device: 5, name: TITAN V, pci bus id: 0000:17:00.0, compute capability: 7.0)
2021-03-08 16:29:45.447999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 11160 MB memory) -> physical GPU (device: 6, name: TITAN V, pci bus id: 0000:18:00.0, compute capability: 7.0)
2021-03-08 16:29:45.450748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 11160 MB memory) -> physical GPU (device: 7, name: TITAN V, pci bus id: 0000:1c:00.0, compute capability: 7.0)
2021-03-08 16:29:45.454166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:8 with 11160 MB memory) -> physical GPU (device: 8, name: TITAN V, pci bus id: 0000:1d:00.0, compute capability: 7.0)
2021-03-08 16:29:45.464795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:9 with 11160 MB memory) -> physical GPU (device: 9, name: TITAN V, pci bus id: 0000:20:00.0, compute capability: 7.0)

Multi-GPU training for "Learning to Simulate (Complex Physics with Graph Networks)" project #182