How can I check whether training is running on GPU?

chamecall commented 5 years ago

For example, I launched the training like this: `python3 train_frcnn.py --network mobilenetv1 -p ./VOCdevkit`. After that I ran `nvidia-smi` and didn't see any GPU memory consumption:

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0       689      G   /usr/lib/xorg/Xorg                           194MiB |
    |    0      2797      G   /proc/self/exe                                44MiB |
    |    0      6001      G   /usr/lib/firefox/firefox                       1MiB |
    |    0      9034      C   python3                                       43MiB |
    +-----------------------------------------------------------------------------+

Strictly speaking, how can I launch training on the GPU?
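One way to read the `nvidia-smi` process table programmatically is sketched below. It is only an illustration: the function name `compute_processes` and the regular expression are assumptions made for this example, and the row layout is assumed to match the table shown in this thread. In `nvidia-smi`, type `C` marks a compute process and `G` a graphics process, so a training job that really uses the GPU should appear as a `C` row, typically with far more memory than 43MiB.

```python
import re

# Matches one row of the "Processes" table, e.g.
# "|    0      9034      C   python3                     43MiB |"
ROW = re.compile(r"\|\s*(\d+)\s+(\d+)\s+([CG+]+)\s+(\S.*?)\s+(\d+)MiB\s*\|")

def compute_processes(smi_output):
    """Return (gpu, pid, name, MiB) for every compute-type ("C") process."""
    found = []
    for gpu, pid, ptype, name, mib in ROW.findall(smi_output):
        if "C" in ptype:  # "C" or "C+G" rows are compute processes
            found.append((int(gpu), int(pid), name, int(mib)))
    return found

# Two rows copied from the table in this thread.
sample = """
|    0       689      G   /usr/lib/xorg/Xorg                           194MiB |
|    0      9034      C   python3                                       43MiB |
"""
print(compute_processes(sample))
```

In practice the text would come from running `nvidia-smi` while training is in progress, for example via `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.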

kentaroy47 commented 5 years ago

This is weird. Training should run on the GPU if you installed keras-gpu. Does the GPU get used properly when you train other DNNs with Keras?

chamecall commented 5 years ago

> This is weird. training should be conducted on GPU if you installed keras-gpu.

As I found out, there's no keras-gpu as such. There are different backends, with or without GPU support. So I installed tensorflow-gpu.

> Does it properly use GPU with other DNN training using keras?

I haven't trained any DNN with Keras yet.
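A quick way to test the backend independently of any training script is to ask TensorFlow which devices it registered. The sketch below assumes the TensorFlow backend mentioned above; `visible_gpus` is a hypothetical helper name, and `tensorflow.python.client.device_lib` is an internal but widely used module, so treat it as an assumption about the installed version rather than a guaranteed API.

```python
def visible_gpus():
    """Return names of GPU devices TensorFlow registered, or None if TF is absent."""
    try:
        # Internal helper module; assumed to be present in the installed TensorFlow.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed")
elif gpus:
    print("TensorFlow registered GPU devices:", gpus)
else:
    print("No GPU registered; training would fall back to the CPU")
```

An empty list here means the problem lies in the TensorFlow/CUDA installation and not in the training code.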

chamecall commented 5 years ago

@kentaroy47 my training output

`Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0718 15:39:21.993449 140716638869312 deprecation_wrapper.py:119] From train_frcnn.py:22: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0718 15:39:21.993684 140716638869312 deprecation_wrapper.py:119] From train_frcnn.py:24: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-18 15:39:22.005688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-18 15:39:22.010477: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-18 15:39:22.090010: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-18 15:39:22.090624: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x518d5b0 executing computations on platform CUDA. Devices:
2019-07-18 15:39:22.090642: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1050 Ti, Compute Capability 6.1
2019-07-18 15:39:22.111158: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192845000 Hz
2019-07-18 15:39:22.111430: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5581ce0 executing computations on platform Host. Devices:
2019-07-18 15:39:22.111452: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-07-18 15:39:22.111716: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-18 15:39:22.112506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:01:00.0
2019-07-18 15:39:22.112632: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.112716: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.112794: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.112870: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.112947: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.113023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-07-18 15:39:22.118243: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-18 15:39:22.118285: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-07-18 15:39:22.118312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-18 15:39:22.118326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-07-18 15:39:22.118337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
data path: ['./dataset/VOCdevkit/VOC2007']
Parsing annotation files
[Errno 2] No such file or directory: './dataset/VOCdevkit/VOC2007/ImageSets/Main/test.txt'
Training images per class: {'aeroplane': 331, 'bg': 0, 'bicycle': 418, 'bird': 599, 'boat': 398, 'bottle': 634, 'bus': 272, 'car': 1644, 'cat': 389, 'chair': 1432, 'cow': 356, 'diningtable': 310, 'dog': 538, 'horse': 406, 'motorbike': 390, 'person': 5447, 'pottedplant': 625, 'sheep': 353, 'sofa': 425, 'train': 328, 'tvmonitor': 367}
Num classes (including bg) = 21
Config has been written to config.pickle, and can be loaded when testing to ensure correct results
Num train samples 5011
Num val samples 0
W0718 15:39:22.769392 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0718 15:39:22.769683 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0718 15:39:22.778257 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2019-07-18 15:39:22.814931: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0718 15:39:22.816322 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

W0718 15:39:24.377749 140716638869312 deprecation_wrapper.py:119] From /home/algernon/frcnn-from-scratch-with-keras/keras_frcnn/RoiPoolingConv.py:105: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

W0718 15:39:26.038717 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3980: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

loading weights from ./pretrain/mobilenet_1_0_224_tf.h5
W0718 15:39:27.599630 140716638869312 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0718 15:39:27.612688 140716638869312 deprecation.py:323] From /home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
loading previous rpn model..
no previous model was loaded
Starting training
Epoch 1/50
1/1000 [..............................] - ETA: 4:07:28 - rpn_cls: 3.8474 - rpn_regr: 0.1748 - detector_cls: 3.0445 - detector_regr`

etc..

chamecall commented 5 years ago

It seems that my CUDA version doesn't match my TensorFlow version, doesn't it?
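The mismatch can be read straight from the log: each `Could not dlopen library` line names the exact shared library, including its version suffix, that this TensorFlow build expects. A minimal sketch of extracting that information follows; the helper names `missing_cuda_libraries` and `expected_versions` are made up for this example, and the two sample lines are copied from the output above.

```python
import re

MISSING = re.compile(r"Could not dlopen library '([^']+)'")
VERSION = re.compile(r"\.so\.(\d+(?:\.\d+)*)$")

def missing_cuda_libraries(log_text):
    """Return the library names TensorFlow failed to dlopen, in log order."""
    return MISSING.findall(log_text)

def expected_versions(log_text):
    """Return the set of version suffixes of the missing libraries."""
    versions = set()
    for name in missing_cuda_libraries(log_text):
        match = VERSION.search(name)
        if match:
            versions.add(match.group(1))
    return versions

log = (
    "2019-07-18 15:39:22.112632: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] "
    "Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
    "2019-07-18 15:39:22.112716: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] "
    "Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
)
print(missing_cuda_libraries(log))
print(expected_versions(log))
```

Here every missing library carries the suffix `10.0`, which suggests this TensorFlow build expects the CUDA 10.0 runtime.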

chamecall commented 5 years ago

Yes, the problem was CUDA 10.1. I downgraded to CUDA 10.0 (the version of the libraries TensorFlow failed to open in the log above) and training successfully launched on the GPU.
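To confirm a fix like this before starting a long training run, one option is to try opening the same shared libraries that TensorFlow tried to open. The sketch below uses `ctypes` for that; the list of names is taken from the `Could not dlopen` lines in the log above, while `can_load` and `unloadable` are hypothetical helper names. Other TensorFlow builds may look for different versions, so the list is an assumption tied to this thread.

```python
import ctypes

# Library names copied from the "Could not dlopen library" lines in the log.
CUDA_10_0_LIBS = [
    "libcudart.so.10.0", "libcublas.so.10.0", "libcufft.so.10.0",
    "libcurand.so.10.0", "libcusolver.so.10.0", "libcusparse.so.10.0",
]

def can_load(name):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False

def unloadable(names=CUDA_10_0_LIBS):
    """Return the subset of names that the dynamic loader cannot open."""
    return [n for n in names if not can_load(n)]

print("Libraries that cannot be loaded:", unloadable())
```

An empty list suggests the loader can find the CUDA 10.0 runtime. If names are still listed, checking `LD_LIBRARY_PATH` and the CUDA install location is a reasonable next step.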

kentaroy47 / frcnn-from-scratch-with-keras

How can I check whether training is running on GPU? #10