Training rpn I get a pause on the last step and training doesn't continue.

chamecall commented 5 years ago

I mean that if I have 100 iterations per epoch, training stops at iteration 99 and never goes any further.

chamecall commented 5 years ago

Here is the output I get:

`Using TensorFlow backend.
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/algernon/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING: Logging before flag parsing goes to stderr.
W0802 13:20:05.420754 140143857096512 deprecation_wrapper.py:119] From /home/algernon/frcnn-from-scratch-with-keras/train_rpn.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0802 13:20:05.420959 140143857096512 deprecation_wrapper.py:119] From /home/algernon/frcnn-from-scratch-with-keras/train_rpn.py:29: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-08-02 13:20:05.432460: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-02 13:20:05.437076: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-02 13:20:05.508452: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.508916: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x45456e0 executing computations on platform CUDA. Devices:
2019-08-02 13:20:05.508933: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1050 Ti, Compute Capability 6.1
2019-08-02 13:20:05.528040: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192895000 Hz
2019-08-02 13:20:05.528365: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x45fb9e0 executing computations on platform Host. Devices:
2019-08-02 13:20:05.528390: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-08-02 13:20:05.528615: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.529113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:01:00.0
2019-08-02 13:20:05.529329: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-02 13:20:05.530611: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-02 13:20:05.531723: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-02 13:20:05.532052: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-02 13:20:05.533878: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-02 13:20:05.535283: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-02 13:20:05.540088: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-02 13:20:05.540310: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.540998: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.541539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-02 13:20:05.541604: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-02 13:20:05.542985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-02 13:20:05.543006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-08-02 13:20:05.543016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-08-02 13:20:05.543298: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.543906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-02 13:20:05.544283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3391 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
data path: ['/home/algernon/samba/video_queue/omega-packaging/experiments/exp-6-contrasting(195 samples)/voc-dataset/VOC2007']
Parsing annotation files
{'pb_closed': 60, 'scanner': 22, 'pb_open': 23, 'pb_label': 47}
Training images per class:
{'bg': 0, 'pb_closed': 60, 'pb_label': 47, 'pb_open': 23, 'scanner': 22}
Num classes (including bg) = 5
Config has been written to config.pickle, and can be loaded when testing to ensure correct results
Num train samples 53
Num val samples 0
W0802 13:20:05.551901 140143857096512 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0802 13:20:05.552090 140143857096512 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0802 13:20:05.553911 140143857096512 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0802 13:20:05.574116 140143857096512 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3976: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

loading weights from vgg16_weights_tf_dim_ordering_tf_kernels.h5
Could not load pretrained model weights. Weights can be found in the keras application folder https://github.com/fchollet/keras/tree/master/keras/applications
W0802 13:20:05.713925 140143857096512 deprecation_wrapper.py:119] From /home/algernon/.local/lib/python3.6/site-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0802 13:20:05.727538 140143857096512 deprecation.py:323] From /home/algernon/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Starting training
Epoch 1/50
2019-08-02 13:20:18.680424: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-02 13:20:20.591851: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.10GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:20.909488: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.81GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:21.948207: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:22.006549: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.29GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:22.058185: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.29GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:23.506263: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.81GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-02 13:20:23.801280: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.10GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
9/10 [==========================>...] - ETA: 2s - loss: 6.2118 - rpn_out_class_loss: 5.9055 - rpn_out_regress_loss: 0.3064`

chamecall commented 5 years ago

I set up 10 iterations per epoch and it stopped on the ninth one.

kentaroy47 commented 5 years ago

@chamecall The problem was that validation was attempted at the end of the epoch even though there was no validation set. Removing the validation step solves this. I fixed this in the latest commit; please refer to train_rpn.py.
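
For anyone hitting the same hang, here is a minimal sketch of the kind of guard described in this comment. It is not the actual change made to train_rpn.py: `build_fit_kwargs` is a hypothetical helper, and the keyword names assume the Keras 2.x `fit_generator` signature. The idea is simply to pass validation arguments only when the dataset has validation samples (the logs in this thread report `Num val samples 0`).

```python
def build_fit_kwargs(train_gen, val_gen, num_val_samples, steps_per_epoch, epochs):
    """Hypothetical helper: collect keyword arguments for a Keras 2.x-style
    model.fit_generator(**kwargs) call, skipping validation when there is
    nothing to validate on."""
    kwargs = {
        "generator": train_gen,
        "steps_per_epoch": steps_per_epoch,
        "epochs": epochs,
    }
    # Only request validation if validation samples exist; otherwise the
    # end-of-epoch validation pass has no data to draw from.
    if num_val_samples > 0:
        kwargs["validation_data"] = val_gen
        kwargs["validation_steps"] = num_val_samples
    return kwargs


# Example mirroring the report in this thread: 10 steps per epoch, no validation set.
kwargs = build_fit_kwargs(iter([]), iter([]), num_val_samples=0,
                          steps_per_epoch=10, epochs=50)
print(sorted(kwargs))
```

With a compiled model, the result would be passed as `model.fit_generator(**kwargs)`. When the validation count is zero, no validation keys are set, so the epoch can finish without trying to draw from an empty validation generator.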

~/frcnn-from-scratch-with-keras$ python train_rpn.py --network mobilenetv2 -o pascal_voc -p ../VOCdevkit/
Using TensorFlow backend.
data path: ['../VOCdevkit/VOC2007']
Parsing annotation files
[Errno 2] No such file or directory: '../VOCdevkit/VOC2007/ImageSets/Main/test.txt'
{'dog': 538, 'cat': 389, 'car': 1644, 'person': 5447, 'chair': 1432, 'bottle': 634, 'diningtable': 310, 'pottedplant': 625, 'bird': 599, 'horse': 406, 'motorbike': 390, 'bus': 272, 'tvmonitor': 367, 'sofa': 425, 'boat': 398, 'cow': 356, 'aeroplane': 331, 'train': 328, 'sheep': 353, 'bicycle': 418}
Training images per class:
{'aeroplane': 331,
 'bg': 0,
 'bicycle': 418,
 'bird': 599,
 'boat': 398,
 'bottle': 634,
 'bus': 272,
 'car': 1644,
 'cat': 389,
 'chair': 1432,
 'cow': 356,
 'diningtable': 310,
 'dog': 538,
 'horse': 406,
 'motorbike': 390,
 'person': 5447,
 'pottedplant': 625,
 'sheep': 353,
 'sofa': 425,
 'train': 328,
 'tvmonitor': 367}
Num classes (including bg) = 21
Config has been written to config.pickle, and can be loaded when testing to ensure correct results
Num train samples 5011
Num val samples 0
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
loading weights from ./pretrain/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5
loaded weights!
Starting training
Epoch 1/50
1000/1000 [==============================] - 480s 480ms/step - loss: 6.4577 - rpn_out_class_loss: 6.2364 - rpn_out_regress_loss: 0.2213
Epoch 2/50
  36/1000 [>.............................] - ETA: 5:52 - loss: 6.4217 - rpn_out_class_loss: 6.1637 - rpn_out_regress_loss: 0.2580

kentaroy47 / frcnn-from-scratch-with-keras

Training rpn I get a pause on the last step and training doesn't continue. #17