tensorpack / tensorpack

FasterRCNN example - prediction yields no results? #1289

Closed by ghost 5 years ago

ghost commented 5 years ago

1. What you did:

(1) If you're using examples, what's the command you run:

python predict.py --predict ../data/training_data/COCO/train2014/COCO_train2014_000000000009.jpg --load ../data/tensorpack_logs/checkpoint

(2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

I used the FasterRCNN example as of 242dc71cafb9642e68a2bfb58bcf6ad45ccbb35c, only changing the directories.

2. What you observed:

Logs from the GPU cluster I trained on:

2019-07-27 13:05:31.701700: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-27 13:05:32.197451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2019-07-27 13:05:32.224220: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-27 13:05:32.472828: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-27 13:05:32.604611: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-27 13:05:32.666283: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-27 13:05:32.967044: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-27 13:05:33.187441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-27 13:05:33.744101: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-27 13:05:33.747219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
[0727 13:05:33 @config.py:325] Config: ------------------------------------------
{'BACKBONE': {'FREEZE_AFFINE': False,
              'FREEZE_AT': 2,
              'NORM': 'FreezeBN',
              'RESNET_NUM_BLOCKS': [3, 4, 6, 3],
              'STRIDE_1X1': False,
              'TF_PAD_MODE': False,
              'WEIGHTS': '/share/lab-backedup/tensorpack/data/weights/ImageNet-R50-GroupNorm32-AlignPadding.npz'},
 'CASCADE': {'BBOX_REG_WEIGHTS': [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0],
                                  [30.0, 30.0, 15.0, 15.0]],
             'IOUS': [0.5, 0.6, 0.7]},
 'DATA': {'ABSOLUTE_COORD': True,
          'BASEDIR': '/share/lab-backedup/tensorpack/data/training_data/COCO',
          'CLASS_NAMES': ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
                          'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
                          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
                          'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
                          'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
                          'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
                          'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
                          'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
                          'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
                          'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
                          'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
                          'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
                          'hair drier', 'toothbrush'],
          'NUM_CATEGORY': 80,
          'NUM_WORKERS': 10,
          'TRAIN': ('coco_train2014', 'coco_valminusminival2014'),
          'VAL': ('coco_minival2014',)},
 'FPN': {'ANCHOR_STRIDES': (4, 8, 16, 32, 64),
         'CASCADE': False,
         'FRCNN_CONV_HEAD_DIM': 256,
         'FRCNN_FC_HEAD_DIM': 1024,
         'FRCNN_HEAD_FUNC': 'fastrcnn_2fc_head',
         'MRCNN_HEAD_FUNC': 'maskrcnn_up4conv_head',
         'NORM': 'None',
         'NUM_CHANNEL': 256,
         'PROPOSAL_MODE': 'Level',
         'RESOLUTION_REQUIREMENT': 32},
 'FRCNN': {'BATCH_PER_IM': 512,
           'BBOX_REG_WEIGHTS': [10.0, 10.0, 5.0, 5.0],
           'FG_RATIO': 0.25,
           'FG_THRESH': 0.5},
 'MODE_FPN': False,
 'MODE_MASK': False,
 'MRCNN': {'HEAD_DIM': 256},
 'PREPROC': {'MAX_SIZE': 1333,
             'PIXEL_MEAN': [123.675, 116.28, 103.53],
             'PIXEL_STD': [58.395, 57.12, 57.375],
             'TEST_SHORT_EDGE_SIZE': 800,
             'TRAIN_SHORT_EDGE_SIZE': [800, 800]},
 'RPN': {'ANCHOR_RATIOS': (0.5, 1.0, 2.0),
         'ANCHOR_SIZES': (32, 64, 128, 256, 512),
         'ANCHOR_STRIDE': 16,
         'BATCH_PER_IM': 256,
         'CROWD_OVERLAP_THRESH': 9.99,
         'FG_RATIO': 0.5,
         'HEAD_DIM': 1024,
         'MIN_SIZE': 0,
         'NEGATIVE_ANCHOR_THRESH': 0.3,
         'NUM_ANCHOR': 15,
         'POSITIVE_ANCHOR_THRESH': 0.7,
         'PROPOSAL_NMS_THRESH': 0.7,
         'TEST_PER_LEVEL_NMS_TOPK': 1000,
         'TEST_POST_NMS_TOPK': 1000,
         'TEST_PRE_NMS_TOPK': 6000,
         'TRAIN_PER_LEVEL_NMS_TOPK': 2000,
         'TRAIN_POST_NMS_TOPK': 2000,
         'TRAIN_PRE_NMS_TOPK': 12000},
 'TEST': {'FRCNN_NMS_THRESH': 0.5,
          'RESULTS_PER_IM': 100,
          'RESULT_SCORE_THRESH': 0.05,
          'RESULT_SCORE_THRESH_VIS': 0.5},
 'TRAIN': {'BASE_LR': 0.01,
           'EVAL_PERIOD': 25,
           'LR_SCHEDULE': [120000, 160000, 180000],
           'NUM_GPUS': 1,
           'STARTING_EPOCH': 1,
           'STEPS_PER_EPOCH': 500,
           'WARMUP': 1000,
           'WARMUP_INIT_LR': 0.0033000000000000004,
           'WEIGHT_DECAY': 0.0001},
 'TRAINER': 'replicated'}
[0727 13:05:33 @sesscreate.py:38] WRN User-provided custom session config may not work due to TF bugs. See https://github.com/tensorpack/tensorpack/issues/497 for workarounds.
[0727 13:05:34 @registry.py:90] 'conv0': [1, 3, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:34 @registry.py:90] 'pool0': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:34 @registry.py:90] 'group0/block0/conv1': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:34 @registry.py:90] 'group0/block0/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:34 @registry.py:90] 'group0/block0/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:34 @registry.py:90] 'group0/block0/convshortcut': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block1/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block1/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block1/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block2/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block2/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:05:35 @registry.py:90] 'group0/block2/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block0/conv1': [1, 256, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block0/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block0/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block0/convshortcut': [1, 256, ?, ?] --> [1, 512, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block1/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block1/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block1/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block2/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block2/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block2/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block3/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block3/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:05:35 @registry.py:90] 'group1/block3/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:05:35 @registry.py:90] 'group2/block0/conv1': [1, 512, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:35 @registry.py:90] 'group2/block0/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block0/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block0/convshortcut': [1, 512, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block1/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block1/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block1/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block2/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block2/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block2/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block3/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block3/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block3/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block4/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block4/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block4/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block5/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block5/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:05:36 @registry.py:90] 'group2/block5/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:80] 'rpn' input: [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90]   'rpn/conv0': [1, 1024, ?, ?] --> [1, 1024, ?, ?]
[0727 13:05:36 @registry.py:90]   'rpn/class': [1, 1024, ?, ?] --> [1, 15, ?, ?]
[0727 13:05:36 @registry.py:90]   'rpn/box': [1, 1024, ?, ?] --> [1, 60, ?, ?]
[0727 13:05:36 @registry.py:93] 'rpn' output: [?, ?, 15], [?, ?, 15, 4]
[0727 13:05:37 @registry.py:90] 'group3/block0/conv1': [?, 1024, 14, 14] --> [?, 512, 14, 14]
[0727 13:05:37 @registry.py:90] 'group3/block0/conv2': [?, 512, 15, 15] --> [?, 512, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block0/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block0/convshortcut': [?, 1024, 13, 13] --> [?, 2048, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block1/conv1': [?, 2048, 7, 7] --> [?, 512, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block1/conv2': [?, 512, 7, 7] --> [?, 512, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block1/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block2/conv1': [?, 2048, 7, 7] --> [?, 512, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block2/conv2': [?, 512, 7, 7] --> [?, 512, 7, 7]
[0727 13:05:37 @registry.py:90] 'group3/block2/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:05:37 @registry.py:90] 'gap': [?, 2048, 7, 7] --> [?, 2048]
[0727 13:05:37 @registry.py:80] 'fastrcnn' input: [?, 2048]
[0727 13:05:37 @registry.py:90]   'fastrcnn/class': [?, 2048] --> [?, 81]
[0727 13:05:37 @registry.py:90]   'fastrcnn/box': [?, 2048] --> [?, 324]
[0727 13:05:37 @registry.py:93] 'fastrcnn' output: [?, 81], [?, 81, 4]
[0727 13:05:38 @collection.py:146] New collections created in tower : tf.GraphKeys.MODEL_VARIABLES of size 55
[0727 13:05:38 @sessinit.py:87] WRN The following variables are in the checkpoint, but not found in the graph: global_step, learning_rate
2019-07-27 13:05:38.402872: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2019-07-27 13:05:38.788042: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6748250 executing computations on platform CUDA. Devices:
2019-07-27 13:05:38.788150: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): TITAN Xp, Compute Capability 6.1
2019-07-27 13:05:38.814740: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499810000 Hz
2019-07-27 13:05:38.825117: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6925e30 executing computations on platform Host. Devices:
2019-07-27 13:05:38.825211: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-27 13:05:38.827034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2019-07-27 13:05:38.827520: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-27 13:05:38.827573: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-27 13:05:38.827641: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-27 13:05:38.827700: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-27 13:05:38.827759: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-27 13:05:38.827817: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-27 13:05:38.827875: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-27 13:05:38.829914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-27 13:05:38.829993: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-27 13:05:38.844319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-27 13:05:38.844365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-27 13:05:38.844383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-27 13:05:38.847039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12074 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
[0727 13:05:49 @sessinit.py:114] Restoring checkpoint from ../data/tensorpack_logs/model-300000 ...
2019-07-27 13:05:54.249876: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-07-27 13:05:54.338972: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-27 13:05:56.137177: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
[0727 13:06:01 @predict.py:125] Inference output for ../data/training_data/COCO/train2014/COCO_train2014_000000000009.jpg written to output.png
: cannot connect to X server
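
The last two lines of the cluster log suggest that output.png was written and that something then tried to open a window on a node with no X server. Below is a minimal sketch of a guard that checks for a display before attempting to show an image. It is not tensorpack code: the helper names `can_show_window` and `describe_output` are hypothetical, and treating `DISPLAY` / `WAYLAND_DISPLAY` as the signal for an available display is an assumption that holds on typical Linux setups only.

```python
import os


def can_show_window(environ=None):
    """Return True if a graphical display appears to be available.

    Assumption: on Linux, an X11 or Wayland session exports DISPLAY or
    WAYLAND_DISPLAY. A headless cluster node usually has neither.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


def describe_output(path, environ=None):
    """Say what to do with the written image (hypothetical helper)."""
    if can_show_window(environ):
        return "display available; {} can also be shown in a window".format(path)
    return "no display found; copy {} to another machine to view it".format(path)


if __name__ == "__main__":
    # Uses the real environment of the current process.
    print(describe_output("output.png"))
```

On a headless node the written file is the only result to inspect, so the X server message by itself does not show whether any boxes were drawn on output.png.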

Logs from my laptop

2019-07-27 13:02:17.226940: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0727 13:02:17.947665 139936397661824 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2019-07-27 13:02:18.234661: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-27 13:02:18.243914: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:18.244432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
2019-07-27 13:02:18.244465: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-07-27 13:02:18.245883: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-07-27 13:02:18.247235: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-07-27 13:02:18.247531: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-07-27 13:02:18.248976: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-07-27 13:02:18.249818: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-07-27 13:02:18.252726: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-27 13:02:18.252861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:18.253515: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:18.254004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
[0727 13:02:18 @config.py:325] Config: ------------------------------------------
{'BACKBONE': {'FREEZE_AFFINE': False,
              'FREEZE_AT': 2,
              'NORM': 'FreezeBN',
              'RESNET_NUM_BLOCKS': [3, 4, 6, 3],
              'STRIDE_1X1': False,
              'TF_PAD_MODE': False,
              'WEIGHTS': '/home/d/Documents/Code/Lab/face/data/weights/ImageNet-R50-GroupNorm32-AlignPadding.npz'},
 'CASCADE': {'BBOX_REG_WEIGHTS': [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0],
                                  [30.0, 30.0, 15.0, 15.0]],
             'IOUS': [0.5, 0.6, 0.7]},
 'DATA': {'ABSOLUTE_COORD': True,
          'BASEDIR': '/home/d/Documents/Code/Lab/face/data/training_data/COCO',
          'CLASS_NAMES': ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
                          'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
                          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
                          'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
                          'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
                          'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
                          'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
                          'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
                          'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
                          'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
                          'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
                          'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
                          'hair drier', 'toothbrush'],
          'NUM_CATEGORY': 80,
          'NUM_WORKERS': 10,
          'TRAIN': ('coco_train2014', 'coco_valminusminival2014'),
          'VAL': ('coco_minival2014',)},
 'FPN': {'ANCHOR_STRIDES': (4, 8, 16, 32, 64),
         'CASCADE': False,
         'FRCNN_CONV_HEAD_DIM': 256,
         'FRCNN_FC_HEAD_DIM': 1024,
         'FRCNN_HEAD_FUNC': 'fastrcnn_2fc_head',
         'MRCNN_HEAD_FUNC': 'maskrcnn_up4conv_head',
         'NORM': 'None',
         'NUM_CHANNEL': 256,
         'PROPOSAL_MODE': 'Level',
         'RESOLUTION_REQUIREMENT': 32},
 'FRCNN': {'BATCH_PER_IM': 512,
           'BBOX_REG_WEIGHTS': [10.0, 10.0, 5.0, 5.0],
           'FG_RATIO': 0.25,
           'FG_THRESH': 0.5},
 'MODE_FPN': False,
 'MODE_MASK': False,
 'MRCNN': {'HEAD_DIM': 256},
 'PREPROC': {'MAX_SIZE': 1333,
             'PIXEL_MEAN': [123.675, 116.28, 103.53],
             'PIXEL_STD': [58.395, 57.12, 57.375],
             'TEST_SHORT_EDGE_SIZE': 800,
             'TRAIN_SHORT_EDGE_SIZE': [800, 800]},
 'RPN': {'ANCHOR_RATIOS': (0.5, 1.0, 2.0),
         'ANCHOR_SIZES': (32, 64, 128, 256, 512),
         'ANCHOR_STRIDE': 16,
         'BATCH_PER_IM': 256,
         'CROWD_OVERLAP_THRESH': 9.99,
         'FG_RATIO': 0.5,
         'HEAD_DIM': 1024,
         'MIN_SIZE': 0,
         'NEGATIVE_ANCHOR_THRESH': 0.3,
         'NUM_ANCHOR': 15,
         'POSITIVE_ANCHOR_THRESH': 0.7,
         'PROPOSAL_NMS_THRESH': 0.7,
         'TEST_PER_LEVEL_NMS_TOPK': 1000,
         'TEST_POST_NMS_TOPK': 1000,
         'TEST_PRE_NMS_TOPK': 6000,
         'TRAIN_PER_LEVEL_NMS_TOPK': 2000,
         'TRAIN_POST_NMS_TOPK': 2000,
         'TRAIN_PRE_NMS_TOPK': 12000},
 'TEST': {'FRCNN_NMS_THRESH': 0.5,
          'RESULTS_PER_IM': 100,
          'RESULT_SCORE_THRESH': 0.05,
          'RESULT_SCORE_THRESH_VIS': 0.5},
 'TRAIN': {'BASE_LR': 0.01,
           'EVAL_PERIOD': 25,
           'LR_SCHEDULE': [120000, 160000, 180000],
           'NUM_GPUS': 1,
           'STARTING_EPOCH': 1,
           'STEPS_PER_EPOCH': 500,
           'WARMUP': 1000,
           'WARMUP_INIT_LR': 0.0033000000000000004,
           'WEIGHT_DECAY': 0.0001},
 'TRAINER': 'replicated'}
[0727 13:02:18 @sesscreate.py:38] WRN User-provided custom session config may not work due to TF bugs. See https://github.com/tensorpack/tensorpack/issues/497 for workarounds.
[0727 13:02:18 @registry.py:90] 'conv0': [1, 3, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'pool0': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block0/conv1': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block0/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block0/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block0/convshortcut': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block1/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block1/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block1/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block2/conv1': [1, 256, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block2/conv2': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0727 13:02:18 @registry.py:90] 'group0/block2/conv3': [1, 64, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:18 @registry.py:90] 'group1/block0/conv1': [1, 256, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:18 @registry.py:90] 'group1/block0/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block0/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block0/convshortcut': [1, 256, ?, ?] --> [1, 512, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block1/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block1/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block1/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block2/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block2/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block2/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block3/conv1': [1, 512, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block3/conv2': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0727 13:02:19 @registry.py:90] 'group1/block3/conv3': [1, 128, ?, ?] --> [1, 512, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block0/conv1': [1, 512, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block0/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block0/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block0/convshortcut': [1, 512, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block1/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block1/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block1/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block2/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block2/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block2/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block3/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block3/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block3/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block4/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block4/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block4/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block5/conv1': [1, 1024, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:19 @registry.py:90] 'group2/block5/conv2': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0727 13:02:20 @registry.py:90] 'group2/block5/conv3': [1, 256, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:20 @registry.py:80] 'rpn' input: [1, 1024, ?, ?]
[0727 13:02:20 @registry.py:90]   'rpn/conv0': [1, 1024, ?, ?] --> [1, 1024, ?, ?]
[0727 13:02:20 @registry.py:90]   'rpn/class': [1, 1024, ?, ?] --> [1, 15, ?, ?]
[0727 13:02:20 @registry.py:90]   'rpn/box': [1, 1024, ?, ?] --> [1, 60, ?, ?]
[0727 13:02:20 @registry.py:93] 'rpn' output: [?, ?, 15], [?, ?, 15, 4]
[0727 13:02:20 @registry.py:90] 'group3/block0/conv1': [?, 1024, 14, 14] --> [?, 512, 14, 14]
[0727 13:02:20 @registry.py:90] 'group3/block0/conv2': [?, 512, 15, 15] --> [?, 512, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block0/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block0/convshortcut': [?, 1024, 13, 13] --> [?, 2048, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block1/conv1': [?, 2048, 7, 7] --> [?, 512, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block1/conv2': [?, 512, 7, 7] --> [?, 512, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block1/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block2/conv1': [?, 2048, 7, 7] --> [?, 512, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block2/conv2': [?, 512, 7, 7] --> [?, 512, 7, 7]
[0727 13:02:20 @registry.py:90] 'group3/block2/conv3': [?, 512, 7, 7] --> [?, 2048, 7, 7]
[0727 13:02:20 @registry.py:90] 'gap': [?, 2048, 7, 7] --> [?, 2048]
[0727 13:02:20 @registry.py:80] 'fastrcnn' input: [?, 2048]
[0727 13:02:20 @registry.py:90]   'fastrcnn/class': [?, 2048] --> [?, 81]
[0727 13:02:20 @registry.py:90]   'fastrcnn/box': [?, 2048] --> [?, 324]
[0727 13:02:20 @registry.py:93] 'fastrcnn' output: [?, 81], [?, 81, 4]
[0727 13:02:20 @collection.py:146] New collections created in tower : tf.GraphKeys.MODEL_VARIABLES of size 55
[0727 13:02:20 @sessinit.py:87] WRN The following variables are in the checkpoint, but not found in the graph: global_step, learning_rate
2019-07-27 13:02:20.926858: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2809000000 Hz
2019-07-27 13:02:20.927470: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5640a156c010 executing computations on platform Host. Devices:
2019-07-27 13:02:20.927485: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-27 13:02:20.927693: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:20.928219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
2019-07-27 13:02:20.928264: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-07-27 13:02:20.928320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-07-27 13:02:20.928356: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-07-27 13:02:20.928372: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-07-27 13:02:20.928392: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-07-27 13:02:20.928425: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-07-27 13:02:20.928456: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-27 13:02:20.928529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:20.929059: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:20.929427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-27 13:02:20.929469: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-07-27 13:02:21.367982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-27 13:02:21.368021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-27 13:02:21.368045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-27 13:02:21.368201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:21.368501: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:21.368769: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-27 13:02:21.369021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4001 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-07-27 13:02:21.370466: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5640b3830c10 executing computations on platform CUDA. Devices:
2019-07-27 13:02:21.370479: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1050, Compute Capability 6.1
[0727 13:02:22 @sessinit.py:114] Restoring checkpoint from ../data/tensorpack_logs/model-300000 ...
2019-07-27 13:02:23.002124: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-07-27 13:02:23.032603: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-07-27 13:02:23.221041: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
[0727 13:02:24 @predict.py:126] Inference output for ../data/training_data/COCO/train2014/COCO_train2014_000000000009.jpg written to output.png

(2) Other observations, if any:

I ran prediction on many images from the COCO training dataset but there are no results from the line:

    results = predict_image(img, pred_func)

in predict.py.

I checked by making viz.py log a message if there was nothing in the prediction:

def draw_final_outputs(img, results):
    """
    Args:
        results: [DetectionResult]
    """
    if len(results) == 0:
        logger.info("AAAAAAAAAAAAAAAAAAAAAa")
        return img

I removed this bit of code for logs.
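A less intrusive way to run the same check is to summarize the list returned by predict_image before it is drawn, instead of editing viz.py. The sketch below is only an illustration: the DetectionResult namedtuple is a stand-in whose field names are assumed, and summarize_results is a hypothetical helper, not part of the example.

```python
import logging
from collections import namedtuple

# Stand-in for the example's DetectionResult; the field names are an assumption.
DetectionResult = namedtuple("DetectionResult", ["box", "score", "class_id", "mask"])


def summarize_results(results, logger=logging.getLogger(__name__)):
    """Return a one-line summary of a list of detections (hypothetical helper).

    Logs a warning when the list is empty, so a silent "no detections"
    outcome is visible without touching the drawing code.
    """
    if not results:
        logger.warning("Prediction returned no detections")
        return "0 detections"
    best = max(r.score for r in results)
    return "{} detections, best score {:.2f}".format(len(results), best)
```

In predict.py this would be called right after `results = predict_image(img, pred_func)`, for example `print(summarize_results(results))`.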

3. What you expected, if not obvious.

I expected that a model trained from the given pretrained weights (ImageNet-R50-GroupNorm32-AlignPadding.npz in this case) would be able to produce some predictions (even if bad) on the images it had trained on for 24 hours. However, there is no output whatsoever for any image I've tried on either computer.

4. Your environment:

GPU cluster

--------------------  -----------------------------------------------------------
sys.platform          linux
Python                3.6.7 (default, Jun 28 2019, 11:58:01) [GCC 5.4.0 20160609]
Tensorpack            v0.9.6-0-g34e8d81
Numpy                 1.16.4
TensorFlow            1.14.0/v1.14.0-rc1-22-gaf24dc91b5
TF Compiler Version   4.8.5
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/nvidia-410/libnvidia-ml.so.410.79
CUDA                  /usr/lib/x86_64-linux-gnu/libcudart.so.7.5.18
CUDNN                 /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1
NCCL
CUDA_VISIBLE_DEVICES  None
GPU 0                 TITAN Xp
Free RAM              218.79/251.89 GB
CPU Count             64
cv2                   4.1.0
msgpack               0.6.1
python-prctl          False
--------------------  -----------------------------------------------------------

My laptop:

2019-07-27 12:54:24.190639: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0727 12:54:24.919948 139817241790080 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
--------------------  --------------------------------------------------------
sys.platform          linux
Python                3.7.3 (default, Jun 24 2019, 04:54:02) [GCC 9.1.0]
Tensorpack            v0.9.6-3-g242dc71c-dirty
Numpy                 1.16.4
TensorFlow            1.14.0/unknown
TF Compiler Version   8.3.0
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/libnvidia-ml.so.430.34
CUDA                  /opt/cuda/targets/x86_64-linux/lib/libcudart.so.10.1.168
CUDNN                 /usr/lib/libcudnn.so.7.6.1
NCCL                  /usr/lib/libnccl.so.2.4.8
CUDA_VISIBLE_DEVICES  None
GPU 0                 GeForce GTX 1050
Free RAM              7.22/15.52 GB
CPU Count             8
cv2                   4.1.0
msgpack               0.6.1
python-prctl          True
--------------------  --------------------------------------------------------

Although I trained for a day, I did notice that the logs said ~7 days was expected for training to complete. Is that really what's required to get any sort of predictions at all? I just want to make sure the example is working.

ppwwyyxx commented 5 years ago

The README clearly says that you need to pass in the correct config items that were used during training, which you seem to have missed. If you did not change any config in training, you should not load the model ImageNet-R50-GroupNorm32-AlignPadding.npz at all, because it needs a different set of configs.

ghost commented 5 years ago

Sorry for not elaborating on what my configuration is. I think it's best to just paste everything I changed here:

_C.MODE_MASK = False  # FasterRCNN or MaskRCNN

_C.DATA.BASEDIR = ".../data/training_data/COCO"
_C.BACKBONE.WEIGHTS = ".../data/weights/ImageNet-R50-GroupNorm32-AlignPadding.npz"

Btw I used absolute paths but shortened them above.

So I'm pretty sure my config was not changed between training and prediction.

But I see what you are saying. Is this (from the README):

MODE_FPN=True
FPN.NORM=GN
BACKBONE.NORM=GN
FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head
FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head
TRAIN.LR_SCHEDULE=[240000,320000,360000]

what you mean by needing a different set of configs? Minus the FPN stuff?

ppwwyyxx commented 5 years ago

Since you load a GroupNorm backbone, at least you have to set BACKBONE.NORM=GN. Loading weights from one model to a different model will usually produce garbage outputs.

Whether you want to change other configs is up to you. But at least this will give you a valid training setting.

You can also start with other backbones in the model zoo that do not use GroupNorm.
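One way to catch this kind of mismatch before training is to look at the variable names stored in the backbone .npz file. The sketch below rests on an assumption that is not confirmed in this thread: that GroupNorm parameters live under a scope named gn and BatchNorm parameters under a scope named bn. The helper names are hypothetical, and the BACKBONE.NORM values other than GN (FreezeBN, SyncBN) are assumed to be the BatchNorm variants.

```python
def npz_variable_names(path):
    """List the array names stored in an .npz weight file."""
    import numpy as np  # imported lazily so the helpers below need no dependencies

    with np.load(path) as archive:
        return list(archive.files)


def guess_norm_family(var_names):
    """Guess "GN", "BN" or "unknown" from variable names.

    Assumption: names look like "some/scope/gn/gamma" or "some/scope/bn/gamma".
    """
    parts = {part for name in var_names for part in name.split("/")}
    has_gn = "gn" in parts
    has_bn = "bn" in parts
    if has_gn and not has_bn:
        return "GN"
    if has_bn and not has_gn:
        return "BN"
    return "unknown"


def norm_matches(var_names, configured_norm):
    """Return False only when the weights clearly contradict BACKBONE.NORM."""
    family = guess_norm_family(var_names)
    if family == "unknown":
        return True  # cannot tell, so do not raise a false alarm
    configured_family = "GN" if configured_norm == "GN" else "BN"
    return family == configured_family
```

For the case in this issue, `norm_matches(npz_variable_names(".../ImageNet-R50-GroupNorm32-AlignPadding.npz"), "FreezeBN")` would be expected to return False if the naming assumption holds, signalling that BACKBONE.NORM=GN is needed.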

ghost commented 5 years ago

Thank you so much for your help.

ppwwyyxx commented 5 years ago

Whether you want to change other configs is up to you.

Despite this, if you're not very familiar with the models, it would be better to use one of the reasonable configs in the table instead of making up a new one.

ghost commented 5 years ago

When you pointed out the weights I was incorrectly using, I suddenly realized what "GN" meant, and the table also became very clear to me. It may not be necessary for most people, but it would be nice for newbies like me if that were mentioned in the README.