FasterRCNN example - prediction yields no results? #1289

Closed ghost closed 5 years ago

ghost commented 5 years ago

1. What you did:

(1) If you're using examples, what's the command you run: python predict.py --predict ../data/training_data/COCO/train2014/COCO_train2014_000000000009.jpg --load ../data/tensorpack_logs/checkpoint

(2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

I used the FasterRCNN example as of 242dc71cafb9642e68a2bfb58bcf6ad45ccbb35c, only changing the directories.

2. What you observed:

Logs from GPU cluster I trained on

: cannot connect to X server

Logs from my laptop

(2) Other observations, if any:

I ran prediction on many images from the COCO training dataset but there are no results from the line:

    results = predict_image(img, pred_func)

in predict.py.

I checked by making viz.py log a message if there was nothing in the prediction:

    def draw_final_outputs(img, results):
        """
        Args:
            results: [DetectionResult]
        """
        if len(results) == 0:
            return img

I removed this bit of code for logs.
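For anyone wanting to repeat this check, a minimal sketch is below. It is not the code from viz.py: the helper `report_detections`, the logger name, and the fields of the stand-in `DetectionResult` are all assumptions made for illustration.

```python
import logging
from collections import namedtuple

# Stand-in for the example's DetectionResult; these field names are an assumption.
DetectionResult = namedtuple("DetectionResult", ["box", "score", "class_id", "mask"])

logger = logging.getLogger("viz_debug")  # hypothetical logger name


def report_detections(results):
    """Log whether any detections came back and return how many there were."""
    if len(results) == 0:
        logger.warning("No detections returned for this image.")
    else:
        logger.info("Got %d detection(s).", len(results))
    return len(results)
```

Calling something like this on the return value of `predict_image(img, pred_func)` would show whether the list is empty before any drawing happens.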

3. What you expected, if not obvious.

I expected that a model trained from the given pretrained weights (ImageNet-R50-GroupNorm32-AlignPadding.npz in this case) would produce some predictions (even bad ones) on the images it was trained on for 24 hours. However, there is no output whatsoever for any image I've tried, on either computer.

4. Your environment:

GPU cluster

--------------------  -----------------------------------------------------------
sys.platform          linux
Python                3.6.7 (default, Jun 28 2019, 11:58:01) [GCC 5.4.0 20160609]
Tensorpack            v0.9.6-0-g34e8d81
Numpy                 1.16.4
TensorFlow            1.14.0/v1.14.0-rc1-22-gaf24dc91b5
TF Compiler Version   4.8.5
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/nvidia-410/libnvidia-ml.so.410.79
CUDA                  /usr/lib/x86_64-linux-gnu/libcudart.so.7.5.18
CUDNN                 /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1
GPU 0                 TITAN Xp
Free RAM              218.79/251.89 GB
CPU Count             64
cv2                   4.1.0
msgpack               0.6.1
python-prctl          False
--------------------  -----------------------------------------------------------

My laptop:

2019-07-27 12:54:24.190639: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0727 12:54:24.919948 139817241790080 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
--------------------  --------------------------------------------------------
sys.platform          linux
Python                3.7.3 (default, Jun 24 2019, 04:54:02) [GCC 9.1.0]
Tensorpack            v0.9.6-3-g242dc71c-dirty
Numpy                 1.16.4
TensorFlow            1.14.0/unknown
TF Compiler Version   8.3.0
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/libnvidia-ml.so.430.34
CUDA                  /opt/cuda/targets/x86_64-linux/lib/libcudart.so.10.1.168
CUDNN                 /usr/lib/libcudnn.so.7.6.1
NCCL                  /usr/lib/libnccl.so.2.4.8
GPU 0                 GeForce GTX 1050
Free RAM              7.22/15.52 GB
CPU Count             8
cv2                   4.1.0
msgpack               0.6.1
python-prctl          True
--------------------  --------------------------------------------------------

Although I trained for a day, I did notice that the logs said ~7 days was expected for training to complete. Is that really what's required to get any sort of predictions at all? I just want to make sure the example is working.

ppwwyyxx commented 5 years ago

The README clearly says that at prediction time you need to pass in the same config items that were used during training, which you seem to have missed. If you did not change any config in training, you should not load the model ImageNet-R50-GroupNorm32-AlignPadding.npz at all, because it needs a different set of configs.

ghost commented 5 years ago

Sorry for not elaborating on what my configuration is. I think it's best to just paste everything I changed here:

_C.MODE_MASK = False  # FasterRCNN or MaskRCNN

_C.DATA.BASEDIR = ".../data/training_data/COCO"
_C.BACKBONE.WEIGHTS = ".../data/weights/ImageNet-R50-GroupNorm32-AlignPadding.npz"

Btw I used absolute paths but shortened them above.

So I'm pretty sure my config was not changed between training and prediction.

But I see what you are saying. Is this (from the README):


what you mean by needing a different set of configs? Minus the FPN stuff?

ppwwyyxx commented 5 years ago

Since you load a GroupNorm backbone, at least you have to set BACKBONE.NORM=GN. Loading weights from one model to a different model will usually produce garbage outputs.

Whether you want to change other configs is up to you. But at least this will give you a valid training setting.

You can also start with other backbones in the model zoo that do not use GroupNorm.
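To illustrate the point about loading weights from one model into a different model, here is a minimal sketch that compares the variable names a model expects with the names stored in a weights file. It assumes the two normalization layers create differently named variables. The names below are made up for illustration and were not read from ImageNet-R50-GroupNorm32-AlignPadding.npz, and `compare_variable_names` is a hypothetical helper, not part of tensorpack.

```python
def compare_variable_names(model_vars, checkpoint_vars):
    """Return (missing, unused).

    missing: names the model expects but the checkpoint does not contain.
    unused:  names in the checkpoint that the model never reads.
    """
    model_vars, checkpoint_vars = set(model_vars), set(checkpoint_vars)
    missing = sorted(model_vars - checkpoint_vars)
    unused = sorted(checkpoint_vars - model_vars)
    return missing, unused


# Illustrative names only: a model built with one norm layer, weights saved with another.
model_vars = ["conv0/W", "conv0/bn/gamma", "conv0/bn/beta"]
checkpoint_vars = ["conv0/W", "conv0/gn/gamma", "conv0/gn/beta"]

missing, unused = compare_variable_names(model_vars, checkpoint_vars)
print("missing from checkpoint:", missing)
print("unused in checkpoint:", unused)
```

If both lists are non-empty for the normalization variables, the backbone and the config do not describe the same model. In this thread the matching setting is BACKBONE.NORM=GN, used for both training and prediction.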

ghost commented 5 years ago

Thank you so much for your help.

ppwwyyxx commented 5 years ago

> Whether you want to change other configs is up to you.

Despite this, if you're not very familiar with the models, it would be better to use one of the reasonable configs in the table instead of making up a new one.

ghost commented 5 years ago

When you pointed out the weights I was incorrectly using, I suddenly realized what "GN" meant, and the table also became very clear to me. Not sure if necessary for most, but it would be nice for newbies like me if that was mentioned in the README.