msracver / Flow-Guided-Feature-Aggregation

Flow-Guided Feature Aggregation for Video Object Detection

Why is the test network different from the training network? #37

Open FCInter opened 5 years ago

FCInter commented 5 years ago

I read the code and found that the training and test networks differ in structure. In particular, in the training network the embedding layers are applied after concatenating the reference frame's conv feature with the flow-warped conv features, as shown in the following code:

# In function get_train_symbol()
# conv_feat is the ResNet-101 output; warp_conv_feat_1 and warp_conv_feat_2 are the
# neighbouring frames' conv features warped to the reference frame using the FlowNet flow
concat_embed_data = mx.symbol.Concat(*[conv_feat[0], warp_conv_feat_1, warp_conv_feat_2], dim=0) 
embed_output = self.get_embednet(concat_embed_data) 

However, in the test network the structure is quite different. The embedding layers are applied only to the output of ResNet-101, and the embedding output is then concatenated with the ResNet feature itself; the FlowNet-based warping is applied later to this cached concatenation, as shown in the following code:

# In function get_feat_symbol()
conv_feat = self.get_resnet_v1(data)
embed_feat = self.get_embednet(conv_feat) # embedding network is added right after ResNet
conv_embed = mx.sym.Concat(conv_feat, embed_feat, name="conv_embed")
... # some code omitted
# In function get_aggregation_symbol()
# flow is the output of FlowNet; feat_cache is the cached conv_embed (conv feature + embedding)
flow_grid = mx.sym.GridGenerator(data=flow, transform_type='warp', name='flow_grid')
conv_feat = mx.sym.BilinearSampler(data=feat_cache, grid=flow_grid, name='warping_feat')
embed_output = mx.symbol.slice_axis(conv_feat, axis=1, begin=1024, end=3072)

Why are the networks so different between training and test?
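
For reference, my understanding of how the test graph is meant to be used is sketched below. This is a rough, standalone sketch, not code from the repo; the 1024/2048 channel split is inferred from the slice_axis(begin=1024, end=3072) call above, so treat the exact sizes as assumptions. The idea seems to be that conv_embed is computed once per frame and cached, so each aggregation step only needs the cheap warp-and-slice:

import mxnet as mx

# Hypothetical shapes: a 1024-channel ResNet feature plus a 2048-channel embedding
# (inferred from the slice_axis call above, so these numbers are assumptions).
H, W = 38, 63
conv_feat = mx.nd.ones((1, 1024, H, W))     # per-frame ResNet output
embed_feat = mx.nd.ones((1, 2048, H, W))    # per-frame embedding output
feat_cache = mx.nd.concat(conv_feat, embed_feat, dim=1)   # cached once per frame

# Per aggregation step: warp a neighbouring frame's cached features to the
# current frame using its (2, H, W) flow field, then split the channels again.
flow = mx.nd.zeros((1, 2, H, W))                           # dummy flow field
grid = mx.nd.GridGenerator(data=flow, transform_type='warp')
warped = mx.nd.BilinearSampler(data=feat_cache, grid=grid)
warped_conv = mx.nd.slice_axis(warped, axis=1, begin=0, end=1024)
warped_embed = mx.nd.slice_axis(warped, axis=1, begin=1024, end=3072)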

Moreover, while training the network the training accuracy stays high (around 0.98), but when I test the model on the demo images I get terrible results.

I cloned the project and configured all the data and model paths following the instructions. I then tried to continue training from the demo model, i.e.

./model/rfcn_fgfa_flownet_vid-0000.params

which I downloaded from the OneDrive URL provided in the README.

The logs look normal, as follows:

Epoch[0] Batch [100]    Speed: 1.04 samples/sec Train-RPNAcc=0.938544,  RPNLogLoss=0.207737,    RPNL1Loss=0.354940,     RCNNAcc=0.842822,       RCNNLogLoss=1.380372,   RCNNL1Loss=0.245006,
Epoch[0] Batch [200]    Speed: 1.04 samples/sec Train-RPNAcc=0.954699,  RPNLogLoss=0.151813,    RPNL1Loss=0.258118,     RCNNAcc=0.815532,       RCNNLogLoss=1.306230,   RCNNL1Loss=0.368251,
Epoch[0] Batch [300]    Speed: 1.05 samples/sec Train-RPNAcc=0.963273,  RPNLogLoss=0.122710,    RPNL1Loss=0.220691,     RCNNAcc=0.805233,       RCNNLogLoss=1.272574,   RCNNL1Loss=0.387604,
Epoch[0] Batch [400]    Speed: 1.07 samples/sec Train-RPNAcc=0.967464,  RPNLogLoss=0.108166,    RPNL1Loss=0.203036,     RCNNAcc=0.800499,       RCNNLogLoss=1.245192,   RCNNL1Loss=0.370216,
Epoch[0] Batch [500]    Speed: 1.05 samples/sec Train-RPNAcc=0.971533,  RPNLogLoss=0.095719,    RPNL1Loss=0.188637,     RCNNAcc=0.796875,       RCNNLogLoss=1.218545,   RCNNL1Loss=0.347318,
Epoch[0] Batch [600]    Speed: 1.07 samples/sec Train-RPNAcc=0.974242,  RPNLogLoss=0.087136,    RPNL1Loss=0.177525,     RCNNAcc=0.797629,       RCNNLogLoss=1.187403,   RCNNL1Loss=0.326912,
Epoch[0] Batch [700]    Speed: 1.07 samples/sec Train-RPNAcc=0.976512,  RPNLogLoss=0.079193,    RPNL1Loss=0.166786,     RCNNAcc=0.799628,       RCNNLogLoss=1.147833,   RCNNL1Loss=0.308086,
Epoch[0] Batch [800]    Speed: 1.06 samples/sec Train-RPNAcc=0.977943,  RPNLogLoss=0.074186,    RPNL1Loss=0.164597,     RCNNAcc=0.800503,       RCNNLogLoss=1.122886,   RCNNL1Loss=0.292854,
Epoch[0] Batch [900]    Speed: 1.07 samples/sec Train-RPNAcc=0.978635,  RPNLogLoss=0.071338,    RPNL1Loss=0.158227,     RCNNAcc=0.798418,       RCNNLogLoss=1.113350,   RCNNL1Loss=0.283827,
Epoch[0] Batch [1000]   Speed: 1.05 samples/sec Train-RPNAcc=0.979403,  RPNLogLoss=0.068680,    RPNL1Loss=0.151551,     RCNNAcc=0.800332,       RCNNLogLoss=1.086266,   RCNNL1Loss=0.271887,
Epoch[0] Batch [1100]   Speed: 1.07 samples/sec Train-RPNAcc=0.980064,  RPNLogLoss=0.066033,    RPNL1Loss=0.149408,     RCNNAcc=0.798940,       RCNNLogLoss=1.076426,   RCNNL1Loss=0.264026,
Epoch[0] Batch [1200]   Speed: 1.06 samples/sec Train-RPNAcc=0.980735,  RPNLogLoss=0.063761,    RPNL1Loss=0.144844,     RCNNAcc=0.797968,       RCNNLogLoss=1.063309,   RCNNL1Loss=0.260105,
Epoch[0] Batch [1300]   Speed: 1.05 samples/sec Train-RPNAcc=0.981171,  RPNLogLoss=0.062085,    RPNL1Loss=0.142027,     RCNNAcc=0.798472,       RCNNLogLoss=1.046781,   RCNNL1Loss=0.254449,
Epoch[0] Batch [1400]   Speed: 1.07 samples/sec Train-RPNAcc=0.981467,  RPNLogLoss=0.061773,    RPNL1Loss=0.138603,     RCNNAcc=0.801297,       RCNNLogLoss=1.017234,   RCNNL1Loss=0.248268,
Epoch[0] Batch [1500]   Speed: 1.06 samples/sec Train-RPNAcc=0.981986,  RPNLogLoss=0.060457,    RPNL1Loss=0.135045,     RCNNAcc=0.803818,       RCNNLogLoss=0.991378,   RCNNL1Loss=0.243015,
Epoch[0] Batch [1600]   Speed: 1.05 samples/sec Train-RPNAcc=0.982362,  RPNLogLoss=0.059222,    RPNL1Loss=0.132028,     RCNNAcc=0.805146,       RCNNLogLoss=0.970428,   RCNNL1Loss=0.242290,
Epoch[0] Batch [1700]   Speed: 1.05 samples/sec Train-RPNAcc=0.982717,  RPNLogLoss=0.058142,    RPNL1Loss=0.131539,     RCNNAcc=0.808362,       RCNNLogLoss=0.943044,   RCNNL1Loss=0.237796,
Epoch[0] Batch [1800]   Speed: 1.07 samples/sec Train-RPNAcc=0.982972,  RPNLogLoss=0.057485,    RPNL1Loss=0.130012,     RCNNAcc=0.810964,       RCNNLogLoss=0.919282,   RCNNL1Loss=0.234166,
Epoch[0] Batch [1900]   Speed: 1.07 samples/sec Train-RPNAcc=0.983284,  RPNLogLoss=0.056381,    RPNL1Loss=0.127654,     RCNNAcc=0.813182,       RCNNLogLoss=0.898093,   RCNNL1Loss=0.231330,
Epoch[0] Batch [2000]   Speed: 1.06 samples/sec Train-RPNAcc=0.983577,  RPNLogLoss=0.055477,    RPNL1Loss=0.125975,     RCNNAcc=0.816303,       RCNNLogLoss=0.874262,   RCNNL1Loss=0.228761,
Epoch[0] Batch [2100]   Speed: 1.06 samples/sec Train-RPNAcc=0.983429,  RPNLogLoss=0.055541,    RPNL1Loss=0.124450,     RCNNAcc=0.818520,       RCNNLogLoss=0.854937,   RCNNL1Loss=0.227023,
Epoch[0] Batch [2200]   Speed: 1.06 samples/sec Train-RPNAcc=0.983559,  RPNLogLoss=0.055248,    RPNL1Loss=0.123187,     RCNNAcc=0.821203,       RCNNLogLoss=0.835192,   RCNNL1Loss=0.223909,
Epoch[0] Batch [2300]   Speed: 1.06 samples/sec Train-RPNAcc=0.983856,  RPNLogLoss=0.054266,    RPNL1Loss=0.122544,     RCNNAcc=0.822937,       RCNNLogLoss=0.818458,   RCNNL1Loss=0.222563,
Epoch[0] Batch [2400]   Speed: 1.06 samples/sec Train-RPNAcc=0.983945,  RPNLogLoss=0.054052,    RPNL1Loss=0.120982,     RCNNAcc=0.823651,       RCNNLogLoss=0.804805,   RCNNL1Loss=0.222604,
Epoch[0] Batch [2500]   Speed: 1.06 samples/sec Train-RPNAcc=0.984273,  RPNLogLoss=0.053099,    RPNL1Loss=0.119158,     RCNNAcc=0.824958,       RCNNLogLoss=0.790649,   RCNNL1Loss=0.220705,
Epoch[0] Batch [2600]   Speed: 1.07 samples/sec Train-RPNAcc=0.984386,  RPNLogLoss=0.052425,    RPNL1Loss=0.118785,     RCNNAcc=0.827017,       RCNNLogLoss=0.774911,   RCNNL1Loss=0.218256,
Epoch[0] Batch [2700]   Speed: 1.05 samples/sec Train-RPNAcc=0.984553,  RPNLogLoss=0.051816,    RPNL1Loss=0.117867,     RCNNAcc=0.827972,       RCNNLogLoss=0.762744,   RCNNL1Loss=0.217780,
Epoch[0] Batch [2800]   Speed: 1.06 samples/sec Train-RPNAcc=0.984407,  RPNLogLoss=0.052380,    RPNL1Loss=0.118125,     RCNNAcc=0.829447,       RCNNLogLoss=0.749248,   RCNNL1Loss=0.216280,
Epoch[0] Batch [2900]   Speed: 1.06 samples/sec Train-RPNAcc=0.984538,  RPNLogLoss=0.051778,    RPNL1Loss=0.116990,     RCNNAcc=0.831085,       RCNNLogLoss=0.736197,   RCNNL1Loss=0.214425,
Epoch[0] Batch [3000]   Speed: 1.07 samples/sec Train-RPNAcc=0.984686,  RPNLogLoss=0.051383,    RPNL1Loss=0.116145,     RCNNAcc=0.832022,       RCNNLogLoss=0.726220,   RCNNL1Loss=0.214046,
Epoch[0] Batch [3100]   Speed: 1.05 samples/sec Train-RPNAcc=0.984793,  RPNLogLoss=0.051247,    RPNL1Loss=0.115204,     RCNNAcc=0.833453,       RCNNLogLoss=0.715207,   RCNNL1Loss=0.212304,
Epoch[0] Batch [3200]   Speed: 1.06 samples/sec Train-RPNAcc=0.984917,  RPNLogLoss=0.050870,    RPNL1Loss=0.114867,     RCNNAcc=0.834683,       RCNNLogLoss=0.703953,   RCNNL1Loss=0.210789,
Epoch[0] Batch [3300]   Speed: 1.06 samples/sec Train-RPNAcc=0.984955,  RPNLogLoss=0.050687,    RPNL1Loss=0.114899,     RCNNAcc=0.835670,       RCNNLogLoss=0.694319,   RCNNL1Loss=0.209543,
Epoch[0] Batch [3400]   Speed: 1.06 samples/sec Train-RPNAcc=0.984918,  RPNLogLoss=0.050745,    RPNL1Loss=0.114710,     RCNNAcc=0.836985,       RCNNLogLoss=0.684625,   RCNNL1Loss=0.208034,
Epoch[0] Batch [3500]   Speed: 1.07 samples/sec Train-RPNAcc=0.984995,  RPNLogLoss=0.050442,    RPNL1Loss=0.113783,     RCNNAcc=0.838238,       RCNNLogLoss=0.675153,   RCNNL1Loss=0.206618,
Epoch[0] Batch [3600]   Speed: 1.07 samples/sec Train-RPNAcc=0.985046,  RPNLogLoss=0.050316,    RPNL1Loss=0.113588,     RCNNAcc=0.839296,       RCNNLogLoss=0.666437,   RCNNL1Loss=0.205760,
Epoch[0] Batch [3700]   Speed: 1.07 samples/sec Train-RPNAcc=0.984950,  RPNLogLoss=0.050420,    RPNL1Loss=0.113458,     RCNNAcc=0.840301,       RCNNLogLoss=0.658199,   RCNNL1Loss=0.204153,
Epoch[0] Batch [3800]   Speed: 1.07 samples/sec Train-RPNAcc=0.985015,  RPNLogLoss=0.050012,    RPNL1Loss=0.113324,     RCNNAcc=0.841688,       RCNNLogLoss=0.649121,   RCNNL1Loss=0.202450,
Epoch[0] Batch [3900]   Speed: 1.07 samples/sec Train-RPNAcc=0.985168,  RPNLogLoss=0.049591,    RPNL1Loss=0.112202,     RCNNAcc=0.842604,       RCNNLogLoss=0.642186,   RCNNL1Loss=0.202128,
Epoch[0] Batch [4000]   Speed: 1.06 samples/sec Train-RPNAcc=0.985318,  RPNLogLoss=0.049009,    RPNL1Loss=0.111678,     RCNNAcc=0.843418,       RCNNLogLoss=0.635388,   RCNNL1Loss=0.201368,
Epoch[0] Batch [4100]   Speed: 1.06 samples/sec Train-RPNAcc=0.985317,  RPNLogLoss=0.048885,    RPNL1Loss=0.111040,     RCNNAcc=0.844182,       RCNNLogLoss=0.628305,   RCNNL1Loss=0.200417,
Epoch[0] Batch [4200]   Speed: 1.06 samples/sec Train-RPNAcc=0.985169,  RPNLogLoss=0.049311,    RPNL1Loss=0.111755,     RCNNAcc=0.844617,       RCNNLogLoss=0.622948,   RCNNL1Loss=0.200082,
Epoch[0] Batch [4300]   Speed: 0.72 samples/sec Train-RPNAcc=0.985110,  RPNLogLoss=0.049333,    RPNL1Loss=0.112316,     RCNNAcc=0.845354,       RCNNLogLoss=0.616863,   RCNNL1Loss=0.199422,

But when I run this checkpoint on the demo images used by demo.py, I get tens of boxes overlapping each other and can hardly tell which one is good, even though the ground truth contains only 7 planes in each image. This is strange, because when I use the provided demo checkpoint I get good detection results: 7 bounding boxes, each covering a plane perfectly.

What's wrong with my training?

The only changes I made to the code are the model and data paths, as well as the GPU ids.

ZhihuaGao commented 5 years ago

I think whether the warping happens before or after the embedding doesn't matter, because the warping operation does not contain any learnable parameters. Just my personal opinion, hope it helps.

FCInter commented 5 years ago

In my test case I'm afraid it really does matter: when I build the training network and load the test checkpoint, the model does not converge well. Moreover, although the warping operation has no parameters, it still changes the feature map; that is, warping first and then embedding yields a very different feature map from embedding first and then warping.
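
A tiny standalone example (not code from the repo) of why the order can matter in principle: bilinear warping is a linear combination over spatial positions, but if I read get_embednet correctly it contains ReLU nonlinearities, and a nonlinearity does not commute with that averaging.

import numpy as np

# Two-pixel "feature map"; a half-pixel bilinear warp just averages the two pixels,
# and the "embedding" is a ReLU-style nonlinearity (illustrative, not embednet itself).
feat = np.array([0.0, 2.0])
warp = lambda f: np.array([(f[0] + f[1]) / 2.0])   # bilinear sample at x = 0.5
embed = lambda f: np.maximum(f - 1.0, 0.0)         # per-pixel nonlinearity

print(embed(warp(feat)))   # [0.]  -> warp first, then embed (training-style order)
print(warp(embed(feat)))   # [0.5] -> embed first, then warp (test-style order)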

ZhihuaGao commented 5 years ago

Really? I have trained and tested the network, and it works well... Could you show your logs?

FCInter commented 5 years ago

@AresGao I have updated the issue with the training logs. The problem is that I cannot get good results when I continue training from the demo checkpoint provided in the README. The demo checkpoint itself gives very good results, but after I continue training from it the results become terrible. I only trained for 4k iterations, but since the initial checkpoint is already good, I would not expect to need many more iterations. BTW, I'm curious why we are advised to train from the ResNet-101 and FlowNet checkpoints instead of directly from the demo checkpoint. I also tried training from the ResNet-101 and FlowNet checkpoints for 100k+ iterations, and the performance was even worse.

Thank you for your patience and kindness in helping me!

ZhihuaGao commented 5 years ago

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better. Here are the test results:

motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.8444

txf201604 commented 5 years ago

@AresGao Can you help me? I have a problem with "sh ./init.sh":

Traceback (most recent call last):
  File "setup_linux.py", line 63, in <module>
    CUDA = locate_cuda()
  File "setup_linux.py", line 58, in locate_cuda
    for k, v in cudaconfig.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'

If you can reply in time, I will be very grateful.

FCInter commented 5 years ago

@AresGao What version of mxnet are you using? I was wondering whether the problem is caused by the version, since I once hit a bug because I was using the wrong version.

ZhihuaGao commented 5 years ago

I use the latest version of mxnet @FCInter

ZhihuaGao commented 5 years ago

@txf201604 The function locate_cuda() finds where your CUDA is installed; I think you should check your CUDA location:

def locate_cuda():
    """Locate the CUDA environment on the system
    Returns a dict with keys 'home', 'nvcc', 'include', and 'lib64'
    and values giving the absolute path to each directory.
    Starts by looking for the CUDAHOME env variable. If not found, everything
    is based on finding 'nvcc' in the PATH.
    """
txf201604 commented 5 years ago

First of all, thank you very much for replying to my email. I have already solved the "sh ./init.sh" problem; it was due to the Python version, and installing Python 2.7 fixed it. But now I have a new problem that I have not been able to solve. After I set "USE_CUDA = 1" and "USE_CUDA_PATH = /usr/local/cuda" in mxnet's config.mk, the GPU build of mxnet fails as shown below.

In file included from src/operator/tensor/././sort_op.h:85:0,
                 from src/operator/tensor/./indexing_op.h:24,
                 from src/operator/tensor/indexing_op.cu:8:
src/operator/tensor/./././sort_op-inl.cuh:15:44: fatal error: cub/device/device_radix_sort.cuh: No such file or directory
 #include <cub/device/device_radix_sort.cuh>
                                            ^
compilation terminated.
Makefile:211: recipe for target 'build/src/operator/tensor/indexing_op_gpu.o' failed
make: *** [build/src/operator/tensor/indexing_op_gpu.o] Error 1
make: *** Waiting for unfinished jobs....

I can compile the CPU version of mxnet successfully, but without GPU support, running "demo.py" reports the following error:

Stack trace returned 10 entries:
[bt] (0) /home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7febb91c2ebc]
[bt] (1) /home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvoke+0x8c9) [0x7febb9f6d6e9]
[bt] (2) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7feb6c623ec0]
[bt] (3) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7feb6c62387d]
[bt] (4) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4de) [0x7feb6c83a8de]
[bt] (5) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/_ctypes.so(+0x9b31) [0x7feb6c830b31]
[bt] (6) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7febc057b973]
[bt] (7) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9) [0x7febc0611d49]
[bt] (8) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9) [0x7febc06176c9]
[bt] (9) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6a08) [0x7febc0614b98]

Traceback (most recent call last):
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/demo.py", line 257, in <module>
    main()
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/demo.py", line 155, in main
    arg_params=arg_params, aux_params=aux_params)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/tester.py", line 37, in __init__
    self._mod.bind(provide_data, provide_label, for_training=False)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/module.py", line 844, in bind
    for_training, inputs_need_grad, force_rebind=False, shared_module=None)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/module.py", line 401, in bind
    state_names=self._state_names)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/DataParallelExecutorGroup.py", line 191, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/DataParallelExecutorGroup.py", line 277, in bind_exec
    shared_group))
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/DataParallelExecutorGroup.py", line 550, in _bind_ith_exec
    context, self.logger)
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/core/DataParallelExecutorGroup.py", line 528, in _get_or_reshape
    arg_arr = nd.zeros(arg_shape, context, dtype=arg_type)
  File "/home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/ndarray.py", line 1003, in zeros
    return _internal._zeros(shape=shape, ctx=ctx, dtype=dtype)
  File "", line 15, in _zeros
  File "/home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/_ctypes/ndarray.py", line 72, in _imperative_invoke
    c_array(ctypes.c_char_p, [c_str(str(val)) for val in vals])))
  File "/home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/base.py", line 84, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:59:41] src/c_api/c_api_ndarray.cc:392: Operator _zeros cannot be run; requires at least one of FCompute, NDArrayFunction, FCreateOperator be registered
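
(Two hedged notes on the errors above, not fixes taken from this repo: the missing cub/device/device_radix_sort.cuh header usually means mxnet's cub submodule was not checked out when the source was cloned, and the "Operator _zeros cannot be run" message is what a CPU-only libmxnet.so raises when a gpu() context is requested. A minimal standalone check for the latter:)

import mxnet as mx

# Try to allocate a tiny array on GPU 0; a CPU-only build of libmxnet.so will
# raise an MXNetError much like the one in the traceback above.
try:
    mx.nd.zeros((1,), ctx=mx.gpu(0)).wait_to_read()
    print('GPU-enabled MXNet build detected')
except mx.base.MXNetError as err:
    print('CPU-only MXNet build (or no usable GPU):', err)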

FCInter commented 5 years ago

@AresGao Finally I got good results after training for 2 complete epochs!!!

I just have one last question. I noticed that when the checkpoint is saved at the end of each epoch, the following code is used to create two new weights, rfcn_bbox_weight_test and rfcn_bbox_bias_test:

arg['rfcn_bbox_weight_test'] = weight * mx.nd.repeat(mx.nd.array(stds), repeats=repeat).reshape((bias.shape[0], 1, 1, 1))
arg['rfcn_bbox_bias_test'] = arg['rfcn_bbox_bias'] * mx.nd.repeat(mx.nd.array(stds), repeats=repeat) + mx.nd.repeat(mx.nd.array(means), repeats=repeat)

Why do we need to do this? I have verified that if I skip it, the checkpoint makes terrible predictions on the test data. This also turned out to be the reason my earlier predictions were bad even though the training loss looked good. Thank you!
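
(For context, stated as an inference from the two lines above rather than as the authors' explanation: the bbox regression targets are usually normalized by per-class means/stds during training, and at save time the de-normalization is folded into the regression layer so the test graph outputs usable deltas directly. The identity being used is (w·x + b)·std + mean = (w·std)·x + (b·std + mean); a tiny numpy check with made-up numbers:)

import numpy as np

np.random.seed(0)
x = np.random.randn(8)            # a feature vector fed into the regression layer
w = np.random.randn(8)            # one row of the rfcn_bbox weights
b, std, mean = 0.3, 0.1, 0.05     # illustrative bias / BBOX_STD / BBOX_MEAN values

denorm_after = (w.dot(x) + b) * std + mean           # de-normalize the raw prediction
folded       = (w * std).dot(x) + (b * std + mean)   # fold std/mean into the layer itself
print(np.allclose(denorm_after, folded))             # True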

samanthawyf commented 5 years ago

Hi @AresGao @FCInter @YuwenXiong, I tried training and inference with this code. I used 4 GPUs and did not change any settings, and the final mAP is 75.78. I am confused about the drop in mAP. Did you change any settings, or do you have any advice for my case?

Feywell commented 5 years ago

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better. Here are the test results:

motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.8444

@AresGao Hi, I have only one GPU (a 1080 Ti) and get only mAP = 0.7389 with the default test settings. How did you get a better mAP? Could you share your settings in detail, such as epochs, min_diff/max_diff, learning rate, GPUs, test key frame, and so on? Thank you!

withinnoitatpmet commented 5 years ago

@Feywell Hi Feywell, I have tested the default setting with 2 GPUs and with 4 GPUs, and the 4-GPU result is much better than the 2-GPU one. P.S. lr = 0.00025 is equivalent to the 0.001 described in the paper; you can find more details in their code.

Feywell commented 5 years ago

@withinnoitatpmet Thank you! So if I have only one GPU, will setting lr = 0.001 be better?

withinnoitatpmet commented 5 years ago

@Feywell I think the result could be even worse. Considering the relation between batch size and learning rate (I don't know whether it still holds for a batch size this small), the lr should stay at 0.00025.
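
(If the 0.00025 vs 0.001 numbers are read through the usual linear-scaling rule, i.e. learning rate proportional to the total batch size, the arithmetic is sketched below. The base values are assumptions for illustration, not values read from the repo's config:)

# Linear scaling rule: effective lr grows with the total batch size.
base_lr_per_image = 0.00025      # assumed lr tuned for 1 image per mini-batch
images_per_gpu = 1               # assumed, matching 1 image per GPU

for num_gpus in (1, 2, 4):
    total_batch = num_gpus * images_per_gpu
    print('%d GPU(s): total batch %d -> effective lr %.5f'
          % (num_gpus, total_batch, base_lr_per_image * total_batch))
# Under these assumptions, 4 GPUs correspond to 0.001 (the value quoted from the
# paper) and a single GPU stays at 0.00025, matching the advice above.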

jucaowei commented 5 years ago

That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better. Here are the test results:

motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0] Mean AP@0.5 = 0.8444

@AresGao Hi, how many epochs exactly did you train the model for? I trained it for 2 epochs and got a result of about 73.16%. Also, why does the paper always talk about iterations rather than epochs? I look forward to hearing from you, thank you.

Feywell commented 5 years ago

@AresGao Finally I got good results after training for 2 complete epochs!!!

I just have one last question. I noticed that when the checkpoint is saved at the end of each epoch, the following code is used to create two new weights, rfcn_bbox_weight_test and rfcn_bbox_bias_test:

arg['rfcn_bbox_weight_test'] = weight * mx.nd.repeat(mx.nd.array(stds), repeats=repeat).reshape((bias.shape[0], 1, 1, 1))
arg['rfcn_bbox_bias_test'] = arg['rfcn_bbox_bias'] * mx.nd.repeat(mx.nd.array(stds), repeats=repeat) + mx.nd.repeat(mx.nd.array(means), repeats=repeat)

Why do we need to do this? I have verified that if I skip it, the checkpoint makes terrible predictions on the test data. This also turned out to be the reason my earlier predictions were bad even though the training loss looked good. Thank you!

Hi @FCInter, do you know why arg['rfcn_bbox_weight_test'] is created here? I am trying to change the detection network to a light-head design, so I did not keep arg['rfcn_bbox_weight_test'], but I get bad results. Do you know what arg['rfcn_bbox_weight_test'] means?