FCInter opened this issue 5 years ago
I think it doesn't matter whether warping happens before or after the embedding, because the warping operation does not contain any learnable parameters. Just my personal opinion; hope it helps.
According to my test case, I'm afraid it really matters: when I build the training network and load the test checkpoint, the model does not converge very well. Moreover, although the warping operation has no parameters, it does change the feature map. That is, warping first and then embedding yields a very different feature map from embedding first and then warping.
Really? I have trained and tested the network, and it works well... Could you show your logs?
@AresGao I have updated the issue. I posted the printed logs from the training process. The problem is that I cannot get good results when I continue to train from the demo checkpoint provided in the README. The demo checkpoint yields very good results, but when I continue training from it, the results become terrible. Though I only trained for 4k iterations, I would expect that, since the initial checkpoint is already good, I should not need that many iterations. BTW, I'm curious why we are advised to train the model from the ResNet-101 and FlowNet checkpoints instead of directly from the demo checkpoint. I also tried training from the ResNet-101 and FlowNet checkpoints for 100k+ iterations, and the performance was even worse.
Thank you for your patience and kindness in helping me!
That's really strange. I trained from the ResNet-101 and FlowNet checkpoints for 100k+ iterations and the performance was even better. Here are the test results:
motion [0.0 1.0], area [0.0 0.0 100000.0 100000.0]: Mean AP@0.5 = 0.7648
motion [0.0 0.7], area [0.0 0.0 100000.0 100000.0]: Mean AP@0.5 = 0.5727
motion [0.7 0.9], area [0.0 0.0 100000.0 100000.0]: Mean AP@0.5 = 0.7515
motion [0.9 1.0], area [0.0 0.0 100000.0 100000.0]: Mean AP@0.5 = 0.8444
@AresGao Can you help me? I have a problem with "sh ./init.sh":
Traceback (most recent call last):
File "setup_linux.py", line 63, in
@AresGao What version of mxnet are you using? I was wondering if it's caused by the version, since I once hit a bug because I was using the wrong version.
I use the latest version of mxnet @FCInter
@txf201604 The function locate_cuda() finds where your CUDA is installed; I think you should check your CUDA location.
def locate_cuda():
    """Locate the CUDA environment on the system.

    Returns a dict with keys 'home', 'nvcc', 'include', and 'lib64'
    and values giving the absolute path to each directory.

    Starts by looking for the CUDAHOME env variable. If not found,
    everything is based on finding 'nvcc' in the PATH.
    """
First of all, I would like to express my sincere gratitude for replying to my email. I have already solved the "sh ./init.sh" problem; it was due to the Python version, and installing Python 2.7 fixed it. However, I now have a new problem that I have not been able to solve. When I set "USE_CUDA = 1" and "USE_CUDA_PATH = /usr/local/cuda" in mxnet's make/config.mk, the GPU build of mxnet fails as shown below:

In file included from src/operator/tensor/././sort_op.h:85:0,
                 from src/operator/tensor/./indexing_op.h:24,
                 from src/operator/tensor/indexing_op.cu:8:
src/operator/tensor/./././sort_op-inl.cuh:15:44: fatal error: cub/device/device_radix_sort.cuh: No such file or directory
compilation terminated.
Makefile:211: recipe for target 'build/src/operator/tensor/indexing_op_gpu.o' failed
make: *** [build/src/operator/tensor/indexing_op_gpu.o] Error 1
make: *** Waiting for unfinished jobs....

The CPU version of mxnet compiles successfully, but without GPU support, running "demo.py" reports the following error:

Stack trace returned 10 entries:
[bt] (0) /home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7febb91c2ebc]
[bt] (1) /home/user/anaconda3/envs/py2.7/lib/python2.7/site-packages/mxnet-0.10.0-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvoke+0x8c9) [0x7febb9f6d6e9]
[bt] (2) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7feb6c623ec0]
[bt] (3) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7feb6c62387d]
[bt] (4) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4de) [0x7feb6c83a8de]
[bt] (5) /home/user/anaconda3/envs/py2.7/lib/python2.7/lib-dynload/_ctypes.so(+0x9b31) [0x7feb6c830b31]
[bt] (6) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7febc057b973]
[bt] (7) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9) [0x7febc0611d49]
[bt] (8) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9) [0x7febc06176c9]
[bt] (9) /home/user/anaconda3/envs/py2.7/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6a08) [0x7febc0614b98]

Traceback (most recent call last):
  File "/home/user/tmp/Flow-Guided-Feature-Aggregation/fgfa_rfcn/demo.py", line 257, in
@AresGao Finally I got good results after training for 2 complete epochs!!!
I just have one last question. I find that when saving the checkpoint at the end of each epoch, the following code is used to create two new parameters, namely rfcn_bbox_weight_test and rfcn_bbox_bias_test:
arg['rfcn_bbox_weight_test'] = weight * mx.nd.repeat(mx.nd.array(stds), repeats=repeat).reshape((bias.shape[0], 1, 1, 1))
arg['rfcn_bbox_bias_test'] = arg['rfcn_bbox_bias'] * mx.nd.repeat(mx.nd.array(stds), repeats=repeat) + mx.nd.repeat(mx.nd.array(means), repeats=repeat)
Why do we need to do this? I have verified that if I skip it, the checkpoint makes terrible predictions on the test data. This is also the reason my earlier predictions were bad even though the training loss looked good. Thank you!
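For reference, here is a minimal numpy sketch of the algebra behind those two lines (the shapes and the stds/means values are hypothetical stand-ins for the config's bbox normalization constants); presumably the folded *_test parameters let the test network output un-normalized box deltas directly:

import numpy as np

num_outputs = 8                              # hypothetical: 2 classes x 4 bbox coords
stds  = np.tile([0.1, 0.1, 0.2, 0.2], num_outputs // 4)
means = np.zeros(num_outputs)

W = np.random.randn(num_outputs, 16)         # stand-in for the rfcn_bbox conv weight
b = np.random.randn(num_outputs)             # stand-in for the rfcn_bbox bias
x = np.random.randn(16)                      # stand-in for one feature vector

normalized  = np.dot(W, x) + b               # the layer was trained against normalized targets
deltas_post = normalized * stds + means      # un-normalizing afterwards in post-processing

W_test = W * stds[:, None]                   # fold stds into the weight
b_test = b * stds + means                    # fold stds and means into the bias
deltas_folded = np.dot(W_test, x) + b_test   # the *_test parameters produce these directly

assert np.allclose(deltas_post, deltas_folded)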
Hi @AresGao @FCInter @YuwenXiong, I tried the training and inference of the code. I used 4 GPUs with all settings unchanged, and the final mAP is 75.78. I am confused about the drop in mAP. Did you change any settings, or do you have any advice for my case?
@AresGao Hi~ I have just one GPU (1080 Ti), and I only get mAP = 0.7389 when testing with the default settings. How did you get a better mAP? Can you tell us your setting details, such as epochs, min_diff/max_diff, lr, GPUs, test key frame, and so on? Thank you!
@Feywell Hi Feywell, I have tested the default setting with 2 GPUs and with 4 GPUs; the result with 4 GPUs is much better than with 2. PS: lr = 0.00025 is equivalent to the 0.001 described in the paper. You can find more details in their code.
@withinnoitatpmet Thank you! So, if I have just one GPU, would setting lr = 0.001 be better?
@Feywell I think the result could be even worse. Considering the relation between batch size and learning rate (I don't know whether it still holds for a batch size this small), the lr should stay at 0.00025.
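A back-of-the-envelope reading of that equivalence (assuming the config's lr is per image, while the paper's 0.001 refers to the full 4-GPU batch):

paper_lr = 0.001            # learning rate quoted in the paper for 4-GPU training
images_per_batch = 4        # assumption: one image per GPU on 4 GPUs
per_image_lr = paper_lr / images_per_batch
print(per_image_lr)         # 0.00025, the default value in the config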
@AresGao Hi, I want to know exactly how many epochs you set to train the model. I trained this model for 2 epochs and got a result of about 73.16%. Also, why does the paper always talk about iterations rather than epochs? I look forward to hearing from you. Thank you!
Hi @FCInter, do you know why arg['rfcn_bbox_weight_test'] is created there? I am trying to change the detection network to Light-Head, so I did not keep arg['rfcn_bbox_weight_test'], but I get a bad result. Do you know what the meaning of arg['rfcn_bbox_weight_test'] is?
I read the code and found that the training and test networks differ in structure. In particular, in the training network, the embedding layers are added after the concatenation of the output of ResNet-101 and FlowNet, as shown in the following code:
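(Schematic paraphrase only; warp, embed_net and the array shapes below are illustrative stand-ins, not the repo's actual symbols.)

import numpy as np

def warp(feat, flow):                        # stand-in for the bilinear flow-warping op
    return feat + flow                       # placeholder only

def embed_net(feat):                         # stand-in for the embedding convs
    return np.tanh(feat)                     # placeholder nonlinearity

cur_feat  = np.random.randn(1, 4, 8, 8)      # current-frame ResNet-101 feature
near_feat = np.random.randn(1, 4, 8, 8)      # nearby-frame ResNet-101 feature
flow      = np.random.randn(1, 4, 8, 8)      # stand-in for the FlowNet output

# training symbol: warp the nearby feature first, concatenate, then embed
warped = warp(near_feat, flow)
feats  = np.concatenate([cur_feat, warped], axis=0)
embeds = embed_net(feats)                    # embedding computed on already-warped features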
However, in the test network the structure is very different: the embedding layers are applied only to the output of ResNet-101, and the output of the embedding layers is then concatenated with the output of FlowNet, as shown in the following code:
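(Again only a schematic paraphrase with the same stand-ins, not the repo's actual symbols; the point is that here the embedding is computed on the un-warped feature, and the flow only comes in afterward.)

import numpy as np

def warp(feat, flow):                        # stand-in for bilinear flow warping
    return feat + flow                       # placeholder only

def embed_net(feat):                         # stand-in for the embedding convs
    return np.tanh(feat)                     # placeholder nonlinearity

near_feat = np.random.randn(1, 4, 8, 8)      # nearby-frame ResNet-101 feature
flow      = np.random.randn(1, 4, 8, 8)      # stand-in for the FlowNet output

# test symbol: embed each frame's feature once, then apply the flow afterward
near_embed   = embed_net(near_feat)          # embedding of the un-warped feature
warped_feat  = warp(near_feat, flow)
warped_embed = warp(near_embed, flow)

# Note that warp(embed_net(f)) != embed_net(warp(f)) when embed_net is nonlinear,
# which is why the ordering can matter even though warping has no parameters.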
Why are the networks so different between training and test?
Moreover, when I train the network, the training accuracy stays high (around 0.98), but when I test the model on the demo images, I get terrible results.
I cloned the project and configured all the data and model paths following the instructions. I tried to continue training from the demo model, i.e.
./model/rfcn_fgfa_flownet_vid-0000.params
which is downloaded from the OneDrive URL provided in the README.
The training logs look normal.
But when I run the trained checkpoint on the demo images used in demo.py, I get tens of boxes overlapping each other and can hardly figure out which one is good, whereas the ground truth contains only 7 planes in each image. This is strange, because even the initial demo checkpoint gives good detection results, with 7 bounding boxes each covering a plane perfectly.
What's wrong with my training?
The only changes I made to the code are the model and data paths, as well as the GPU ids.