VinsonXuxuxu opened this issue 6 years ago (status: Open)
Sorry to disturb you again. How can I solve this problem?
Hi,
I think this is due to the TensorFlow version. The input format of the ResNet bottleneck changed after 0.12, so you can switch back to 0.12 and see whether the error is still there.
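If it helps, a quick, repo-independent way to confirm which TensorFlow version is actually being picked up (just a small sketch):

# Print the TensorFlow version the training scripts will import.
import tensorflow as tf

print(tf.__version__)  # this code base targets 0.12.x
if not tf.__version__.startswith('0.12'):
    print('Note: the ResNet bottleneck input format changed after 0.12.')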
Hi, thanks for your reply. During training I often run into a problem like this. Do you know why?
I think the GPU memory is not enough. You can reduce the RPN batch size to lower the memory requirement.
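For what it's worth, here is a minimal sketch of lowering the RPN batch size programmatically. It assumes the tf-faster-rcnn style config module that RGB-N appears to be built on (cfg / cfg_from_list in lib/model/config.py); those names and the exact module path are assumptions, not confirmed in this thread.

# Sketch only: override TRAIN.RPN_BATCHSIZE before building the network,
# assuming a tf-faster-rcnn style config module (lib/model/config.py).
from model.config import cfg, cfg_from_list

cfg_from_list(['TRAIN.RPN_BATCHSIZE', '64'])  # the usual default is 256
print('RPN batch size:', cfg.TRAIN.RPN_BATCHSIZE)

Editing the value directly in the cfgs/*.yml experiment file has the same effect.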
Hi @pengzhou1108, I have a similar issue; my GPU is a GeForce 1080 (~8 GB). After changing the RPN batch size to 1, it still doesn't work. Any ideas how to figure this out?
Error log:

E tensorflow/stream_executor/cuda/cuda_fft.cc:169] failed to create cuFFT batched plan:2
E tensorflow/stream_executor/cuda/cuda_fft.cc:111] failed to run cuFFT routine cufftSetStream: 1
E tensorflow/stream_executor/cuda/cuda_fft.cc:169] failed to create cuFFT batched plan:2
W tensorflow/core/framework/op_kernel.cc:975] Internal: c2c fft failed : in.shape=[3136,16384]
  [[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: c2c fft failed : in.shape=[3136,16384]
  [[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./tools/trainval_net.py", line 174, in <module>
    max_iters=args.max_iters)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 356, in train_net
    sw.train_model(sess, max_iters)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 247, in train_model
    self.net.train_step(sess, blobs, train_op)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/network_fusion.py", line 456, in train_step
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: c2c fft failed : in.shape=[3136,16384]
  [[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op 'noise_pred/FFT', defined at:
  File "./tools/trainval_net.py", line 174, in <module>
    max_iters=args.max_iters)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 356, in train_net
    sw.train_model(sess, max_iters)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 105, in train_model
    anchor_ratios=cfg.ANCHOR_RATIOS)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/network_fusion.py", line 377, in create_architecture
    rois, cls_prob, bbox_pred = self.build_network(sess, training)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/resnet_fusion.py", line 275, in build_network
    bilinear_pool=compact_bilinear_pooling_layer(fc7,noise_fc7,2048*8,compute_size=16,sequential=False)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/compact_bilinear_pooling/compact_bilinear_pooling.py", line 137, in compact_bilinear_pooling_layer
    sequential, compute_size)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/compact_bilinear_pooling/compact_bilinear_pooling.py", line 12, in _fft
    return tf.fft(bottom)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 800, in fft
    result = _op_def_lib.apply_op("FFT", input=input, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): c2c fft failed : in.shape=[3136,16384]
  [[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]
Hi,
I guess you did not change the batch size correctly. How did you change the RPN batch size? You should change the number in the cfgs/*.yml file, which holds the actual parameters used for training and testing, rather than the files in the lib folder.
I set TRAIN.RPN_BATCHSIZE = 1 in train_faster_rcnn.sh to override the config, and the printed log shows the setting took effect.
One update: I checked the log again and found that in resnet_fusion.py, build_network calls the bilinear pooling layer with sequential=False. I changed this to True and now it can run. Does sequential=True make sense?
rois, cls_prob, bbox_pred = self.build_network(sess, training)
  File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/resnet_fusion.py", line 275, in build_network
    bilinear_pool=compact_bilinear_pooling_layer(fc7,noise_fc7,2048*8,compute_size=16,sequential=False)
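For context, in the reference compact bilinear pooling implementation (which the lib/compact_bilinear_pooling code appears to follow; that is an assumption here), sequential=True computes the FFTs in chunks of compute_size examples instead of building one large batched cuFFT plan, so peak GPU memory drops at the cost of speed. A sketch of the call from the traceback with the flag flipped:

# Based on the call shown in the traceback (resnet_fusion.py, line 275).
# sequential=True computes the FFTs piecewise (compute_size examples at a
# time) rather than as one big batched plan, trading speed for lower memory.
bilinear_pool = compact_bilinear_pooling_layer(
    fc7, noise_fc7, 2048 * 8,   # output dimension = 16384
    compute_size=16,
    sequential=True)            # was False in resnet_fusion.py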
The log you sent indicates that the dimensionality of the bilinear layer input is [3136 (64x7x7), 16384], i.e. the batch size is 64 (the default RPN batch size in the yml file). In my case, sequential is set to False.
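As a rough sanity check on the memory involved (back-of-the-envelope arithmetic only, not measured): the failing FFT input is [3136, 16384] complex values, and the leading dimension scales directly with the RPN batch size, so cutting the batch size shrinks the batched cuFFT plan proportionally.

# Rough size of the tensor in the failing noise_pred/FFT node.
rows = 64 * 7 * 7            # RPN batch size x 7x7 pooled features = 3136
cols = 2048 * 8              # compact bilinear output dimension = 16384
bytes_per_elem = 8           # complex64
print(rows * cols * bytes_per_elem / 1024 ** 2)  # ~392 MiB for one buffer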
OK, thanks! I will change the config yml directly.