Closed chuchienshu closed 6 years ago
Please enable debug mode to see the exact error:
net = FlowNetS(debug=True)
If the error is the following (as it was in my case):
Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/opt/ffmpeg/lib
2017-09-29 18:35:26.710728: F ./tensorflow/stream_executor/lib/statusor.h:212] Non-OK-status: status_ status: Failed precondition: could not dlopen DSO: libcupti.so.8.0; dlerror: libcupti.so.8.0: cannot open shared object file: No such file or directory
Please add /usr/local/cuda/extras/CUPTI/lib64/ to your LD_LIBRARY_PATH
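A minimal sketch of the path fix above, for checking only. The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the variable has to be exported in the shell before Python is launched; changing it from inside a running script is too late. The helper name `with_cupti` is hypothetical, and the CUPTI directory is the one given in the comment above.

```python
import os

# Directory named in the comment above; adjust if CUDA lives elsewhere.
CUPTI_DIR = "/usr/local/cuda/extras/CUPTI/lib64"


def with_cupti(ld_library_path, cupti_dir=CUPTI_DIR):
    """Return ld_library_path with cupti_dir appended if it is missing."""
    entries = [p for p in ld_library_path.split(":") if p]
    # Compare without trailing slashes so "/x/lib64/" matches "/x/lib64".
    normalized = [p.rstrip("/") for p in entries]
    if cupti_dir.rstrip("/") not in normalized:
        entries.append(cupti_dir)
    return ":".join(entries)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # Print a line that can be pasted into the shell (or ~/.bashrc).
    print("export LD_LIBRARY_PATH=" + with_cupti(current))
```

Paste the printed line into the shell, then rerun the training script from that same shell.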
Actually, the segmentation fault is still happening for me.
After running
gdb python
run src/flownet_s/train.py
I found this:
QueueRunner: corrupted record at 275253737
Later edit:
It seems the SIGSEGV is raised in the C++ augmentation library, preprocessing.so:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7a7fc700 (LWP 13435)]
std::_Function_handler<void(long long int, long long int), tensorflow::Augment(tensorflow::OpKernelContext*, const Device&, int, int, int, int, int, int, int, float const*, float*, float const*, float*) [with Device = Eigen::ThreadPoolDevice]::<lambda(tensorflow::int64, tensorflow::int64)> >::_M_invoke(const std::_Any_data &, <unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a61>, <unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a66>) (__functor=..., __args#0=<unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a61>, __args#1=<unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a66>) at /usr/include/c++/5/functional:1871
1871 (*_Base::_M_get_pointer(__functor))(
Thanks for your kindness, bro. I bow to your judgement, but I don't have the ability to change the code in preprocessing.cc. Fortunately, I got a good result by commenting out part of that code and making a few careful modifications. Thanks all the same! @vladpaunescu
Can you please send in a pull request or let me know what you changed?
SP
@chuchienshu No worries :+1:
Can you share the fix? I still can't train any model at all; all I get is a segmentation fault. It would also help to know the exact TensorFlow version the code was implemented on, as well as the Python, CUDA, and cuDNN versions. @sampepose, could you kindly add the requirements to README.md?
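For anyone reporting their setup in this thread, a small sketch that collects the versions asked about above. It is only an illustration: the function name `collect_versions` is hypothetical, TensorFlow is reported only if it can be imported, and the CUDA and cuDNN versions are not detected here and would have to be checked separately (for example with `nvcc --version`).

```python
import platform


def collect_versions():
    """Gather the interpreter, OS, and (if importable) TensorFlow versions."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    try:
        import tensorflow as tf  # optional; only reported when installed
        info["tensorflow"] = tf.__version__
    except ImportError:
        info["tensorflow"] = "not installed"
    return info


for name, value in sorted(collect_versions().items()):
    print("{}: {}".format(name, value))
```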
Hi! I managed to train without any data augmentation. However, it would be nice to have the C++ code working, so any advice would be of great value to me.
Vlad
@sampepose I just removed the data augmentation code in src/dataloader.py, as shown below.
'''crop = [dataset_config['PREPROCESS']['crop_height'],
        dataset_config['PREPROCESS']['crop_width']]
config_a = config_to_arrays(dataset_config['PREPROCESS']['image_a'])
......
# Perform flow augmentation using spatial parameters from data augmentation
flows = _preprocessing_ops.flow_augmentation(
    flows, transforms_from_a, transforms_from_b, crop)'''
@vladpaunescu Looks like you did the same, and I wish the C++ code worked, too. It looks awesome!
The same problem happens to me, and I commented out the code as @chuchienshu did. But a new problem came up: the training process stops at an early step (0, 30, 60, etc., at random) and reports "LossTensor is inf or nan". Have you run into the same problem? Thank you very much! @sampepose @chuchienshu @vladpaunescu
I discovered that the problem may be due to compiling the code against the wrong TensorFlow version (1.2). I changed the version to 1.3 and recompiled the code, and it finally works.
@yinjunbo Hi, I ran into the same 'segmentation fault (core dumped)' problem and removed the data augmentation code as chuchienshu described. However, I now get an error like this:
File "/home/hmy/flownet2_tf/src/flownet2/train.py", line 22, in <module>
  './checkpoints/FlowNetSD/flownet-SD.ckpt-0': ('FlowNet2/FlowNetSD', 'FlowNet2')
File "src/net.py", line 99, in train
  predictions = self.model(inputs, training_schedule)
File "src/flownet2/flownet2.py", line 19, in model
  _, height, width, _ = inputs['input_a'].shape.as_list()
ValueError: need more than 3 values to unpack
Have you run into the same situation? If you have solved this problem, it would be very kind of you to share some tips with me. Thanks~
@myhooo Try adding this code before "return...":
image_as, image_bs, flows = map(lambda x: tf.expand_dims(x, 0), [image_a, image_b, flow])
and don't forget to change the corresponding variables in tf.train.batch.
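To see why the extra dimension matters, here is a pure-Python sketch that uses plain tuples as stand-ins for tensor shapes (no TensorFlow involved). The shape values and the helper `add_batch_dim` are hypothetical; the point is only that the unpacking in flownet2.py expects four dimensions (batch, height, width, channels), while a single example has three.

```python
def add_batch_dim(shape):
    """Mimic the effect of tf.expand_dims(x, 0) on a static shape."""
    return (1,) + tuple(shape)


# Hypothetical single-example shape: (height, width, channels).
single = (384, 512, 3)

try:
    # The same four-way unpacking that flownet2.py attempts on the input shape.
    _, height, width, _ = single
except ValueError as err:
    print("unbatched: " + str(err))

# With a leading batch dimension of 1, the unpacking succeeds.
_, height, width, _ = add_batch_dim(single)
print("batched: height={}, width={}".format(height, width))
```

The first unpacking raises ValueError, which is caught and printed; on Python 2 the message matches the "need more than 3 values to unpack" error in the traceback above.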
@yinjunbo Thank you very much~ It seems that you have recompiled successfully, congratulations! I'd like to know whether you only changed the TensorFlow version. Thank you in advance~
@myhooo You're welcome. I recompiled it without data augmentation, and version 1.2.0 seems to work.
Finally, I found that the problem is the g++/gcc version.
The default version used when compiling with the Makefile is 5.4.1, and that turned out to cause the problem.
When I changed g++/gcc to 4.8.4, everything worked well.
CC = gcc-4.8 -O2 -pthread
CXX = g++-4.8
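As background, TensorFlow's guide for building custom ops notes that gcc 5 and later default to a new C++ ABI, while the official pip packages of that era were built with gcc 4, and it recommends adding -D_GLIBCXX_USE_CXX11_ABI=0 when compiling an op library with gcc >= 5. Whether that mismatch is what caused the crash in this thread is not confirmed here. A small sketch (the helper name `needs_old_abi_flag` is hypothetical) that decides whether the flag applies to a given gcc version string:

```python
OLD_ABI_FLAG = "-D_GLIBCXX_USE_CXX11_ABI=0"


def needs_old_abi_flag(gcc_version):
    """Return True if this gcc defaults to the new C++11 ABI (gcc >= 5)."""
    major = int(gcc_version.split(".")[0])
    return major >= 5


# The two versions mentioned in the comment above.
for version in ("4.8.4", "5.4.1"):
    extra = OLD_ABI_FLAG if needs_old_abi_flag(version) else ""
    print("gcc {} -> extra flag: {!r}".format(version, extra))
```

If the flag applies, it could be tried in the Makefile's compile flags as an alternative to downgrading the compiler.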
When I run
python -m src.flownet_s.train
I get the printed output below. Environment: Ubuntu 16.04, CUDA 8.0, tensorflow-gpu 1.2.1, Python 2.7.12.
Can anyone help? Thanks a lot.