sampepose / flownet2-tf

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
MIT License

Segmentation fault (core dumped) #10

Closed chuchienshu closed 6 years ago

chuchienshu commented 7 years ago

When I run python -m src.flownet_s.train, I get the following output:

2017-09-26 07:49:25.716966: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 07:49:25.716993: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 07:49:25.716998: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 07:49:25.717002: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 07:49:25.717006: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 07:49:26.007158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: TITAN Xp
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:02:00.0
Total memory: 11.90GiB
Free memory: 11.74GiB
2017-09-26 07:49:26.007182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-09-26 07:49:26.007186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-09-26 07:49:26.007192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0)
Segmentation fault (core dumped)

OS information: Ubuntu 16.04, CUDA 8.0, tensorflow-gpu 1.2.1, Python 2.7.12

Can anyone help? Thanks a lot.

vladpaunescu commented 7 years ago

Please enable debug to check the exact error:

net = FlowNetS(debug=True)

If the error is the following (as it was in my case):

Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/opt/ffmpeg/lib
2017-09-29 18:35:26.710728: F ./tensorflow/stream_executor/lib/statusor.h:212] Non-OK-status: status_ status: Failed precondition: could not dlopen DSO: libcupti.so.8.0; dlerror: libcupti.so.8.0: cannot open shared object file: No such file or directory

Please add /usr/local/cuda/extras/CUPTI/lib64/ to your LD_LIBRARY_PATH, as discussed in:

https://github.com/tensorflow/tensorflow/issues/8830
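To check whether the CUPTI directory is already on the library path, a minimal Python sketch along these lines may help. It is not from the thread: the helper names has_dir and with_dir are hypothetical, and CUPTI_DIR assumes the default CUDA install location mentioned above.

```python
import os

# Assumed default location (from the comment above); adjust to the local CUDA install.
CUPTI_DIR = "/usr/local/cuda/extras/CUPTI/lib64"


def has_dir(ld_library_path, directory):
    """Return True if `directory` is one of the colon-separated entries."""
    wanted = os.path.normpath(directory)
    entries = [e for e in ld_library_path.split(":") if e]
    return any(os.path.normpath(e) == wanted for e in entries)


def with_dir(ld_library_path, directory):
    """Return the path string with `directory` appended if it is missing."""
    if has_dir(ld_library_path, directory):
        return ld_library_path
    if not ld_library_path:
        return directory
    return ld_library_path + ":" + directory


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("CUPTI on LD_LIBRARY_PATH:", has_dir(current, CUPTI_DIR))
    print("Suggested value:", with_dir(current, CUPTI_DIR))
```

The suggested value generally has to be exported in the shell before Python starts, since the dynamic loader normally reads LD_LIBRARY_PATH at process startup. The script only reports what to set.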

vladpaunescu commented 7 years ago

Actually, the segmentation fault is still happening for me. After running the script under gdb (gdb python, then run src/flownet_s/train.py), I found this:

QueueRunner: corrupted record at 275253737

Later edit:

It seems the SIGSEGV is raised in the C++ augmentation library, preprocessing.so:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7a7fc700 (LWP 13435)]
std::_Function_handler<void(long long int, long long int), tensorflow::Augment(tensorflow::OpKernelContext*, const Device&, int, int, int, int, int, int, int, float const*, float*, float const*, float*) [with Device = Eigen::ThreadPoolDevice]::<lambda(tensorflow::int64, tensorflow::int64)> >::_M_invoke(const std::_Any_data &, <unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a61>, <unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a66>) (__functor=..., __args#0=<unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a61>, __args#1=<unknown type in /mnt/hdd1/git/caffe_experiments/obstacle-detection/flow/flownet2-tf/src/./ops/build/preprocessing.so, CU 0x72971, DIE 0xb7a66>)
    at /usr/include/c++/5/functional:1871
1871 (*_Base::_M_get_pointer(__functor))(

chuchienshu commented 7 years ago

Thanks for your help. I agree with your diagnosis, but I am not able to fix the code in preprocessing.cc. Fortunately, I got good results by commenting out part of that code and making a few small modifications. Thanks all the same! @vladpaunescu

sampepose commented 7 years ago

Can you please send in a pull request or let me know what you changed?

SP

vladpaunescu commented 7 years ago

@chuchienshu No worries :+1:

Can you share the fix? I still can't train any model at all; all I get is a segmentation fault. It would also help to know the exact TensorFlow version the code was developed against, as well as the Python, CUDA, and cuDNN versions. @sampepose, could you kindly add the requirements to README.md?

vladpaunescu commented 7 years ago

Hi! I managed to train without any data augmentation. However, it would be nice to have the C++ code working, so any advice would be of great value to me.

Vlad

chuchienshu commented 6 years ago

@sampepose I just commented out the data augmentation code in src/dataloader.py, as shown below.

'''crop = [dataset_config['PREPROCESS']['crop_height'],
                dataset_config['PREPROCESS']['crop_width']]
        config_a = config_to_arrays(dataset_config['PREPROCESS']['image_a'])
        ......
 # Perform flow augmentation using spatial parameters from data augmentation
            flows = _preprocessing_ops.flow_augmentation(
                flows, transforms_from_a, transforms_from_b, crop)'''

@vladpaunescu It looks like you did the same. I hope the C++ code can be made to work too; it looks awesome!

yinjunbo commented 6 years ago

The same problem happens to me, and I commented out the code as @chuchienshu did. But a new problem came up: the training process stops at an early step (0, 30, 60, etc., at random) and reports "LossTensor is inf or nan". Have you met the same problem? Thank you very much! @sampepose @chuchienshu @vladpaunescu

I found that the problem may be caused by compiling the code against the wrong TensorFlow version (1.2). I changed the version to 1.3 and recompiled the code, and it finally works.

myhooo commented 6 years ago

@yinjunbo Hi, I hit the same 'Segmentation fault (core dumped)' problem and removed the data augmentation code as chuchienshu described. However, I now get an error like this:

  File "/home/hmy/flownet2_tf/src/flownet2/train.py", line 22, in <module>
    './checkpoints/FlowNetSD/flownet-SD.ckpt-0': ('FlowNet2/FlowNetSD', 'FlowNet2')
  File "src/net.py", line 99, in train
    predictions = self.model(inputs, training_schedule)
  File "src/flownet2/flownet2.py", line 19, in model
    _, height, width, _ = inputs['input_a'].shape.as_list()
ValueError: need more than 3 values to unpack

Have you met the same situation? If you have solved it, it would be very kind of you to share some tips with me, thanks~

yinjunbo commented 6 years ago

@myhooo Try adding this code before "return...":

image_as, image_bs, flows = map(lambda x: tf.expand_dims(x, 0), [image_a, image_b, flow])

And don't forget to change the corresponding variables in tf.train.batch.
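To see why the extra dimension matters, here is a small pure-Python sketch (not from the repository) of the shape arithmetic behind the ValueError. The model unpacks a four-entry shape list, so a shape with only three entries cannot be unpacked. The helper names and the 384x512x3 image size are hypothetical, and add_batch_dim only mimics the shape effect of tf.expand_dims(x, 0) rather than calling TensorFlow.

```python
def unpack_nhwc(shape):
    """Unpack a 4-entry [batch, height, width, channels] shape list."""
    _, height, width, _ = shape
    return height, width


def add_batch_dim(shape):
    """Mimic the shape effect of tf.expand_dims(x, 0): prepend a size-1 axis."""
    return [1] + list(shape)


# Hypothetical height, width, channels of one image without a batch axis.
single_image = [384, 512, 3]

try:
    unpack_nhwc(single_image)
except ValueError as err:
    # Three entries cannot be unpacked into four names.
    print("unbatched shape fails:", err)

# With the leading axis restored, the unpacking succeeds.
print(unpack_nhwc(add_batch_dim(single_image)))
```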

myhooo commented 6 years ago

@yinjunbo Thank you very much~ It seems that you have recompiled successfully, congratulations! Did you only change the TensorFlow version? Thank you in advance~

yinjunbo commented 6 years ago

@myhooo You're welcome. I recompiled it without data augmentation, and version 1.2.0 seems to work.

yinjunbo commented 6 years ago

Finally, I found that it's a problem with the g++/gcc version. The default version used when building with the Makefile is 5.4.1, which turns out to cause the problem. When I changed g++/gcc to 4.8.4, everything worked. The settings I used:

CC = gcc-4.8 -O2 -pthread
CXX = g++-4.8
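Before rebuilding the ops, it can help to confirm which compiler versions are actually installed. The sketch below is an illustration, not part of the repository: the helper names are hypothetical, and it relies only on the -dumpversion flag that gcc and g++ support. The thread does not say why gcc 5 breaks the op. One possible explanation (an assumption, not confirmed here) is the C++ ABI change introduced in gcc 5: TensorFlow's custom op guide notes that its pip packages were built with gcc 4 and suggests adding -D_GLIBCXX_USE_CXX11_ABI=0 when compiling ops with gcc 5 or newer.

```python
import subprocess


def major_version(dumpversion_output):
    """Parse the leading integer from `g++ -dumpversion` style output."""
    return int(dumpversion_output.strip().split(".")[0])


def compiler_major(compiler="g++"):
    """Return the compiler's major version, or None if it cannot be determined."""
    try:
        out = subprocess.check_output([compiler, "-dumpversion"])
        return major_version(out.decode("utf-8"))
    except (OSError, subprocess.CalledProcessError, ValueError):
        # Compiler missing, failed to run, or printed an unexpected format.
        return None


if __name__ == "__main__":
    for name in ("g++", "g++-4.8"):
        print(name, "->", compiler_major(name))
```

If g++ reports 5 or newer and g++-4.8 is available, pointing CC and CXX at the 4.8 binaries as above is the fix reported in this thread.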