lmb-freiburg / flownet2

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
https://lmb.informatik.uni-freiburg.de/Publications/2017/IMKDB17/

Fail to fine-tune FlowNet2 #152

Closed E1EV1 closed 6 years ago

E1EV1 commented 6 years ago

Hi,

I'm trying to fine-tune FlowNet2 on my own dataset. I converted my data to LMDB and modified FlowNet2_train.prototxt to fit my problem.

Then, when I started training, I got a "Segmentation Fault" in CustomDataLayerPrefetch, and I don't know where the error comes from.

Any suggestions?

[Screenshot: current SIGSEGV error]

E1EV1 commented 6 years ago

I just found a lead in the terminal output: there is a layer that is not created. That could explain the segmentation fault; I think a layer pointer points to an undefined layer. I'll keep you informed.

[Screenshot: first abort]

E1EV1 commented 6 years ago

Unfortunately, fixing the creation of the layer_flow_gt_aug_FlowAugmentation1_0_split layer didn't change anything. Here is a gdb screenshot showing that Thread 6 received signal SIGSEGV.

[Screenshot: debug error]

nikolausmayer commented 6 years ago

Hi, is it possible that there is a difference between the libraries used at compile time and the ones used at runtime? For example, a "popular" error is that people have multiple Caffe installations which interfere with each other.
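One way to check this, as a rough sketch rather than a definitive tool: run `ldd` on the caffe binary and look at where each library actually resolves. The Python below parses `ldd`-style output. The function names (`parse_ldd`, `libs_outside`, `ldd_resolved`) and any paths passed to them are made up for illustration.

```python
import subprocess


def parse_ldd(ldd_output):
    """Map each library name to the path the dynamic loader resolved it to.

    Libraries reported as "not found" map to None.
    """
    resolved = {}
    for line in ldd_output.splitlines():
        # Typical line: "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x00007f...)"
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("("):
            # Only a load address, no path (e.g. the vDSO): nothing to compare.
            continue
        if not rest or rest.startswith("not found"):
            resolved[name.strip()] = None
        else:
            resolved[name.strip()] = rest.split()[0]
    return resolved


def libs_outside(resolved, expected_root, keywords=("caffe",)):
    """Return matching libraries that were NOT resolved under expected_root."""
    return {
        name: path
        for name, path in resolved.items()
        if any(k in name for k in keywords)
        and (path is None or not path.startswith(expected_root))
    }


def ldd_resolved(binary):
    """Run ldd on a binary (Linux only) and parse its output."""
    out = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    return parse_ldd(out.stdout)
```

Assuming the binary is the one under `.build_release/tools/`, calling `libs_outside(ldd_resolved(binary), flownet2_root)` should come back empty when only the FlowNet2 build of Caffe is picked up. Anything it lists is a candidate for the kind of interference described above. `LD_LIBRARY_PATH` affects what `ldd` reports, so run it from the same shell you train from.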

E1EV1 commented 6 years ago

Thank you for your reply. That should not be the case here: for the time being, I've installed only FlowNet2 with your Caffe version on this computer, to avoid interference.

E1EV1 commented 6 years ago

For two days I tried a lot of things suggested on forums: I recompiled all of FlowNet2, modified .bashrc, and modified Makefile.config, but without success. I just found something else to try: according to the Nvidia documentation, CUDA 8 doesn't work correctly with gcc versions newer than 5.3.1. My gcc is 5.4, so I will downgrade to test this.

If anybody has an idea, I'm more than interested.

E1EV1 commented 6 years ago

Changing the gcc version was useless, since gcc 5.4 is supported as of CUDA 8.0.61. I'm now trying to debug the thread directly; the gdb output is below in case anyone can find an explanation.

'''Thread 6 "caffe" received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7fffbef4f700 (LWP 18797)]

0x00007ffff747fe60 in void caffe::CustomDataLayerPrefetch(void) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

(gdb) thread apply all bt

Thread 6 (Thread 0x7fffbef4f700 (LWP 18797)):

0 0x00007ffff747fe60 in void caffe::CustomDataLayerPrefetch(void) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

1 0x00007fffe08286ba in start_thread (arg=0x7fffbef4f700)

at pthread_create.c:333

2 0x00007ffff579d41d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fffc5003700 (LWP 18795)):

0 pthread_cond_timedwait@@GLIBC_2.3.2 ()

at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225

1 0x00007fffc6437a57 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffc63f02c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe08286ba in start_thread (arg=0x7fffc5003700)

at pthread_create.c:333

5 0x00007ffff579d41d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7fffc5804700 (LWP 18794)):

0 0x00007ffff579174d in poll () at ../sysdeps/unix/syscall-template.S:84

1 0x00007fffc643548b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffc649a78f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe08286ba in start_thread (arg=0x7fffc5804700)

at pthread_create.c:333

5 0x00007ffff579d41d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7fffc6005700 (LWP 18793)):

0 0x00007ffff579e8c8 in accept4 (fd=17, addr=..., addr_len=0x7fffc6004a68,

flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40

1 0x00007fffc6436216 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffc642a80d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffc6436e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe08286ba in start_thread (arg=0x7fffc6005700)

at pthread_create.c:333

5 0x00007ffff579d41d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7fffc75f2700 (LWP 18791)):

0 0x00007ffff579174d in poll () at ../sysdeps/unix/syscall-template.S:84

1 0x00007fffd401f64c in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0

2 0x00007fffe08286ba in start_thread (arg=0x7fffc75f2700)

at pthread_create.c:333

3 0x00007ffff579d41d in clone ()

at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7f6db00 (LWP 18787)):

0 0x00007fffc600c454 in fatBinaryCtl ()

from /usr/lib/nvidia-390/libnvidia-fatbinaryloader.so.390.25

1 0x00007fffc6418fb0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffc6419bf3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffc6369de5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffc636a0f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

5 0x00007fffe1e7ddcd in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

6 0x00007fffe1e737f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

7 0x00007fffe1e80f31 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

8 0x00007fffe1e84621 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

9 0x00007fffe1e781bc in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

10 0x00007fffe1e5fff2 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

11 0x00007fffe1e9a15f in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

12 0x00007fffe0fd6e21 in cudnnCreate ()

from /usr/lib/x86_64-linux-gnu/libcudnn.so.7

13 0x00007ffff749ec8f in caffe::CuDNNConvolutionLayer::LayerSetUp(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

14 0x00007ffff73b4065 in caffe::Net::Init(caffe::NetParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

15 0x00007ffff73b5891 in caffe::Net::Net(caffe::NetParameter const&, caffe::Net const*) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

16 0x00007ffff73865ca in caffe::Solver::InitTrainNet() () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

17 0x00007ffff7387907 in caffe::Solver::Init(caffe::SolverParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

18 0x00007ffff7387caa in caffe::Solver::Solver(caffe::SolverParameter const&, caffe::Solver const*) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

19 0x00007ffff7594c43 in caffe::Solver* caffe::Creator_AdamSolver(caffe::SolverParameter const&) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

20 0x000000000040a6e8 in train() ()

21 0x00000000004075a8 in main ()'''

nikolausmayer commented 6 years ago

Your backtrace indicates that you are using CuDNN version 7. We've only ever used version 5. I know that it's relatively easy to make the code compatible with version 6, but I never tried 7.
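The library versions can be read straight out of a backtrace, because gdb prints the shared object each frame comes from. Here is a small sketch that pulls those names out of pasted gdb output. `shared_libs_in_backtrace` and `cudnn_major_versions` are hypothetical helper names, and the regular expression only assumes gdb's usual `from /path/lib.so.N` suffix.

```python
import re
from collections import Counter

# Matches the "from /some/path/libname.so.X.Y" suffix that gdb appends to a frame.
_FROM_LIB = re.compile(r"from\s+(\S+\.so[\w.\-]*)")


def shared_libs_in_backtrace(backtrace):
    """Count how many frames mention each shared object (by file name)."""
    paths = _FROM_LIB.findall(backtrace)
    return Counter(path.rsplit("/", 1)[-1] for path in paths)


def cudnn_major_versions(backtrace):
    """Return the set of cuDNN major versions seen in the backtrace."""
    versions = set()
    for name in shared_libs_in_backtrace(backtrace):
        match = re.fullmatch(r"libcudnn\.so\.(\d+)(?:\.\d+)*", name)
        if match:
            versions.add(int(match.group(1)))
    return versions
```

If the set contains more than one version, or a version other than the one Caffe was compiled against, that is worth ruling out before looking elsewhere.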

E1EV1 commented 6 years ago

Thank you for your reply. I will try downgrading my CuDNN version. Judging by https://github.com/lmb-freiburg/flownet2/issues/92 it should not change much, but you never know.

It's very strange: I can build and run FlowNet2 without problems, but I can't train or fine-tune.

nikolausmayer commented 6 years ago

Hm, that's strange, but it really might be a problem with CuDNN. But it might be worth asking the people in #92 whether they actually used training, or just testing :wink:

E1EV1 commented 6 years ago

Thank you for all your help, @nikolausmayer. Yes, that's why I downgraded my CuDNN version, but unfortunately I still get the same issue :( The error message is below.

My setup: Ubuntu 16.04, 980Ti, CUDA 8.0.61, CuDNN 5.1, gcc 5.4, Python 3.5. If anyone has a suggestion, I'm really interested :)

'''Thread 6 "caffe" received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7fffc9f02700 (LWP 10153)]

0x00007ffff7481460 in void caffe::CustomDataLayerPrefetch(void) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

(gdb) thread apply all bt

Thread 6 (Thread 0x7fffc9f02700 (LWP 10153)):

0 0x00007ffff7481460 in void caffe::CustomDataLayerPrefetch(void) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

1 0x00007fffe776c6ba in start_thread (arg=0x7fffc9f02700) at pthread_create.c:333

2 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fffcbf47700 (LWP 10152)):

0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225

1 0x00007fffcd37ba57 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffcd3342c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe776c6ba in start_thread (arg=0x7fffcbf47700) at pthread_create.c:333

5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7fffcc748700 (LWP 10151)):

0 0x00007ffff579374d in poll () at ../sysdeps/unix/syscall-template.S:84

1 0x00007fffcd37948b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffcd3de78f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe776c6ba in start_thread (arg=0x7fffcc748700) at pthread_create.c:333

5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7fffccf49700 (LWP 10150)):

0 0x00007ffff57a08c8 in accept4 (fd=17, addr=..., addr_len=0x7fffccf48a68, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40

1 0x00007fffcd37a216 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffcd36e80d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffcd37ae80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffe776c6ba in start_thread (arg=0x7fffccf49700) at pthread_create.c:333

5 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7fffce536700 (LWP 10148)):

0 0x00007ffff579374d in poll () at ../sysdeps/unix/syscall-template.S:84

1 0x00007fffdaf6364c in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0

2 0x00007fffe776c6ba in start_thread (arg=0x7fffce536700) at pthread_create.c:333

3 0x00007ffff579f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7f6db00 (LWP 10144)):

0 0x00007fffccf50454 in fatBinaryCtl () from /usr/lib/nvidia-390/libnvidia-fatbinaryloader.so.390.25

1 0x00007fffcd35cfb0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

2 0x00007fffcd35dbf3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

3 0x00007fffcd2adde5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

4 0x00007fffcd2ae0f0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

5 0x00007fffe842068d in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

6 0x00007fffe84160b0 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

7 0x00007fffe8423906 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

8 0x00007fffe8426f11 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

9 0x00007fffe841aa7c in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

10 0x00007fffe84072d2 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

11 0x00007fffe843ca4f in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

12 0x00007fffe7eea714 in cudnnCreate () from /usr/lib/x86_64-linux-gnu/libcudnn.so.5

13 0x00007ffff749edb1 in caffe::CuDNNConvolutionLayer::LayerSetUp(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

14 0x00007ffff73b5e25 in caffe::Net::Init(caffe::NetParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

15 0x00007ffff73b7651 in caffe::Net::Net(caffe::NetParameter const&, caffe::Net const*) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

16 0x00007ffff738838a in caffe::Solver::InitTrainNet() () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

17 0x00007ffff73896c7 in caffe::Solver::Init(caffe::SolverParameter const&) () from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

18 0x00007ffff7389a6a in caffe::Solver::Solver(caffe::SolverParameter const&, caffe::Solver const*) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

19 0x00007ffff7595983 in caffe::Solver* caffe::Creator_AdamSolver(caffe::SolverParameter const&) ()

from /home/ewan/Documents/flownet2-master/.build_release/tools/../lib/libcaffe.so.1.0.0-rc3

20 0x000000000040a6e8 in train() ()'''

E1EV1 commented 6 years ago

OK, I found the cause of the error! Some pictures in my dataset did not have the same size as the others. Now that all the images have the same size, I can fine-tune without a SIGSEGV.

Thank you @nikolausmayer for your help

nikolausmayer commented 6 years ago

Nice job. I guess it would be good if the converters or data layers checked for this... :slightly_smiling_face:
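A check like that can also be run by hand before building the LMDB. The sketch below is only an illustration, not what the FlowNet2 converters or data layers do: it groups files by (width, height) and reports the ones that differ from the majority. `find_size_outliers` and `read_flo_size` are hypothetical names. The `.flo` reader assumes the Middlebury layout (a 4-byte `PIEH` tag, then little-endian int32 width and height); image sizes would have to come from whatever image library you use.

```python
import struct
from collections import Counter


def read_flo_size(path):
    """Read (width, height) from the 12-byte header of a Middlebury .flo file."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) < 12 or header[:4] != b"PIEH":
        raise ValueError(f"{path}: not a valid .flo file")
    width, height = struct.unpack("<ii", header[4:12])
    return width, height


def find_size_outliers(sizes):
    """Given {name: (width, height)}, return (majority_size, outliers).

    outliers is a dict of the entries whose size differs from the most
    common one. An empty input gives (None, {}).
    """
    if not sizes:
        return None, {}
    majority, _ = Counter(sizes.values()).most_common(1)[0]
    outliers = {name: size for name, size in sizes.items() if size != majority}
    return majority, outliers
```

Running `find_size_outliers` over both the images and the flow files, and refusing to convert when the outlier dict is non-empty, would turn the silent mismatch into an explicit error message.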