tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Segmentation fault in _pywrap_tensorflow_internal.so #16311

Closed (mpkuse closed this issue 3 years ago)

mpkuse commented 6 years ago

System information

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

== cat /etc/issue ===============================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.14.0)
protobuf (3.5.1)
tensorflow-gpu (1.4.1)
tensorflow-tensorboard (0.4.0)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.4.1
tf.GIT_VERSION = v1.4.0-19-ga52c8d9
tf.COMPILER_VERSION = v1.4.0-19-ga52c8d9
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH :/usr/local/cuda/lib64/
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Jan 23 11:09:12 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 40%   66C    P2   182W / 250W |  11763MiB / 12189MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |  11591MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   48C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 26%   46C    P0    63W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 26%   46C    P0    63W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 23%   43C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 25%   44C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 42%   69C    P2   167W / 250W |  11833MiB / 12189MiB |     31%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19682      C   /home/peiliang/tensorflow/bin/python       11751MiB |
|    1     19682      C   /home/peiliang/tensorflow/bin/python       11579MiB |
|    7     27581      C   python                                     11823MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.1/lib64/libcudart.so.9.1.85
/usr/local/cuda-9.1/lib64/libcudart_static.a
/usr/local/cuda-9.1/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.1/doc/man/man7/libcudart.7

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

My custom training code works perfectly on my older workstation with 2 GPU cards, but I am having an issue on our new workstation, which has 8 GPU cards: I get a segmentation fault.

Source code / logs

The entire source code is: https://github.com/mpkuse/cartwheel_train/tree/config-files

The main script is train_netvlad.py. My training data is currently private; if you really need it for testing, I can provide it as well (~100 GB).

My code basically builds a network with tf.slim, with a custom operation to build one of the layers and a custom loss function; both can be found in the class VGGDescriptor in CartWheelFlow.py. It uses tf.while_loop. Data loading is managed by the class TimeMachineRender.
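For context, here is a minimal sketch of the kind of graph this involves; the placeholder shape, layer sizes, and loop body are illustrative assumptions only, not the actual CartWheelFlow.py code.

import tensorflow as tf
import tensorflow.contrib.slim as slim

# Illustrative input shape; the real network's input differs.
x = tf.placeholder(tf.float32, [None, 240, 320, 3], name='input')

# A small slim convnet standing in for the VGG-style descriptor network.
net = slim.conv2d(x, 64, [3, 3], scope='conv1')
net = slim.max_pool2d(net, [2, 2], scope='pool1')
net = slim.conv2d(net, 128, [3, 3], scope='conv2')
desc = slim.flatten(net)

# Per-sample post-processing inside a tf.while_loop, similar in spirit to
# VGGDescriptor (the real loop body is different).
def body(i, acc):
    return i + 1, acc + tf.reduce_sum(desc[i])

_, total = tf.while_loop(
    cond=lambda i, acc: i < tf.shape(desc)[0],
    body=body,
    loop_vars=[tf.constant(0), tf.constant(0.0)])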

Stack trace for the crash:

$ gdb --args python train_netvlad.py -t tfsuper.logs/test 
(gdb) run
.
.
.
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) where
#0  0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#1  0x00007ffef7f6c4d9 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007ffef5ed94ea in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorfl---Type <return> to continue, or q <return> to quit---
ow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007ffef5ed9824 in TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007ffef5bf701a in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007ffef5bf7411 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/---Type <return> to continue, or q <return> to quit---
_pywrap_tensorflow_internal.so
#6  0x00007ffef5bbb6f1 in _wrap_TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00000000004c45fa in PyEval_EvalFrameEx ()
#8  0x00000000004c2705 in PyEval_EvalCodeEx ()
#9  0x00000000004de69e in ?? ()
#10 0x00000000004b0c93 in PyObject_Call ()
#11 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#12 0x00000000004c2705 in PyEval_EvalCodeEx ()
#13 0x00000000004ca7df in PyEval_EvalFrameEx ()
#14 0x00000000004c2705 in PyEval_EvalCodeEx ()
#15 0x00000000004ca7df in PyEval_EvalFrameEx ()
#16 0x00000000004c2705 in PyEval_EvalCodeEx ()
#17 0x00000000004ca7df in PyEval_EvalFrameEx ()
#18 0x00000000004c2705 in PyEval_EvalCodeEx ()
#19 0x00000000004ca088 in PyEval_EvalFrameEx ()
#20 0x00000000004c2705 in PyEval_EvalCodeEx ()
#21 0x00000000004c24a9 in PyEval_EvalCode ()
#22 0x00000000004f19ef in ?? ()
#23 0x00000000004ec372 in PyRun_FileExFlags ()
#24 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#25 0x000000000049e208 in Py_Main ()
#26 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=4, 
    argv=0x7fffffffe558, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe548) at ../csu/libc-start.c:291
#27 0x000000000049da59 in _start ()
PeiliangLi commented 6 years ago

I have a similar issue when I follow these steps https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md to train object detection. I use the TensorFlow Object Detection API, and the model checkpoint is downloaded from http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2017_11_08.tar.gz. The pipeline config file and datasets are exactly the same as the provided ones. train.py runs normally for several steps and then hits a segmentation fault. Sample logs:

INFO:tensorflow:global step 4457: loss = 0.0524 (0.523 sec/step)
INFO:tensorflow:global step 4458: loss = 0.0071 (0.789 sec/step)
INFO:tensorflow:global step 4459: loss = 0.4697 (0.677 sec/step)
INFO:tensorflow:global step 4460: loss = 0.0488 (0.396 sec/step)
INFO:tensorflow:global step 4461: loss = 0.0560 (0.637 sec/step)
INFO:tensorflow:global step 4462: loss = 0.0424 (0.588 sec/step)
INFO:tensorflow:global step 4463: loss = 0.0227 (0.525 sec/step)
INFO:tensorflow:global step 4464: loss = 0.0826 (0.693 sec/step)

Thread 157 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff7b4ff9700 (LWP 19993)]
0x00007fff4d5cdd80 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
(gdb) where
#0  0x00007fff4d5cdd80 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#1  0x00007fff4d5cde25 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find(std::string const&) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#2  0x00007fff4d5ce6e1 in tensorflow::ResourceMgr::DoLookup(std::string const&, std::type_index, std::string const&, tensorflow::ResourceBase**) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x00007fff4fbd36e6 in tensorflow::GetStack(tensorflow::OpKernelContext*, tensorflow::Stack**) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007fff4fbd439d in tensorflow::StackPushOp<Eigen::GpuDevice>::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007fff4da034ea in tensorflow::BaseGPUDevice::ComputeAsync(tensorflow::AsyncOpKernel*, tensorflow::OpKernelContext*, std::function<void ()>) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#6  0x00007fff4da3a17b in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#7  0x00007fff4da3a3ea in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#8  0x00007fff4d6d7782 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#9  0x00007fff4d6d6832 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#10 0x00007fff462dfc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff7bc16ba in start_thread (arg=0x7ff7b4ff9700) at pthread_create.c:333
#12 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109


tatatodd commented 6 years ago

@mpkuse @PeiliangLi it's not clear to me whether your segfaults are related. I took a look at @PeiliangLi's stack trace (since it has more info), and I'm not sure exactly why we'd segfault in that logic.

Perhaps either/both of you can try running on our nightly builds, to see if that changes anything? https://pypi.python.org/pypi/tf-nightly-gpu https://github.com/tensorflow/tensorflow

mpkuse commented 6 years ago

I tried out the nightly build ( tf_nightly_gpu-1.6.0.dev20180125-cp27-cp27mu-manylinux1_x86_64.whl )

I basically uninstalled my previously installed version. For it to work, I also had to install CUDA 9.0 and cuDNN 7.

My code worked for a while (usually about 50 iterations), then still hit SIGSEGV with the same stack trace. Is it helpful to mention that I use feed_dict in sess.run() to set my current batch for training?
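To make that call pattern concrete, here is a minimal, hypothetical sketch of such a feed_dict-driven training loop; none of the tensors or names below come from train_netvlad.py.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
cost = tf.reduce_mean(tf.squared_difference(pred, y))
train_op = tf.train.AdamOptimizer(1e-3).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # Each batch is fed in through feed_dict, as in the training script.
        im_batch = np.random.rand(16, 10).astype(np.float32)
        label_batch = np.random.rand(16, 1).astype(np.float32)
        # The reported SIGSEGV happens inside this call after roughly 50 iterations.
        _, c = sess.run([train_op, cost],
                        feed_dict={x: im_batch, y: label_batch})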

Peiliang Li and I are using the same physical computer.

Here is the output from tf_env_collect after the new nightly installation:

== tensorflow import ============================================
tf.VERSION = 1.6.0-dev20180125
tf.GIT_VERSION = v1.5.0-rc1-1632-g6743031
tf.COMPILER_VERSION = v1.5.0-rc1-1632-g6743031
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH :/usr/local/cuda/lib64/:/usr/local/cuda/lib64/
DYLD_LIBRARY_PATH is unset

== cuda libs  ===================================================
/usr/local/cuda-9.0/lib64/libcudart_static.a
/usr/local/cuda-9.0/lib64/libcudart.so.9.0.176
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.1/lib64/libcudart.so.9.1.85
/usr/local/cuda-9.1/lib64/libcudart_static.a
/usr/local/cuda-9.1/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.1/doc/man/man7/libcudart.7

Here is my gdb backtrace:

#0  0x00007ffefff0c9ab in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#1  0x00007ffefff17fc6 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007ffefd53b791 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) [clone .constprop.698] ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007ffefd53c06a in TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007ffefd237c8b in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007ffefd237d93 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007ffefd1f9fa4 in _wrap_TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00000000004c45fa in PyEval_EvalFrameEx ()
#8  0x00000000004c2705 in PyEval_EvalCodeEx ()
#9  0x00000000004de69e in ?? ()
#10 0x00000000004b0c93 in PyObject_Call ()
#11 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#12 0x00000000004c2705 in PyEval_EvalCodeEx ()
#13 0x00000000004ca7df in PyEval_EvalFrameEx ()
#14 0x00000000004c2705 in PyEval_EvalCodeEx ()
#15 0x00000000004ca7df in PyEval_EvalFrameEx ()
#16 0x00000000004c2705 in PyEval_EvalCodeEx ()
#17 0x00000000004ca7df in PyEval_EvalFrameEx ()
#18 0x00000000004c2705 in PyEval_EvalCodeEx ()
#19 0x00000000004ca088 in PyEval_EvalFrameEx ()
#20 0x00000000004c2705 in PyEval_EvalCodeEx ()
#21 0x00000000004c24a9 in PyEval_EvalCode ()
#22 0x00000000004f19ef in ?? ()
#23 0x00000000004ec372 in PyRun_FileExFlags ()
#24 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#25 0x000000000049e208 in Py_Main ()
#26 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=6, argv=0x7fffffffe478, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe468) at ../csu/libc-start.c:291
#27 0x000000000049da59 in _start ()

I also noticed that the process uses 65 threads (checked with cat /proc/22222/status). Is this normal behaviour?
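For reference, the same thread count can be read from inside Python; a small sketch:

import os

# Equivalent to `cat /proc/<pid>/status` and looking at the Threads: field.
with open('/proc/%d/status' % os.getpid()) as f:
    for line in f:
        if line.startswith('Threads:'):
            print(line.strip())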

mpkuse commented 6 years ago

Here is a standalone script for my code. https://gist.github.com/mpkuse/10985d9ca11d9555a3ef065aab5070d3

Sample execution:

1 (17, 4096)
2 (17, 4096)
3 (17, 4096)
4 (17, 4096)
5 (17, 4096)
6 (17, 4096)
7 (17, 4096)
8 (17, 4096)

I tested it on an older version and a newer version of tensorflow (on separate computers). I still get a segmentation fault on the newer computer. However, it works well on the older computer.

Following are the details of the config of the older computer.

== cat /etc/issue ===============================================
Linux mpkusex-ri-desktop2 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
VERSION="14.04.3 LTS, Trusty Tahr"
VERSION_ID="14.04"

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux mpkusex-ri-desktop2 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.13.0)
protobuf (3.3.0)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.0.1
tf.GIT_VERSION = v1.0.0-65-g4763edf-dirty
tf.COMPILER_VERSION = v1.0.0-65-g4763edf-dirty
Sanity check: array([1], dtype=int32)
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally

== env ==========================================================
LD_LIBRARY_PATH /home/mpkusex/ros_workspaces/trial_ws/devel/lib/x86_64-linux-gnu:/home/mpkusex/ros_workspaces/robotic_vision/devel/lib/x86_64-linux-gnu:/home/mpkusex/ros_installation/install_isolated/lib/x86_64-linux-gnu:/home/mpkusex/ros_workspaces/trial_ws/devel/lib:/home/mpkusex/ros_workspaces/robotic_vision/devel/lib:/home/mpkusex/caffe/build/install/share/lib:/home/mpkusex/caffe/build/install/share/lib/x86_64-linux-gnu:/home/mpkusex/ros_installation/install_isolated/lib:/usr/local/cuda/lib64:/opt/gurobi651/linux64/lib
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Mon Jan 29 12:15:38 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 690     Off  | 0000:03:00.0     N/A |                  N/A |
| 42%   59C    P0    N/A /  N/A |      1MiB /  1999MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 690     Off  | 0000:04:00.0     N/A |                  N/A |
| 35%   50C    P0    N/A /  N/A |    825MiB /  1998MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-7.5/doc/man/man7/libcudart.so.7
/usr/local/cuda-7.5/doc/man/man7/libcudart.7
/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5.18
/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart_static.a

Following is the backtrace on the new computer. The configuration of the new computer is in the first message of this thread.

Starting program: /usr/bin/python -i test_tfdata.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3418700 (LWP 48450)]
[New Thread 0x7ffff2c17700 (LWP 48451)]
[New Thread 0x7ffff0416700 (LWP 48452)]
[New Thread 0x7fffebc15700 (LWP 48453)]
[New Thread 0x7fffeb414700 (LWP 48454)]
[New Thread 0x7fffe6c13700 (LWP 48455)]
[New Thread 0x7fffe4412700 (LWP 48456)]
[New Thread 0x7fffe1c11700 (LWP 48457)]
[New Thread 0x7fffdf410700 (LWP 48458)]
[New Thread 0x7fffdcc0f700 (LWP 48459)]
[New Thread 0x7fffda40e700 (LWP 48460)]
[New Thread 0x7fffd7c0d700 (LWP 48461)]
[New Thread 0x7fffd540c700 (LWP 48462)]
[New Thread 0x7fffd2c0b700 (LWP 48463)]
[New Thread 0x7fffd040a700 (LWP 48464)]
[New Thread 0x7fffcdc09700 (LWP 48465)]
[New Thread 0x7fffcb408700 (LWP 48466)]
[New Thread 0x7fffcac07700 (LWP 48467)]
[New Thread 0x7fffca406700 (LWP 48468)]
[New Thread 0x7fffc9c05700 (LWP 48469)]
[New Thread 0x7fffc3404700 (LWP 48470)]
[New Thread 0x7fffbec03700 (LWP 48471)]
[New Thread 0x7fffbc402700 (LWP 48472)]
[New Thread 0x7fffb9c01700 (LWP 48473)]
[New Thread 0x7fffb7400700 (LWP 48474)]
[New Thread 0x7fffb4bff700 (LWP 48475)]
[New Thread 0x7fffb23fe700 (LWP 48476)]
[New Thread 0x7fffafbfd700 (LWP 48477)]
[New Thread 0x7fffad3fc700 (LWP 48478)]
[New Thread 0x7fffaabfb700 (LWP 48479)]
[New Thread 0x7fffa83fa700 (LWP 48480)]
[New Thread 0x7fffa5bf9700 (LWP 48481)]
[New Thread 0x7fffa53f8700 (LWP 48482)]
[New Thread 0x7fffa2bf7700 (LWP 48483)]
[New Thread 0x7fffa03f6700 (LWP 48484)]
[New Thread 0x7fff9dbf5700 (LWP 48485)]
[New Thread 0x7fff9b3f4700 (LWP 48486)]
[New Thread 0x7fff98bf3700 (LWP 48487)]
[New Thread 0x7fff963f2700 (LWP 48488)]
[New Thread 0x7fff93bf1700 (LWP 48489)]
[New Thread 0x7fff933f0700 (LWP 48490)]
[New Thread 0x7fff8ebef700 (LWP 48491)]
[New Thread 0x7fff8c3ee700 (LWP 48492)]
[New Thread 0x7fff89bed700 (LWP 48493)]
[New Thread 0x7fff873ec700 (LWP 48494)]
[New Thread 0x7fff84beb700 (LWP 48495)]
[New Thread 0x7fff823ea700 (LWP 48496)]
[New Thread 0x7fff7fbe9700 (LWP 48497)]
[New Thread 0x7fff7d3e8700 (LWP 48498)]
[New Thread 0x7fff7abe7700 (LWP 48499)]
[New Thread 0x7fff783e6700 (LWP 48500)]
[New Thread 0x7fff75be5700 (LWP 48501)]
[New Thread 0x7fff733e4700 (LWP 48502)]
[New Thread 0x7fff70be3700 (LWP 48503)]
[New Thread 0x7fff6e3e2700 (LWP 48504)]
[Thread 0x7fffdf410700 (LWP 48458) exited]
[Thread 0x7fff8c3ee700 (LWP 48492) exited]
[Thread 0x7fff7abe7700 (LWP 48499) exited]
[Thread 0x7fff7d3e8700 (LWP 48498) exited]
[Thread 0x7fff7fbe9700 (LWP 48497) exited]
[Thread 0x7fff823ea700 (LWP 48496) exited]
[Thread 0x7fff873ec700 (LWP 48494) exited]
[Thread 0x7fff8ebef700 (LWP 48491) exited]
[Thread 0x7fff933f0700 (LWP 48490) exited]
[Thread 0x7fff93bf1700 (LWP 48489) exited]
[Thread 0x7fff963f2700 (LWP 48488) exited]
[Thread 0x7fff98bf3700 (LWP 48487) exited]
[Thread 0x7fff9b3f4700 (LWP 48486) exited]
[Thread 0x7fffa03f6700 (LWP 48484) exited]
[Thread 0x7fffa2bf7700 (LWP 48483) exited]
[Thread 0x7fffa5bf9700 (LWP 48481) exited]
[Thread 0x7fffa83fa700 (LWP 48480) exited]
[Thread 0x7fffaabfb700 (LWP 48479) exited]
[Thread 0x7fffad3fc700 (LWP 48478) exited]
[Thread 0x7fffafbfd700 (LWP 48477) exited]
[Thread 0x7fffb23fe700 (LWP 48476) exited]
[Thread 0x7fffb9c01700 (LWP 48473) exited]
[Thread 0x7fffbc402700 (LWP 48472) exited]
[Thread 0x7fffbec03700 (LWP 48471) exited]
[Thread 0x7fffc3404700 (LWP 48470) exited]
[Thread 0x7fffc9c05700 (LWP 48469) exited]
[Thread 0x7fff783e6700 (LWP 48500) exited]
[Thread 0x7fff89bed700 (LWP 48493) exited]
[Thread 0x7fff6e3e2700 (LWP 48504) exited]
[Thread 0x7fff733e4700 (LWP 48502) exited]
[Thread 0x7fff70be3700 (LWP 48503) exited]
[Thread 0x7fff75be5700 (LWP 48501) exited]
[Thread 0x7fff84beb700 (LWP 48495) exited]
[Thread 0x7fff9dbf5700 (LWP 48485) exited]
[Thread 0x7fffa53f8700 (LWP 48482) exited]
[Thread 0x7fffb4bff700 (LWP 48475) exited]
[Thread 0x7fffb7400700 (LWP 48474) exited]
[Thread 0x7fffca406700 (LWP 48468) exited]
[Thread 0x7fffcac07700 (LWP 48467) exited]
[Thread 0x7fffcb408700 (LWP 48466) exited]
[Thread 0x7fffcdc09700 (LWP 48465) exited]
[Thread 0x7fffd040a700 (LWP 48464) exited]
[Thread 0x7fffd2c0b700 (LWP 48463) exited]
[Thread 0x7fffd540c700 (LWP 48462) exited]
[Thread 0x7fffd7c0d700 (LWP 48461) exited]
[Thread 0x7fffda40e700 (LWP 48460) exited]
[Thread 0x7fffdcc0f700 (LWP 48459) exited]
[Thread 0x7fffe1c11700 (LWP 48457) exited]
[Thread 0x7fffe4412700 (LWP 48456) exited]
[Thread 0x7fffe6c13700 (LWP 48455) exited]
[Thread 0x7fffeb414700 (LWP 48454) exited]
[Thread 0x7fffebc15700 (LWP 48453) exited]
[Thread 0x7ffff0416700 (LWP 48452) exited]
[Thread 0x7ffff2c17700 (LWP 48451) exited]
[Thread 0x7ffff3418700 (LWP 48450) exited]
[New Thread 0x7fff6e3e2700 (LWP 48545)]
[New Thread 0x7fff70be3700 (LWP 48546)]
[New Thread 0x7fff733e4700 (LWP 48547)]
[New Thread 0x7fff75be5700 (LWP 48548)]
[New Thread 0x7ffedde56700 (LWP 48549)]
[New Thread 0x7ffedd655700 (LWP 48550)]
[New Thread 0x7ffedce54700 (LWP 48551)]
[New Thread 0x7ffebffff700 (LWP 48552)]
[New Thread 0x7ffebf7fe700 (LWP 48553)]
[New Thread 0x7ffebeffd700 (LWP 48554)]
[New Thread 0x7ffebe7fc700 (LWP 48555)]
[New Thread 0x7ffebdffb700 (LWP 48556)]
[New Thread 0x7ffebd7fa700 (LWP 48557)]
[New Thread 0x7ffebcff9700 (LWP 48558)]
[New Thread 0x7ffe9ffff700 (LWP 48559)]
[New Thread 0x7ffe9f7fe700 (LWP 48560)]
[New Thread 0x7ffe9effd700 (LWP 48561)]
[New Thread 0x7ffe9e7fc700 (LWP 48562)]
[New Thread 0x7ffe9dffb700 (LWP 48563)]
[New Thread 0x7ffe9d7fa700 (LWP 48564)]
[New Thread 0x7ffe9cff9700 (LWP 48565)]
[New Thread 0x7ffe7ffff700 (LWP 48566)]
[New Thread 0x7ffe7f7fe700 (LWP 48567)]
[New Thread 0x7ffe7effd700 (LWP 48568)]
[New Thread 0x7ffe7e7fc700 (LWP 48569)]
[New Thread 0x7ffe7dffb700 (LWP 48570)]
[New Thread 0x7ffe7d7fa700 (LWP 48571)]
[New Thread 0x7ffe7cff9700 (LWP 48572)]
[New Thread 0x7ffe5ffff700 (LWP 48573)]
[New Thread 0x7ffe5f7fe700 (LWP 48574)]
[New Thread 0x7ffe5effd700 (LWP 48575)]
[New Thread 0x7ffe5e7fc700 (LWP 48576)]
[New Thread 0x7ffe5dffb700 (LWP 48577)]
[New Thread 0x7ffe5d7fa700 (LWP 48578)]
[New Thread 0x7ffe5cff9700 (LWP 48579)]
[New Thread 0x7ffe3ffff700 (LWP 48580)]
[New Thread 0x7ffe3f7fe700 (LWP 48581)]
[New Thread 0x7ffe3effd700 (LWP 48582)]
[New Thread 0x7ffe3e7fc700 (LWP 48583)]
[New Thread 0x7ffe3dffb700 (LWP 48584)]
[New Thread 0x7ffe3d7fa700 (LWP 48585)]
[New Thread 0x7ffe3cff9700 (LWP 48586)]
[New Thread 0x7ffe1ffff700 (LWP 48587)]
[New Thread 0x7ffe1f7fe700 (LWP 48588)]
[New Thread 0x7ffe1effd700 (LWP 48589)]
[New Thread 0x7ffe1e7fc700 (LWP 48590)]
[New Thread 0x7ffe1dffb700 (LWP 48591)]
[New Thread 0x7ffe1d7fa700 (LWP 48592)]
[New Thread 0x7ffe1cff9700 (LWP 48593)]
[New Thread 0x7ffdfffff700 (LWP 48594)]
[New Thread 0x7ffdff7fe700 (LWP 48595)]
[New Thread 0x7ffdfeffd700 (LWP 48597)]
[New Thread 0x7ffdfe7fc700 (LWP 48598)]
[New Thread 0x7ffdfdffb700 (LWP 48599)]
[New Thread 0x7ffdfd7fa700 (LWP 48600)]
[New Thread 0x7ffdfcff9700 (LWP 48601)]
[New Thread 0x7ffdd9fff700 (LWP 48606)]
[New Thread 0x7ffdd97fe700 (LWP 48607)]
[New Thread 0x7ffdd0ffd700 (LWP 48608)]
[New Thread 0x7ffdd8ffd700 (LWP 48612)]
[New Thread 0x7ffdd2bff700 (LWP 48616)]
[New Thread 0x7ffdd23fe700 (LWP 48617)]
[New Thread 0x7ffdd1bfd700 (LWP 48618)]
[New Thread 0x7ffac3fff700 (LWP 48619)]
[New Thread 0x7ffacb4c9700 (LWP 48620)]
[New Thread 0x7ffacacc8700 (LWP 48621)]
[New Thread 0x7ffaca4c7700 (LWP 48622)]
[New Thread 0x7ffac9cc6700 (LWP 48623)]
[New Thread 0x7ffac94c5700 (LWP 48624)]
[New Thread 0x7ffac8cc4700 (LWP 48625)]
[New Thread 0x7ffac37fe700 (LWP 48626)]
[New Thread 0x7ffac2ffd700 (LWP 48627)]
[New Thread 0x7ffac27fc700 (LWP 48628)]
[New Thread 0x7ffac1ffb700 (LWP 48629)]
[New Thread 0x7ffac17fa700 (LWP 48630)]
[New Thread 0x7ffac0ff9700 (LWP 48631)]
[New Thread 0x7ffa8bfff700 (LWP 48632)]
[New Thread 0x7ffa8b7fe700 (LWP 48633)]
[New Thread 0x7ffa8affd700 (LWP 48634)]
[New Thread 0x7ffa8a7fc700 (LWP 48635)]
[New Thread 0x7ffa89ffb700 (LWP 48636)]
[New Thread 0x7ffa897fa700 (LWP 48637)]
[New Thread 0x7ffa88ff9700 (LWP 48638)]
[New Thread 0x7ffa6bfff700 (LWP 48639)]
[New Thread 0x7ffa6b7fe700 (LWP 48640)]
[New Thread 0x7ffa6affd700 (LWP 48641)]
[New Thread 0x7ffa6a7fc700 (LWP 48642)]
[New Thread 0x7ffa69ffb700 (LWP 48643)]
[New Thread 0x7ffa697fa700 (LWP 48644)]
[New Thread 0x7ffa68ff9700 (LWP 48645)]
[New Thread 0x7ffa4bfff700 (LWP 48646)]
[New Thread 0x7ffa4b7fe700 (LWP 48647)]
[New Thread 0x7ffa4affd700 (LWP 48648)]
[New Thread 0x7ffa4a7fc700 (LWP 48649)]
[New Thread 0x7ffa49ffb700 (LWP 48650)]
[New Thread 0x7ffa497fa700 (LWP 48651)]
[New Thread 0x7ffa48ff9700 (LWP 48652)]
[New Thread 0x7ffa2bfff700 (LWP 48653)]
[New Thread 0x7ffa2b7fe700 (LWP 48654)]
[New Thread 0x7ffa2affd700 (LWP 48655)]
[New Thread 0x7ffa2a7fc700 (LWP 48656)]
[New Thread 0x7ffa29ffb700 (LWP 48657)]
[New Thread 0x7ffa297fa700 (LWP 48658)]
[New Thread 0x7ffa28ff9700 (LWP 48659)]
[New Thread 0x7ffa0bfff700 (LWP 48660)]
[New Thread 0x7ffa03fff700 (LWP 48661)]
[New Thread 0x7ffa0b7fe700 (LWP 48662)]
[New Thread 0x7ffa0affd700 (LWP 48663)]
[New Thread 0x7ffa0a7fc700 (LWP 48664)]
[New Thread 0x7ffa09ffb700 (LWP 48665)]
[New Thread 0x7ffa097fa700 (LWP 48666)]
[New Thread 0x7ffa08ff9700 (LWP 48667)]
[New Thread 0x7ffa037fe700 (LWP 48668)]
[New Thread 0x7ffa02ffd700 (LWP 48669)]
[New Thread 0x7ffa027fc700 (LWP 48670)]
[New Thread 0x7ffa01ffb700 (LWP 48671)]
[New Thread 0x7ffa017fa700 (LWP 48672)]
[New Thread 0x7ffa00ff9700 (LWP 48673)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff54286fcd in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#0  0x00007fff54286fcd in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#1  0x00007fff542890dd in tensorflow::ResourceMgr::Cleanup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#2  0x00007fff58fa2972 in std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), tensorflow::DirectSession::RunState::RunState(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long long, std::vector<tensorflow::Device*, std::allocator<tensorflow::Device*> > const*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007fff58fa5b25 in tensorflow::DirectSession::RunState::~RunState() ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007fff58fb0461 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007fff565d4791 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) [clone .constprop.698] ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007fff565d506a in TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00007fff562d0c8b in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x00007fff562d0d93 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007fff56292fa4 in _wrap_TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00000000004c45fa in PyEval_EvalFrameEx ()
#11 0x00000000004c2705 in PyEval_EvalCodeEx ()
#12 0x00000000004de69e in ?? ()
#13 0x00000000004b0c93 in PyObject_Call ()
#14 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#15 0x00000000004c2705 in PyEval_EvalCodeEx ()
#16 0x00000000004ca7df in PyEval_EvalFrameEx ()
#17 0x00000000004c2705 in PyEval_EvalCodeEx ()
#18 0x00000000004ca7df in PyEval_EvalFrameEx ()
#19 0x00000000004c2705 in PyEval_EvalCodeEx ()
#20 0x00000000004ca7df in PyEval_EvalFrameEx ()
#21 0x00000000004c2705 in PyEval_EvalCodeEx ()
#22 0x00000000004ca088 in PyEval_EvalFrameEx ()
#23 0x00000000004c2705 in PyEval_EvalCodeEx ()
#24 0x00000000004c24a9 in PyEval_EvalCode ()
#25 0x00000000004f19ef in ?? ()
#26 0x00000000004ec372 in PyRun_FileExFlags ()
#27 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#28 0x000000000049e208 in Py_Main ()
#29 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=3, 
    argv=0x7fffffffe488, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe478)
    at ../csu/libc-start.c:291
#30 0x000000000049da59 in _start ()
quit
tatatodd commented 6 years ago

@asimshankar any thoughts on this, or can you suggest someone who might be able to look into this?

mpkuse commented 6 years ago

Any updates on this?

skye commented 6 years ago

Sorry for the delay... @mpkuse and @PeiliangLi, are either of you still experiencing these segfaults? It's possible they were fixed since you originally posted.

mpkuse commented 6 years ago

I upgraded to TensorFlow 1.6.0 and am still experiencing the same issue. Here is my standalone code that replicates the issue: https://gist.github.com/mpkuse/10985d9ca11d9555a3ef065aab5070d3

skye commented 6 years ago

@mpkuse do you experience the crash when you disable GPU?

PeiliangLi commented 6 years ago

@skye No, we tried the same TensorFlow version on CPU, and there was no crash.
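For reference, one way to force a CPU-only run for this kind of check (a sketch; either hiding the GPUs before the import or passing a ConfigProto with no GPU devices should work):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''   # hide GPUs; must be set before importing tensorflow

import tensorflow as tf

config = tf.ConfigProto(device_count={'GPU': 0})  # additionally forbid GPU devices
with tf.Session(config=config) as sess:
    print(sess.run(tf.constant([1])))             # sanity check, runs on CPU only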

ghost commented 6 years ago

Any updates on this? gdb output:

Thread 41 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff1cff9700 (LWP 28774)]
0x00007fffd4c14998 in Eigen::internal::gemm_pack_rhs<float, long, 
Eigen::internal::TensorContractionSubMapper<float, long, 0, 
Eigen::TensorEvaluator<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, 
Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, 
Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, 
Eigen::array<long, 1ul>, 8, true, false, 0, Eigen::MakePointer>, 4, 0, false, false>::operator()(float*, 
Eigen::internal::TensorContractionSubMapper<float, long, 0, 
Eigen::TensorEvaluator<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, 
Eigen::TensorVolumePatchOp<-1l, -1l, -1l, Eigen::TensorMap<Eigen::Tensor<float const, 5, 1, long>, 16, 
Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, 
Eigen::array<long, 1ul>, 8, true, false, 0, Eigen::MakePointer> const&, long, long, long, long) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
anhnt1 commented 6 years ago

I also have this error. Any updates so far?

ghost commented 6 years ago

After checking all the installed packages once more, I found out that I had mistakenly installed the CPU-only TensorFlow package. Uninstalling tensorflow and installing tensorflow-gpu solved the problem.
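For anyone hitting the same thing, a quick sanity check (a sketch; device_lib is an internal module but is available in the 1.x releases) to confirm which build is installed and whether TensorFlow can actually see the GPUs:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.VERSION)
print(tf.test.is_built_with_cuda())        # False for a CPU-only package
print([d.name for d in device_lib.list_local_devices()])  # should list /device:GPU:N entries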

caot commented 6 years ago

I ran into the same issue on GPU with conda install tensorflow-gpu and/or a build from source; however, TensorFlow worked on CPU:

(gdb) run
Starting program: /anaconda3/envs/tensorflow-gpu/bin/python 
[Thread debugging using libthread_db enabled]
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Missing separate debuginfo for /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/2f/ffee478c58c351d3624c7aeee95c351cdacfea.debug

Program received signal SIGSEGV, Segmentation fault.
0x00007fffd7b7c5e4 in _ZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ () from /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64
(gdb) runbt full
Undefined command: "runbt".  Try "help".
(gdb) bt full
#0  0x00007fffd7b7c5e4 in _ZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ () from /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
No symbol table info available.
#1  0x00007fffd7b7c64e in tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
No symbol table info available.
#2  0x00007fffd7b7c341 in tensorflow::port::(anonymous namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
No symbol table info available.
#3  0x00007fffd7b7c394 in _GLOBAL__sub_I_cpu_feature_guard.cc () from /anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
No symbol table info available.
#4  0x000000307ae0e59f in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#5  0x000000307ae12cb5 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#6  0x000000307ae0e1b6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#7  0x000000307ae124fa in _dl_open () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#8  0x000000368ae00f66 in dlopen_doit () from /lib64/libdl.so.2
No symbol table info available.
#9  0x000000307ae0e1b6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#10 0x000000368ae0129c in _dlerror_run () from /lib64/libdl.so.2
No symbol table info available.
#11 0x000000368ae00ee1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
No symbol table info available.
#12 0x00007ffff7c79161 in _PyImport_FindSharedFuncptr (prefix=0x7ffff7d026a6 "PyInit", shortname=0x7fffebee3410 "_pywrap_tensorflow_internal", 
    pathname=0x7fffebe67050 "/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", fp=0x0) at ./Python/dynload_shlib.c:95
        p = <value optimized out>
        handle = <value optimized out>
        funcname = "PyInit__pywrap_tensorflow_internal\000\000\001", '\000' <repeats 11 times>, "\006\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\062.\273\367\377\177\000\000@ֶ\361\377\177\000\000 k\377\377\377\177", '\000' <repeats 18 times>, " k\377\377\377\177\000\000\240\003\362\353\377\177\000\000\240\003\362\353\377\177\000\000\022\026\276\367\377\177\000\000\000\000\000\000\000\000\000\000X \320\367\377\177\000\000\000\000\000\000\000\000\000\000\241D\276\367\377\177\000\000\330p!\354\377\177\000\000\360\063\356\353\377\177\000\000H2\356\353\377\177\000\000\251\005\304\367\377\177\000\000\240\003\362\353\377\177\000\000\004\273\267\367\377\177\000\000\320j\377\377\377\177\000\000\020k\377\377\377\177\000\000\020k\377\377\377\177\000\000\377\177\000\000\000\000\000\000\210"
        pathbuf = "\033T(\004\212;=\245X \320\367\377\177\000\000origin\000\000\330p!\354\377\177\000\000\360\330\355\353\377\177\000\000\030\363\355\353\377\177\000\000\240\227\273\361\377\177\000\000\033T(\004\212;=\245@\322\355\353\377\177\000\000\370=a", '\000' <repeats 13 times>"\246, շ\367\377\177\000\000\000\311\355\353\377\177\000\000 k\377\377\377\177\000\000\310q\227\000\000\000\000\000\316\b\271\367\377\177\000\000\001\000\000\000\177\000\000\000\265/\271\367\377\177\000\000\022\000\000\000\000\000\000\000a3\271\367\377\177\000\000\240\227\273\361\377\177\000\000\227\000\000\000\000\000\000\000\260k\377\377\377\177\000\000\200\035\362\353\377\177\000\000Pp\346\353\377\177\000\000\375\003\264\367\377\177\000\000\360\063\356\353\377\177\000\000v\000\000\000\000\000\000\000\260k\377\377\377\177\000\000\200\234d\000\000\000\000\000P\035\362\353\377\177\000\000;R\275\367\377\177\000\000utf_"
        dlopenflags = <value optimized out>
#13 0x00007ffff7c4b08f in _PyImport_LoadDynamicModuleWithSpec (spec=0x7fffebedd240, fp=0x0) at ./Python/importdl.c:129
        pathbytes = 0x7fffebe67030
        name_unicode = 0x7fffebf203a0
        name = <value optimized out>
        path = 0x7fffebf21d50
        m = 0x0
        name_buf = 0x7fffebee3410 "_pywrap_tensorflow_internal"
        hook_prefix = 0x7ffff7d026a6 "PyInit"
        oldcontext = <value optimized out>
        exportfunc = <value optimized out>
        def = <value optimized out>
        p0 = <value optimized out>
#14 0x00007ffff7c4921b in _imp_create_dynamic_impl (module=Unhandled dwarf expression opcode 0xf3
) at Python/import.c:1982
        mod = 0x0
        name = 0x7fffebf203a0
---Type <return> to continue, or q <return> to quit---
        path = 0x7fffebf21d50
        fp = 0x0
#15 _imp_create_dynamic (module=Unhandled dwarf expression opcode 0xf3
) at Python/clinic/import.c.h:289
        return_value = 0x0
        spec = 0x7fffebedd240
        file = 0x0
#16 0x00007ffff7b8c1b9 in PyCFunction_Call (func=0x7ffff1bcbee8, args=0x7fffebedd2b0, kwds=Unhandled dwarf expression opcode 0xf3
) at Objects/methodobject.c:114
        f = 0x7ffff1bcbee8
        meth = 0x7ffff7c49110 <_imp_create_dynamic>
        self = 0x7ffff1bcc4a8
        arg = <value optimized out>
        res = <value optimized out>
        size = <value optimized out>
        flags = 1
#17 0x00007ffff7c2cbe8 in do_call_core (f=Unhandled dwarf expression opcode 0xf3
) at Python/ceval.c:5089
        result = <value optimized out>
        tstate = <value optimized out>
#18 _PyEval_EvalFrameDefault (f=Unhandled dwarf expression opcode 0xf3
) at Python/ceval.c:3391
        func = 0x7ffff1bcbee8
        callargs = 0x7fffebedd2b0
        kwargs = 0x7fffebedfcf0
        stack_pointer = 0x8081f0
        next_instr = 0x7ffff1bd8268
        opcode = <value optimized out>
        oparg = <value optimized out>
        why = <value optimized out>
        fastlocals = <error reading variable fastlocals (Unhandled dwarf expression opcode 0xf3)>
        freevars = <value optimized out>
        retval = 0x0
        tstate = <value optimized out>
        co = <value optimized out>
        instr_ub = -1
        instr_lb = 0
        instr_prev = -1
        first_instr = <value optimized out>
        names = <value optimized out>
        consts = <value optimized out>
        opcode_targets = {0x7ffff7c2aa60, 0x7ffff7c26ffc, 0x7ffff7c2aa17, 0x7ffff7c2a5e6, 0x7ffff7c274f9, 0x7ffff7c2749e, 0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c2745f, 0x7ffff7c273df, 0x7ffff7c27365, 0x7ffff7c272d9, 
          0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c2a389, 0x7ffff7c29b88, 0x7ffff7c27f96, 0x7ffff7c2aa60, 0x7ffff7c27ef2, 0x7ffff7c27e55, 0x7ffff7c2aa60, 0x7ffff7c27d8c, 0x7ffff7c27cc3, 0x7ffff7c27c26, 0x7ffff7c27b89, 0x7ffff7c27aec, 
          0x7ffff7c27a4f, 0x7ffff7c279b2, 0x7ffff7c27915, 0x7ffff7c2aa60 <repeats 20 times>, 0x7ffff7c27837, 0x7ffff7c2777e, 0x7ffff7c276aa, 0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c275e1, 0x7ffff7c27544, 0x7ffff7c2723c, 0x7ffff7c2aa60, 
          0x7ffff7c2719f, 0x7ffff7c270e9, 0x7ffff7c2704f, 0x7ffff7c2992d, 0x7ffff7c29890, 0x7ffff7c26a06, 0x7ffff7c26969, 0x7ffff7c268cc, 0x7ffff7c26828, 0x7ffff7c28bc9, 0x7ffff7c28b35, 0x7ffff7c28a90, 0x7ffff7c289fa, 0x7ffff7c286e8, 
          0x7ffff7c2865e, 0x7ffff7c2aa60, 0x7ffff7c285c1, 0x7ffff7c28524, 0x7ffff7c28946, 0x7ffff7c288a9, 0x7ffff7c2880c, 0x7ffff7c28801, 0x7ffff7c26ef1, 0x7ffff7c26e3b, 0x7ffff7c289e3, 0x7ffff7c26c30, 0x7ffff7c2825d, 0x7ffff7c28231, 
          0x7ffff7c281df, 0x7ffff7c2813c, 0x7ffff7c280d7, 0x7ffff7c28033, 0x7ffff7c291ee, 0x7ffff7c290f0, 0x7ffff7c2907f, 0x7ffff7c28f99, 0x7ffff7c28ef2, 0x7ffff7c28e62, 0x7ffff7c28dd0, 0x7ffff7c2a45d, 0x7ffff7c2aa60, 0x7ffff7c2a409, 
          0x7ffff7c29a12, 0x7ffff7c299ca, 0x7ffff7c29aaa, 0x7ffff7c28d50, 0x7ffff7c28cc9, 0x7ffff7c28c43, 0x7ffff7c29f67, 0x7ffff7c2a75b, 0x7ffff7c2a990, 0x7ffff7c2a4d1, 0x7ffff7c2849a, 0x7ffff7c28413, 0x7ffff7c283bb, 0x7ffff7c2830e, 
          0x7ffff7c29265, 0x7ffff7c2a637, 0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c2a73a, 0x7ffff7c2d1de, 0x7ffff7c26756, 0x7ffff7c26756, 0x7ffff7c2aa60, 0x7ffff7c2a57e, 0x7ffff7c2a514, 0x7ffff7c2a6c1, 0x7ffff7c29e63, 0x7ffff7c2aa60, 
          0x7ffff7c2aa60, 0x7ffff7c29e2b, 0x7ffff7c29daf, 0x7ffff7c2a034, 0x7ffff7c29f96, 0x7ffff7c2aa60, 0x7ffff7c2a322, 0x7ffff7c29310, 0x7ffff7c29c90, 0x7ffff7c29c25, 0x7ffff7c2aa60, 0x7ffff7c2aa60, 0x7ffff7c297f8, 0x7ffff7c29645, 
          0x7ffff7c29537, 0x7ffff7c2951c, 0x7ffff7c29490, 0x7ffff7c29404, 0x7ffff7c29d0a, 0x7ffff7c26aa3, 0x7ffff7c266b7, 0x7ffff7c29af5, 0x7ffff7c26b3b, 0x7ffff7c266b7, 0x7ffff7c2a157, 0x7ffff7c2a2a3, 0x7ffff7c2a1d8, 0x7ffff7c2a8c1, 
          0x7ffff7c29384, 0x7ffff7c2cd72, 0x7ffff7c2aa60 <repeats 97 times>}
#19 0x00007ffff7c2501e in _PyEval_EvalCodeWithName (_co=0x7ffff1c03db0, globals=Unhandled dwarf expression opcode 0xf3
) at Python/ceval.c:4153
        co = 0x7ffff1c03db0
        f = 0x808058
        retval = 0x0
        fastlocals = 0x8081d0
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb)

$ nvidia-smi
Thu Jul 12 10:15:53 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:05:00.0 Off |                    0 |
| N/A   30C    P0    63W / 235W |      0MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:42:00.0 Off |                    0 |
| N/A   30C    P0    65W / 235W |      0MiB / 11439MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
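
For reference, frames #9–#11 in the gdb output above are dlopen itself and frames #12–#19 are the Python import machinery, so the crash happens while Python loads the _pywrap_tensorflow_internal.so extension. A minimal sketch for reproducing that import in isolation, with the standard-library faulthandler enabled so a Python-level frame is still printed if the process dies inside dlopen (nothing here is assumed beyond the import itself):

# Minimal reproduction sketch: enable faulthandler before importing TensorFlow,
# so a Python traceback is dumped even on SIGSEGV during the import.
import faulthandler

faulthandler.enable()

import tensorflow as tf  # frames #12-#19 above correspond to this import
print(tf.VERSION)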
mikeshiyang commented 6 years ago

I hit a similar segfault here. The backtrace is as follows:

[New Thread 0x7f96ccff9700 (LWP 23574)]
[Thread 0x7f96ccff9700 (LWP 23574) exited]
[New Thread 0x7f96ccff9700 (LWP 23575)]
[Thread 0x7f96ccff9700 (LWP 23575) exited]
[New Thread 0x7f96ccff9700 (LWP 23576)]
[Thread 0x7f96ccff9700 (LWP 23576) exited]
[New Thread 0x7f96ccff9700 (LWP 23577)]
[Thread 0x7f96ccff9700 (LWP 23577) exited]
[New Thread 0x7f96ccff9700 (LWP 23578)]
[Thread 0x7f96ccff9700 (LWP 23578) exited]
[New Thread 0x7f96ccff9700 (LWP 23579)]
[Thread 0x7f96ce7fc700 (LWP 23166) exited]
[New Thread 0x7f96ce7fc700 (LWP 23580)]
[Thread 0x7f9722ffd700 (LWP 23177) exited]
[New Thread 0x7f9722ffd700 (LWP 23581)]
[Thread 0x7f96ccff9700 (LWP 23579) exited]
[New Thread 0x7f96ccff9700 (LWP 23582)]

Thread 75 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f97f5ffb700 (LWP 105)]
0x00007f9be9ecf660 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
(gdb) where
#0  0x00007f9be9ecf660 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#1  0x00007f9be9ecf705 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find(std::string const&) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#2  0x00007f9be9ecff26 in tensorflow::ResourceMgr::DoLookup(std::string const&, std::type_index, std::string const&, tensorflow::ResourceBase**) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x00007f9c2c26595b in tensorflow::Status tensorflow::LookupResource<tensorflow::(anonymous namespace)::Batcher>(tensorflow::OpKernelContext*, tensorflow::ResourceHandle const&, tensorflow::(anonymous namespace)::Batcher**) () from ./batcher.so
#4  0x00007f9c2c265b11 in tensorflow::(anonymous namespace)::ComputeOp::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>) () from ./batcher.so
#5  0x00007f9bef58112f in tensorflow::Device::ComputeAsync(tensorflow::AsyncOpKernel*, tensorflow::OpKernelContext*, std::function<void ()>) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f9bea078622 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#7  0x00007f9bea07889a in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#8  0x00007f9bea0d6e2a in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#9  0x00007f9bea0d5ed2 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#10 0x00007f9bccc19c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f9c50a346ba in start_thread (arg=0x7f97f5ffb700) at pthread_create.c:333
#12 0x00007f9c5076a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

System Information:

OS: Ubuntu 16.04
CUDA: 9.0, cuDNN: 7
Python: 2.7
TensorFlow: tensorflow-gpu 1.9.0
GPU: NVIDIA TITAN V, driver 396.24.10

@PeiliangLi Did you figure out where the segfault was coming from? It looks like our traces are very similar.
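
(For anyone hitting the same pattern: the trace crosses from the custom op in ./batcher.so into libtensorflow_framework.so, and an ABI mismatch between how the op was built and the installed wheel is a common cause of crashes like this. A minimal check, assuming the tf.sysconfig helpers available in recent 1.x wheels, is to compare the flags the wheel expects with the ones the op was actually compiled with:)

# Sketch: print the compile/link flags the installed TensorFlow wheel expects,
# so a custom op such as batcher.so can be rebuilt against the same ABI
# (e.g. matching -D_GLIBCXX_USE_CXX11_ABI and the framework library path).
import tensorflow as tf

print("compile flags:", tf.sysconfig.get_compile_flags())
print("link flags:   ", tf.sysconfig.get_link_flags())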

mohantym commented 3 years ago

We see that you are using an old version of TensorFlow that has officially reached end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions. Please open a new issue if you face any errors, and we will get you the right help. Thanks!
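
A quick way to confirm the upgrade took effect, assuming a 2.4+ install (e.g. after pip install --upgrade tensorflow), is a sketch like:

# Sketch: verify the active TensorFlow version and GPU visibility after upgrading.
import tensorflow as tf

print(tf.__version__)  # should print 2.4.x or later
print("GPUs visible:", tf.config.list_physical_devices("GPU"))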

google-ml-butler[bot] commented 3 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 3 years ago

Closing as stale. Please reopen if you'd like to work on this further.