tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Segmentation Fault (SIGSEGV) in middle of Training due to Runtime-statistics calculation ops. #22987

Closed BigBang0072 closed 5 years ago

BigBang0072 commented 5 years ago

System information

Describe the problem

The training crashes with Segmentation Fault (Core Dumped) in the middle of training, after around 14-18 epochs (with ~800 minibatches in each), even though I have sufficient RAM (utilization holds steadily at ~12-20% before the crash).

[screenshot from 2018-10-14 16-46-38]

Possible Memory Leak Checked
Initially, I suspected a memory leak due to new ops being added after each iteration, so I finalized the graph in the training session. But this was not the source of the problem: no new nodes were added at each iteration.
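For reference, the check described here amounts to calling Graph.finalize() before the training loop, roughly as in this minimal sketch (the toy graph and names are illustrative, not the code from the repository):

```python
import tensorflow as tf

# Toy graph standing in for the real model.
x = tf.placeholder(tf.float32, shape=[None, 4])
w = tf.Variable(tf.zeros([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Freeze the graph: any op added inside the loop now raises an
    # error instead of silently growing the graph (and memory) each step.
    sess.graph.finalize()
    for step in range(100):
        sess.run(train_op, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
```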

Attempt 2: Locating the problem
After that, I ran the training under gdb (see the stack trace in the logs section) and, along with some commenting-out of the code, I have almost pinpointed the source of the error.
Currently, every 30 minibatches of training I save the summary here in the code. The full code can be found here.
[screenshot from 2018-10-14 17-13-59]

The train_track_op in line number 415 is a list which looks like this:
[gradient_update_op, loss_GPU1, loss_GPU2, merged_summary_op]

If I comment out the section in lines 406-438, the training runs without error.

Attempt 3: Exact Location
Now I comment out lines 422 to 438 (the part where I save the timeline and summary). I have checked that there is no problem due to these lines.

Now, if I run the merged_summary op along with the rest of the training ops and comment out lines 418 and 419, i.e. removing the run_options and run_metadata, the training runs without error. If I leave these two lines uncommented, the segfault comes back.
[screenshot from 2018-10-15 17-41-41]

So, it seems some memory is being leaked when calculating the runtime statistics using run_options and run_metadata while doing the full trace of the graph.
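For context, the pattern described above (running the training ops with a periodic full trace) boils down to something like the following sketch; the toy graph and names are illustrative stand-ins, not the actual code from train_multi_gpu.py:

```python
import tensorflow as tf

# Illustrative model in place of the real multi-GPU graph.
x = tf.placeholder(tf.float32, shape=[None, 4])
w = tf.Variable(tf.zeros([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
tf.summary.scalar('loss', loss)
merged_summary_op = tf.summary.merge_all()

# Stand-in for [gradient_update_op, loss_GPU1, loss_GPU2, merged_summary_op].
train_track_op = [train_op, loss, merged_summary_op]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('/tmp/tf_logs', sess.graph)
    for step in range(1000):
        feed = {x: [[1.0, 2.0, 3.0, 4.0]]}
        if step % 30 == 0:
            # Full-trace run every 30 minibatches: this is the path that
            # reportedly triggers the segfault.
            run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
            run_metadata = tf.RunMetadata()
            results = sess.run(train_track_op, feed_dict=feed,
                               options=run_options, run_metadata=run_metadata)
            writer.add_run_metadata(run_metadata, 'step%d' % step)
            writer.add_summary(results[-1], step)
        else:
            sess.run(train_track_op, feed_dict=feed)
```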

Source code / logs

Stack Trace

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff2d3fff700 (LWP 138819)]
0x00007ff2b8b00d6f in ?? () from /usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0

(gdb) backtrace
#0  0x00007ff2b8b00d6f in ?? () from /usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0
#1  0x00007ff2b8b04fd0 in ?? () from /usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0
#2  0x00007ff2b8d10c39 in ?? () from /usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0
#3  0x00007ffff77fae25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6e1bbad in clone () from /lib64/libc.so.6

azaks2 commented 5 years ago

CC @nluehr

This looks like an issue with CUPTI. Nathan, could you help us triage this issue?

Thanks in advance!

nluehr commented 5 years ago

@azaks2 The bug is not related to CUPTI.

The problem is that ExecutorState caches a raw pointer to the TraceCollector here. The cached copy then outlives the owning unique_ptr created in DirectSession::RunInternal().

Either a shared_ptr should be used (if multiple TraceCollectors can coexist) or a mutex is needed to keep DirectSession from destroying the TraceCollector while it is in use by ExecutorState.

MikulasZelinka commented 5 years ago

I'm having the exact same problem with run_options and run_metadata on tensorflow-gpu==1.12.0 (pip), CUDA 9.0, cuDNN 7.1, CentOS Linux release 7.4.1708, Python 3.6.2.

azaks2 commented 5 years ago

@nluehr thanks for debugging this. I see your point that things are horribly broken when there are concurrent RunInternal() calls. However, with one RunInternal() call things should work since RunInternal() waits for all executors to finish.

@BigBang0072 do you have concurrent session runs?

nluehr commented 5 years ago

Not sure I see synchronization of runners in RunInternal.

In TF 1.12 I add two prints: one of (tracing::TraceCollector*)this in ~DeviceTracerImpl() after https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/platform/default/device_tracer.cc#L400 and the second of trace_collector after https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/executor.cc#L1578.

Then I run train_multi_gpu.py provided above and the following output is generated before triggering a SEGFAULT at https://github.com/grasseau/HAhRD/blob/master/GSOC18/train_multi_gpu.py#L443-L447.

Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Destroying DeviceTracerImpl=0x12d87c380 at tensorflow/core/platform/default/device_tracer.cc:404
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585
Using trace_collector=0x12d87c380 at tensorflow/core/common_runtime/executor.cc:1585

So the destructor seems to be called before the executor has completed.

BigBang0072 commented 5 years ago

@azaks2 No, I just initiate one session while training. You can see this here: https://github.com/grasseau/HAhRD/blob/7eacc002672da28a1f53015227497469eb2af69d/GSOC18/train_multi_gpu.py#L368

azaks2 commented 5 years ago

I was able to reproduce the problem and have a fix. The problem was the data pipeline running concurrently with the training steps.
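To make "running concurrently" concrete: with a tf.data input pipeline, prefetching keeps background threads producing batches while the main thread is inside sess.run, roughly as in this sketch (illustrative only, not the repository's actual pipeline):

```python
import tensorflow as tf

# Toy dataset; prefetch() keeps a background thread filling a buffer
# while the main thread executes training steps.
dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([1000, 4]))
dataset = dataset.batch(32).prefetch(2)
batch = dataset.make_one_shot_iterator().get_next()

w = tf.Variable(tf.zeros([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(batch, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    while True:
        try:
            # Input-pipeline ops run concurrently with this training step.
            sess.run(train_op)
        except tf.errors.OutOfRangeError:
            break
```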

BigBang0072 commented 5 years ago

Thanks for all your time. But we have to run the data pipeline concurrently with the training; otherwise, it will critically slow down our performance. Also, how is the data pipeline affecting the tracing of the tensorflow graph? Apart from that, if I just remove the tracing, I don't get any segmentation fault even with the concurrent data pipeline running.

Maybe I am missing something. Could you please elaborate on what you have found?

Thanks again for your help and time.

azaks2 commented 5 years ago

I just fixed the lifetime of the internal profiling-related objects so that they do not get GCed at the end of session.run.

Khodeir-hubdoc commented 5 years ago

I'm encountering a really similar issue with multi gpu (replicate_model_fn) training and a parallel data pipeline. Several hours into the training loop, and with no warning/error message, we get a segmentation fault from tf, and the training halts. Is this a known issue and if so, is it solved in a newer release?

BigBang0072 commented 5 years ago

Don't log metadata for now, or log it very infrequently (mostly when you want to debug). I don't know about the newer release, but this will temporarily fix the problem.
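One way to implement that workaround is to gate the full trace behind a debug flag so run_metadata is only requested rarely, if ever; a minimal sketch (the flag and helper names are mine, not from the repository):

```python
import tensorflow as tf

DEBUG_TRACE = False        # flip on only when profiling is actually needed
TRACE_EVERY_N_STEPS = 500  # and even then, trace very infrequently

def run_step(sess, fetches, feed, step, writer=None):
    """Run one training step, attaching a full trace only when asked to."""
    if DEBUG_TRACE and step % TRACE_EVERY_N_STEPS == 0:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        results = sess.run(fetches, feed_dict=feed,
                           options=run_options, run_metadata=run_metadata)
        if writer is not None:
            writer.add_run_metadata(run_metadata, 'step%d' % step)
        return results
    # Default path: no tracing, which avoids the crash described above.
    return sess.run(fetches, feed_dict=feed)
```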