hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0
791 stars 90 forks source link

Latest github code segfaults on Ubuntu 16.04 / NVIDIA #34

Closed hughperkins closed 7 years ago

hughperkins commented 7 years ago

Using latest github code version, on Ubuntu 16.04 / nvidia, a bunch of tests pass, but every so often (quite often, unusably often), it segfaults:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordina
l 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K
520, pci bus id: 0000.0000)
cl_driver DeviceAllocate 848478208
Segmentation fault (core dumped)

example backtrace, from gdb, https://gist.github.com/hughperkins/10855efd242b0786c7dfc2aa4075e59a

This looks annoyingly hard to diagnose/debug... :-(

Edit: backtrace with debug build: https://gist.github.com/hughperkins/68f636beb90fa9c8cb6d4687acce9f05

ghost commented 7 years ago

I recently ran the MNIST tutorial in Tensorflow-CL: everything workedgreat, but when I closed the Python shell session I got a segfault. If it's the same bug, perhaps it's somewhere in free-ing code, or session cleanup.

On 28 May 2017 23:13:20 GMT+01:00, Hugh Perkins notifications@github.com wrote:

Using latest github code version, on Ubuntu 16.04 / nvidia, a bunch of tests pass, but every so often (quite often, unusably often), it segfaults:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable
peer access from device ordinal 0 to device ordina
l 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GRID K
520, pci bus id: 0000.0000)
cl_driver DeviceAllocate 848478208
Segmentation fault (core dumped)

example backtrace, from gdb, https://gist.github.com/hughperkins/10855efd242b0786c7dfc2aa4075e59a

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/hughperkins/tensorflow-cl/issues/34

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

hughperkins commented 7 years ago

@cathalgarvey So, my current theory is that, this line crashes:

      if (iu.func != nullptr) threadpool_.Schedule(iu.func);

which is effectively in a cleanup function, as you say:

  void FreeMemory(const ToFreeVector& to_free) {
    for (const auto& iu : to_free) {
      if (iu.mem != nullptr) {
        for (auto& t : *(iu.mem)) {
          t.Unref();
        }
        delete iu.mem;
      }
      if (iu.bufrec.buf) {
        if (LogMemory::IsEnabled()) {
          LogMemory::RecordRawDeallocation(iu.bufrec.operation,
                                           iu.bufrec.step_id, iu.bufrec.buf,
                                           iu.bufrec.alloc, false);
        }
        iu.bufrec.alloc->DeallocateRaw(iu.bufrec.buf);
      }
      // The function must be called in another thread.
      if (iu.func != nullptr) threadpool_.Schedule(iu.func);
    }
  }

(inside https://github.com/hughperkins/tensorflow-cl/blob/8a02ae2a3a731d2965208fb73ba143feb64a4324/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h#L110-L129 )

What I think is happening is that the value of iu.func is being clobbered/overwritten by some dodgy code somewhere (possibly in my cocl_events.cpp ? ). Then when the iu.func is deleted later, it tries to delete some non-existent, invalid function, from some non-existent part of memory. And, bam! segfault :-P

My plan to fix this is:

hughperkins commented 7 years ago

@cathalgarvey Bam! An example of where the function has changed:

[MEM]  ... done cuMemcpyHtoDAsync dst=22144 src=0x2efccd0 bytes=[COCL] cuEventQuery CoclEvent=40
0x7f4c1c000d50 clevent=0x7f4c1c0020b0
[COCL] clGetEventInfo: 0
[COCL] cuEventQuery, event completed
core/common_runtime/gpu/gpu_event_mgr.h FreeMemory()
         debugui iu=0x7f4c3d7f9ae0
         debugui origfn=         debugui iu=139964864007072
         debugui pre=0x7f4c34b52e60123
         debugui post=123
         debugui &iu.func=0x7f4c35b551a0
         debugui (char *)&iu.func=�Q�5L
         debugui (long *)(char *)&iu.func=0x7f4c35b551a0
         debugui *(long *)(char *)&iu.func=139965295317440
iu=0x7f4c3d7f9ae0 origfn=139964864007072 pre=123 func=139965295317440 post=123

This is with the following debugging stuff added to the tensorflow code: https://gist.github.com/hughperkins/39ad7e00fdd3a91f1af3fb9cc7a5a744

https://gist.github.com/hughperkins/39ad7e00fdd3a91f1af3fb9cc7a5a744#file-gistfile1-txt-L69-L72

     perftools::gputools::Event* event;
     TensorReferenceVector* mem;
     BufRec bufrec;
+    long funcOrig;
+    long long pre;
     const std::function<void()> &func;
+    long long post;
   };
ghost commented 7 years ago

So, this is a double-free, possibly by race condition? Is the origin in COCL or Tensorflow itself, do you think?

More reasons to love Rust, but is this fixable in C without loads of work on locking etc?

hughperkins commented 7 years ago

I might just throw in a mutex, on FreeMemory...

hughperkins commented 7 years ago

well, mutex on FreeMemory and QueueInUse didnt fix it :-P . But at least the debug output is easier to read :-)

hughperkins commented 7 years ago

Seems the function clobber very easy to read now though. ie , it changes from 140565421235040 to 140565798691264:

[COCL] cuEventRecord CoclEvent=0x7fd7f0001470 created clevent=0x7fd7f0001ba0
         debugui iu=0x7fd8067f9e60
         debugui origfn=140565421235040
         debugui pre=123
         debugui post=123
         debugui &iu.func=0x7fd8067fa1a0
         debugui (char *)&iu.func=`
                                            debugui (long *)(char *)&iu.func=0x7fd8067fa1a0
         debugui *(long *)(char *)&iu.func=140565421235040
    queueInUse event=0x7fd7f0000bd0 iu=0x7fd8067f9e60 origfn=140565421235040 pre=123 func=140565421235040 post=123
         debugui iu=0x1614b30
         debugui origfn=140565421235040
         debugui pre=123
         debugui post=123
         debugui &iu.func=0x7fd8067fa1a0
         debugui (char *)&iu.func=`
                                            debugui (long *)(char *)&iu.func=0x7fd8067fa1a0
         debugui *(long *)(char *)&iu.func=140565421235040
    queueInUse queued iu used_events[used_events.size() - 1] iu=0x1614b30 origfn=140565421235040 pre=123 func=140565421235040 post=123 used_events_.size() 1
    queueInUse after notify_all(): used_events_.size() 1
[COCL] cuEventQuery CoclEvent=0x7fd7f0001470 clevent=0x7fd7f0001ba0
[COCL] clGetEventInfo: 3
[COCL] cuEventQuery, event not finished yet
core/common_runtime/gpu/gpu_event_mgr.h FreeMemory()
[COCL] cuEventQuery CoclEvent=0x7fd7f0001470 clevent=0x7fd7f0001ba0
[MEM]  ... done cuMemcpyHtoDAsync dst=9344 src=0x1c02ec0 bytes=[COCL] clGetEventInfo: 400
[COCL] cuEventQuery, event completed
core/common_runtime/gpu/gpu_event_mgr.h FreeMemory()

         debugui iu=0x7fd835ffaae0
         debugui origfn=140565421235040
         debugui pre=123
         debugui post=123
         debugui &iu.func=0x7fd8067fa1a0
         debugui (char *)&iu.func=���
         debugui (long *)(char *)&iu.func=0x7fd8067fa1a0
         debugui *(long *)(char *)&iu.func=140565798691264
/home/ubuntu/git/tensorflow-cl/mycopyruntest: line 6:  4531 Segmentation fault      (core dumped) python tensorflow/stream_executor/cl/test/test_binary_ops.py
hughperkins commented 7 years ago

(nuance, the change in the function object might just be because of the std::move ...

https://github.com/hughperkins/tensorflow-cl/blob/tensorflow-cl/tensorflow/core/lib/core/threadpool.cc#L128

void ThreadPool::Schedule(std::function<void()> fn) {
  // std::cout << "ThreadPool::Schedule()" << std::endl;
  CHECK(fn != nullptr);
  impl_->Schedule(std::move(fn));
}
hughperkins commented 7 years ago

Made a diagram of the EventMgr system:

screen shot 2017-05-29 at 8 41 54 pm

Example debug output: https://gist.github.com/hughperkins/09110acfde7bcf550ee7c2f8dc56ebed

hughperkins commented 7 years ago

^^^ turns out this then crashes on Mac... :-(

hughperkins commented 7 years ago

Fixed in https://github.com/hughperkins/tf-coriander/compare/8a02ae2a3a731...af9f284bfa27deae . Yay .

ghost commented 7 years ago

When you put it that way, it sounds like the error was coming from all those debugging statements! :D

hughperkins commented 7 years ago

Haha :-) . Yes, tats true

hughperkins commented 7 years ago

I mean, its true that it gives that impression

hughperkins commented 7 years ago

updated a bit

ghost commented 7 years ago

Sorry, I didn't mean to make that sound like a complaint or criticism. Just laughing at myself for looking at all the commented-out print statements with confusion for a few seconds, before clicking the parent. :)

hughperkins commented 7 years ago

No, you're right. I was thinking the same thing. It was like 2am or so though, and I wanted to get the wheels out. I think the diff is a bit more informative now. There's still a bunch of cleaning I should do though, in that struct.

hughperkins commented 7 years ago

(hmmm, maybe the diff is not quite right. let me check that...)

hughperkins commented 7 years ago

(updated the diff. seems the second checksum should be the commit after the final change in fact)

hughperkins commented 7 years ago

In case you're interested, I did the following changes:

ghost commented 7 years ago

As related in #35, the most recent release no longer appears to segfault when closing an iPython session or running tests, for me on Ubuntu 16.04 with an R9 390 with AMDGPU-pro driver.

hughperkins commented 7 years ago

cool :-)