google-coral / libedgetpu

Source code for the userspace level runtime driver for Coral.ai devices.
Apache License 2.0
181 stars 62 forks source link

Possible deadlock #9

Open YijinLiu opened 3 years ago

YijinLiu commented 3 years ago

My program is experiencing deadlock in libedgetpu occasionally. Following is the stack trace.

Thread 58 (Thread 0x7fff7d7fa700 (LWP 24554)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x7fff7d7f9170) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fff7d7f9120, cond=0x7fff7d7f9148) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fff7d7f9148, mutex=0x7fff7d7f9120) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea8f9a0e in std::condition_variable::wait >(std::unique_lock &, platforms::darwinn::BlockingCounter::) (this=0x7fff7d7f9148, 
    __lock=..., __p=...) at /usr/include/c++/5/condition_variable:98
#5  0x00007fffea8f99c4 in platforms::darwinn::BlockingCounter::Wait (this=0x7fff7d7f9120) at port/blocking_counter.cc:42
#6  0x00007fffea8ce763 in platforms::darwinn::driver::Driver::Execute (this=0xc1a6700, request=std::shared_ptr (empty) = {...}) at driver/driver.cc:593
#7  0x00007fffea8b9f0a in platforms::darwinn::tflite::EdgeTpuDriverWrapper::InvokeExecutable (this=0xbdf57a0, context=0xc087628, node=0xc13f088) at tflite/edgetpu_context_direct.cc:240
#8  0x00007fffea8ac653 in platforms::darwinn::tflite::(anonymous namespace)::CustomOpInvoke (context=0xc087628, node=0xc13f088) at tflite/custom_op_direct.cc:122
#9  0x00000000068df998 in tflite::impl::Subgraph::Invoke() ()
#10 0x0000000006830e87 in tflite::impl::Interpreter::Invoke() ()

Thread 32 (Thread 0x7fffb4cf5700 (LWP 24515)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0xc1a6a14) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc19a4b0, cond=0xc1a69e8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc1a69e8, mutex=0xc19a4b0) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea87517f in std::_V2::condition_variable_any::wait (this=0xc1a69e8, __lock=...) at /usr/include/c++/5/condition_variable:250
#5  0x00007fffea871789 in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc1a6700) at driver/usb/usb_driver.cc:1322
#6  0x00007fffea8728c7 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xc1cb358) at driver/usb/usb_driver.cc:1604
#7  0x00007fffea878834 in std::_Bind_simple()>::_M_invoke<>(std::_Index_tuple<>) (this=0xc1cb358) at /usr/include/c++/5/functional:1531
#8  0x00007fffea8787b5 in std::_Bind_simple()>::operator()(void) (this=0xc1cb358) at /usr/include/c++/5/functional:1520
#9  0x00007fffea87875c in std::thread::_Impl()> >::_M_run(void) (this=0xc1cb340) at /usr/include/c++/5/thread:115
#10 0x00007ffff78baa50 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff75e56db in start_thread (arg=0x7fffb4cf5700) at pthread_create.c:463
#12 0x00007fffea0d171f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I use one thread to run the inference. This only happens about once per week. I haven't find a way to repro it so far.

Namburger commented 3 years ago

Hello, can you describe what exactly happens and how is your program designed?

YijinLiu commented 3 years ago

We run a SSD model using edgetpu. A single thread receives images and run the model. Once a while, the thread stuck forever. The initial post has the stack trace. Unfortunately, I have no luck finding a way to repro it so far. Let me know if you need more information.

Namburger commented 3 years ago

@YijinLiu I see, may I ask if this is over usb3 or usb2?

YijinLiu commented 3 years ago

It's usb3. Thanks.

manoj7410 commented 3 years ago

Feel free to reopen if this is still an issue.

YijinLiu commented 2 years ago

Yes, it is. This is awkward.. I answered one of your questions and wait for several months. Only to find you closed it w/o giving any reason. I hope this is not how you guys usually interact with developers.

Anyway, I have bit more information:

I turned on logging and got a lot of errors like:

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]

And the thread was stuck here:

#0  futex_wait_cancelable (private=, expected=0, futex_word=0xc95660c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xcf13f90, cond=0xc9565e0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0xc9565e0, mutex=0xcf13f90) at pthread_cond_wait.c:638
#3  0x00007ffff7a610fc in std::condition_variable::wait(std::unique_lock&) () from bin/libstdc++.so.6
#4  0x00007fffeb8fca38 in std::_V2::condition_variable_any::wait (this=0xc9565e0, __lock=...) at /usr/include/c++/7/condition_variable:263
#5  0x00007fffeb8f82cc in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc956300) at driver/usb/usb_driver.cc:1320
#6  0x00007fffeb8f9a15 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xce69628) at driver/usb/usb_driver.cc:1599
#7  0x00007fffeb8fe28d in std::__invoke_impl >(std::__invoke_other, platforms::darwinn::driver::UsbDriver:: &&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#8  0x00007fffeb8fcbaf in std::__invoke >(platforms::darwinn::driver::UsbDriver:: &&) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#9  0x00007fffeb8ffba6 in std::thread::_Invoker > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0xce69628) at /usr/include/c++/7/thread:234
#10 0x00007fffeb8ffb77 in std::thread::_Invoker > >::operator()(void) (this=0xce69628) at /usr/include/c++/7/thread:243
#11 0x00007fffeb8ffb56 in std::thread::_State_impl > > >::_M_run(void) (this=0xce69620) at /usr/include/c++/7/thread:186
#12 0x00007ffff7a66a00 in ?? () from bin/libstdc++.so.6
#13 0x00007ffff796a609 in start_thread (arg=) at pthread_create.c:477
#14 0x00007fffeb723293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
YijinLiu commented 2 years ago

Well, even more weird: I am not allowed to reopen it!

manoj7410 commented 2 years ago

@YijinLiu Issue reopened. Can you please share any steps to reproduce ? Which platform are you working on ?

YijinLiu commented 2 years ago

I cannot find a way to repro it. We have a Linux program running detection on security cameras all the time. It may happen once every week.

Do you think these "USB transfer error" are normal?

manoj7410 commented 2 years ago

@YijinLiu What is the power source that you are using to boot your machine/platform ? How many USB Accelerators are connected with the machine ? Is there any other USB device connected as well ?

YijinLiu commented 2 years ago

It's a desktop machine, with CPU i7-10700, running ubuntu 20, with kernel 5.11. I only have one USB accelerator connected to the machine.

We have users with many other configurations that are experiencing the same issue.

manoj7410 commented 2 years ago

I'll try to run a demo on my machine which has Core i7. Will let you know if I am able to repro this issue at my end. Meanwhile, if you can find some specific steps to repro this then that will be really helpful.

YijinLiu commented 2 years ago

I think I found a bug in the code. I am not sure whether it's the cause for this deadlock though. Testing it now, we will know after a week..

diff --git a/driver/usb/usb_driver.cc b/driver/usb/usb_driver.cc
index 78beda4..1dcac42 100644
--- a/driver/usb/usb_driver.cc
+++ b/driver/usb/usb_driver.cc
@@ -1299,7 +1299,7 @@ void UsbDriver::WorkerThreadFunc() {
         }
       }

-      reevaluation_needed = ProcessIo().ValueOrDie();
+      if (ProcessIo().ValueOrDie()) reevaluation_needed = true;

       // TODO: Enter kPaused state when dma_scheduler_.IsEmpty(). Any
       // new task should kick the driver back to kOpen state. Note this is in
@@ -1311,7 +1311,7 @@ void UsbDriver::WorkerThreadFunc() {
     } else {
       StdCondMutexLock queue_lock(&callback_mutex_);

-      Lock2 unlock_both(state_lock, queue_lock);
+      Lock2 unlock_both(queue_lock, state_lock);

       if (callback_queue_.empty()) {
         VLOG(10) << StringPrintf("%s waiting on state change", __func__);

YijinLiu commented 2 years ago

There is another issue, a likely contention:

manoj7410 commented 2 years ago

Thank you for sharing these details. What all files were modified to solve the issue ?

chao-camect commented 2 years ago

Besides the diffs I posted previously, here are two extra changes:

diff --git a/driver/beagle/beagle_usb_driver_provider.cc b/driver/beagle/beagle_usb_driver_provider.cc
index dcfd804..2b82a00 100644
--- a/driver/beagle/beagle_usb_driver_provider.cc
+++ b/driver/beagle/beagle_usb_driver_provider.cc
@@ -105,7 +105,7 @@ ABSL_FLAG(bool, usb_enable_bulk_descriptors_from_device,
 ABSL_FLAG(bool, usb_enable_processing_of_hints,
           GetEnv("USB_ENABLE_PROCESSING_OF_HINTS", true),
           "USB set to true for driver to proactively send data to device.");
-ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 6000),
+ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 0),
           "USB timeout in milliseconds");
 ABSL_FLAG(bool, usb_reset_back_to_dfu_mode,
           GetEnv("USB_RESET_BACK_TO_DFU_MODE", false),
@@ -135,7 +135,7 @@ ABSL_FLAG(bool, usb_enable_overlapping_requests,
           "the current one.");
 ABSL_FLAG(bool, usb_enable_overlapping_bulk_in_and_out,
           GetEnv("USB_ENABLE_OVERLAPPING_BULK_IN_AND_OUT", true),
-          "Allows bulk-in trasnfer to be submitted before previous bulk-out "
+          "Allows bulk-in transfer to be submitted before previous bulk-out "
           "requests complete.");
 ABSL_FLAG(bool, usb_enable_queued_bulk_in_requests,
           GetEnv("USB_ENABLE_QUEUED_BULK_IN_REQUESTS", true),
diff --git a/driver/usb/usb_driver.h b/driver/usb/usb_driver.h
index 182f27e..d3580a9 100644
--- a/driver/usb/usb_driver.h
+++ b/driver/usb/usb_driver.h
@@ -151,7 +151,7 @@ class UsbDriver : public Driver {
     bool usb_fail_if_slower_than_superspeed{false};

     // General timeout for USB operations in milliseconds.
-    int usb_timeout_millis{6000};
+    int usb_timeout_millis{0};

     // If non-empty, the firmware image to use for automatic DFU.
     // This feature is only available when a device factory has been supplied.
manoj7410 commented 2 years ago

@YijinLiu Are these changes made to the generic Linux drivers?

chao-camect commented 2 years ago

It's libedgetpu. You can find the filenames in the diff...

manoj7410 commented 2 years ago

@chao-camect Got it. We are still not able to reproduce this. However, can you create a patch of the complete code change and submit for review?

chao-camect commented 2 years ago

To repro it, you'll need to add random sleep time between two calls to USB Coral. Since the default timeout is 6s, the sleep time need to be around that to replicate. I'll find some time to create the PR.

xxs1989 commented 1 year ago

Hey, I got the same issue occasional happened on my coral mini as :

I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1138] bulk in 8 bytes from buffer index [5]
I driver/dma_chunker.cc:30] Completed 0 bytes; Outstanding 0 bytes; Processing next 8 bytes
I driver/usb/usb_driver.cc:916] [3-3] bulk in for 8 bytes has yielded 8 bytes from index [5]
I driver/usb/usb_driver.cc:1310] WorkerThreadFunc re-evaluation is needed
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 0 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [5]
I driver/usb/local_usb_device.cc:748] AsyncBulkInTransfer
I driver/usb/local_usb_device.cc:761] ASYNC IN 1 begin
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[3]
I driver/usb/usb_driver.cc:550] IO completed
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[4]
I driver/single_tpu_request.cc:410] [313] NotifyCompletion()
I driver/package_registry.cc:658] Returned instruction buffers back to executable reference
I driver/single_tpu_request.cc:478] [313] SetState old=3, new=4.
I driver/single_queue_dma_scheduler.cc:234] Request[313]: Completed
I driver/single_tpu_request.cc:96] [313] Request destroyed.
I driver/usb/usb_driver.cc:550] IO completed
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 3 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread

I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:1232] WorkerThreadFunc Re-installing interrupt reader
I driver/usb/local_usb_device.cc:785] AsyncInterruptInTransfer
I driver/usb/local_usb_device.cc:798] ASYNC IN 3 begin
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 1 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [6]

And I am using python to run the 'edgetpu.run_inference()', the behavior is pretty un-consistent, seems have relation with different USB power supply, so is there any USB power requirements for the coral mini running tpu? please give any debug or solution advice, thanks!

hjonnala commented 1 year ago

Hi @xxs1989 the board requires 5V/2A power supply. Please check the section 5.1 for more details at: https://coral.ai/static/files/Coral-Dev-Board-Mini-datasheet.pdf. Thanks!