guilhermedemouraa opened this issue 5 months ago
@jakaskerl would you mind checking if we can reproduce the issue?
I think this will likely come down to a HW issue.
Sorry, I forgot to mention that I have two oaks connected to my laptop (oak0 and oak1). I have seen the same problem happening w/ both of them. Sometimes on boot I will get oak0/left but not oak1/left. Sometimes it's the other way around (I will get oak/1, but not oak/0).
The fact that it happens w/ both cameras makes me wonder if it's a hardware issue. Please let me know if there's any further information I can share to help you better understand the issue.
Can anyone please provide me with an update? @moratom @jakaskerl
Hi @guilhermedemouraa
Sorry for the late response. Hard to say what the actual issue is since the code looks ok.
I'd suggest updating to the latest depthai (2.27) and no longer using cam->setIsp3aFps(5), since it introduces more issues than it solves.
Failed to resolve stream: Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
This indicates a power issue; perhaps recheck the power source (injector/switch)? Change the source if you can.
If it is pipeline related, the issue is most likely caused by the feature tracker. Not sure what configuration you are using, but we have had issues with it before. The docs page states that the supported resolutions are 480p and 720p, whereas you are using 800p.
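E.g. something like this for the left camera (a sketch; the node and socket names are just an example):

#include <depthai/depthai.hpp>

// Sketch: feed the FeatureTracker a supported resolution (720p or 480p) instead of 800p.
dai::Pipeline buildPipeline() {
    dai::Pipeline pipeline;
    auto monoLeft = pipeline.create<dai::node::MonoCamera>();
    monoLeft->setBoardSocket(dai::CameraBoardSocket::CAM_B);
    monoLeft->setResolution(dai::MonoCameraProperties::SensorResolution::THE_720_P);
    return pipeline;
}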
Thanks, Jaka
Thanks for getting back to me, @jakaskerl. After some further testing, it seems that the real issue occurs when I "drop the camera" and then try to open it again. I wonder if there are recommendations/best practices for gracefully shutting down the device.
For more context, here's what I did:
I believe that somewhere in this process, the device crashed. In fact, I cannot even ping it.
Here are some logs from my service:
[184430103163C41200] [10.95.76.10] [1719965092.987] [host] [warning] Device crashed, but no crash dump could be extracted.
[WARN farm_ng_stream::events::topic_manager] Failed to resolve stream: Device already closed or disconnected: Input/output error
[INFO farm_ng_stream::service::oak_manager] Opening camera 10.95.76.10
[ERROR farm_ng_stream::service::event_grpc] No matching topics: "oak/0/left"
[DEBUG hyper::proto::h2::server] send response error: user error: unexpected frame type
[DEBUG hyper::proto::h2::server] stream error: http2 error: user error: unexpected frame type
[WARN farm_ng_stream::events::topic_manager] Failed to resolve stream: Cannot find any device with given deviceInfo
[INFO farm_ng_stream::service::oak_manager] Opening camera 10.95.76.10
[ERROR farm_ng_stream::service::event_grpc] No matching topics: "oak/0/left"
[DEBUG hyper::proto::h2::server] send response error: user error: unexpected frame type
The message "Device crashed, but no crash dump could be extracted" is at least weird.
To be clear, I'm not powering it down. The power source is never touched. However, my code that creates the pipeline and actively subscribes to the camera stream goes out of scope. So, to "reopen" the camera, I need to start the pipeline all over again...
I would appreciate a quick response on this issue. We have more than 400 oak cameras at farm-ng and can't afford to have them not working properly.
I am also happy to set up an offline meeting to explain in greater detail any questions you may have.
Hi @guilhermedemouraa
Close it using device::close(). Perhaps the destructor is not properly called in the service.
/**
* Explicitly closes connection to device.
* @note This function does not need to be explicitly called
* as destructor closes the device automatically
*/
void close();
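I.e., roughly (a sketch):

#include <depthai/depthai.hpp>

// Sketch: make sure the device is closed before trying to reopen the same camera.
void streamOnce(const dai::Pipeline& pipeline) {
    dai::Device device(pipeline);
    // ... subscribe to output queues and stream frames ...
    device.close();  // explicit; otherwise the destructor closes it once "device" goes out of scope
}
// After streamOnce() returns, it is safe to construct a new dai::Device for the same camera.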
Thanks, Jaka
I upgraded both the depthai SDK version (from v2.25.1 to v2.26.0) and the bootloader version (from 0.0.18 to 0.0.28) on both oak cameras connected to my device. The cameras seem to behave way better now!
However, from time to time I get the following error message:
Monitor thread (device: 18443010E147C31200 [10.95.76.10]) - ping was missed, closing the device connection
Despite this error, I can still ping the device from my system, and after restarting my service, I can re-establish the connection without needing to power cycle the device.
After some digging, I found that the error message I'm getting comes from depthai's watchdog timeout. Here's a relevant code snippet I found:
// Example code snippet from DeviceBase::init2
if(watchdogTimeout > std::chrono::milliseconds(0)) {
    // Watchdog thread setup
    watchdogThread = std::thread([this, watchdogTimeout]() {
        try {
            XLinkStream stream(connection, device::XLINK_CHANNEL_WATCHDOG, 128);
            std::vector<uint8_t> watchdogKeepalive = {0, 0, 0, 0};
            while(watchdogRunning) {
                stream.write(watchdogKeepalive);
                {
                    std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
                    lastWatchdogPingTime = std::chrono::steady_clock::now();
                }
                // Ping with a period half of that of the watchdog timeout
                std::this_thread::sleep_for(watchdogTimeout / 2);
            }
        } catch(const std::exception& ex) {
            // ignore
            pimpl->logger.debug("Watchdog thread exception caught: {}", ex.what());
        }
        // Watchdog ended. Useful for checking disconnects
        watchdogRunning = false;
    });

    // Monitor thread setup
    monitorThread = std::thread([this, watchdogTimeout]() {
        while(watchdogRunning) {
            // Ping with a period half of that of the watchdog timeout
            std::this_thread::sleep_for(watchdogTimeout);
            // Check if wd was pinged in the specified watchdogTimeout time.
            decltype(lastWatchdogPingTime) prevPingTime;
            {
                std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
                prevPingTime = lastWatchdogPingTime;
            }
            // Recheck if watchdogRunning wasn't already closed and close if more than twice of WD passed
            if(watchdogRunning && std::chrono::steady_clock::now() - prevPingTime > watchdogTimeout * 2) {
                pimpl->logger.warn("Monitor thread (device: {} [{}]) - ping was missed, closing the device connection", deviceInfo.mxid, deviceInfo.name);
                // ping was missed, reset the device
                watchdogRunning = false;
                // close the underlying connection
                connection->close();
            }
        }
    });
}
I am looking for guidance on:
- How to programmatically restart the connection if it gets dropped.
- Any potential configurations or environmental adjustments that could help mitigate this issue.
Thanks again @jakaskerl
On a different note, I wanted to clarify that I can't upgrade my depthai SDK version to v2.27.0, because this release relies on the latest version of hunter, which is not compatible with aarch64-linux-gnu-gcc.
Could you provide any information on when the new hunter will be compatible with aarch64-linux-gnu-gcc, or whether DepthAI can create a patch for me with the latest stable version of hunter that supports aarch64-linux-gnu-gcc?
Thank you for your assistance!
@guilhermedemouraa
WRT the watchdog, you may try increasing it from 4 s to 4.5 s or disabling it with the following env variables:
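I believe these are DEPTHAI_WATCHDOG (timeout in milliseconds; 0 disables it) and DEPTHAI_WATCHDOG_INITIAL_DELAY; a minimal sketch of setting them from C++ before the device is constructed (please double-check the names against the current depthai docs):

#include <cstdlib>
#include <depthai/depthai.hpp>

int main() {
    // Assumed behaviour: DEPTHAI_WATCHDOG sets the watchdog timeout in ms (0 disables it)
    // and must be set before the dai::Device is constructed.
    setenv("DEPTHAI_WATCHDOG", "4500", 1);   // raise from the 4000 ms PoE default to 4500 ms
    // setenv("DEPTHAI_WATCHDOG", "0", 1);   // or disable the watchdog entirely

    dai::Pipeline pipeline;
    dai::Device device(pipeline);  // watchdog settings are read at this point
    return 0;
}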
> Programmatically restart the connection if it gets dropped.

This can be done manually by catching any exceptions and restarting the pipeline/program flow to boot the device again (see the sketch at the end of this comment).

> Any potential configurations or environmental adjustments that could help mitigate this issue.

If it's not crash induced, then this likely comes from high network congestion. See if it is possible to remove other traffic, make sure ETH is 1 Gbit through the whole network, etc.

> Could you provide any information on when the new hunter will be compatible with aarch64-linux-gnu-gcc, or if DepthAI can create a patch for me with the latest stable version of hunter that supports aarch64-linux-gnu-gcc?

We'd gladly take a look at this - do you have any repro steps already that make it fail?
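A minimal sketch of that restart loop, assuming a blocking get() on the output queue (the pipeline helper, stream name, and IP are placeholders for your own code):

#include <chrono>
#include <thread>
#include <depthai/depthai.hpp>

dai::Pipeline buildPipeline();  // placeholder for your existing pipeline setup

int main() {
    while(true) {
        try {
            dai::Device device(buildPipeline(), dai::DeviceInfo("10.95.76.10"));
            auto left = device.getOutputQueue("left", 8, false);
            while(true) {
                auto frame = left->get<dai::ImgFrame>();  // throws once the connection is lost
                // ... forward the frame to the gRPC stream ...
            }
        } catch(const std::exception& ex) {
            // boot failure (e.g. X_LINK_DEVICE_NOT_FOUND) or a dropped connection lands here
        }
        // The device went out of scope above, so its destructor closed the connection;
        // wait a moment and try to boot it again.
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}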
Hello DepthAI team,
Thank you for the initial suggestions regarding the watchdog and network settings. I wanted to follow up with a bit more detail on how our system is structured and request further guidance on handling dropped device scenarios.
In our application, we register a callback for new frames using DepthAI's API. This callback is triggered whenever a new packet/frame is received. However, this design leaves me uncertain about how to catch exceptions or handle a dropped device event.
Specifically, I'm looking for recommendations on how to implement reconnection logic in a situation where the device is dropped, but no new frames are being received (i.e., the callback is not triggered). Since I rely heavily on the callback to process data, I am unsure where or how to catch any exceptions that might arise due to a device being dropped or encountering other issues. I found this resource online, but I don't think this could work for the callback approach.
Could you provide any examples or guidance on how to detect and handle these cases within the callback system or elsewhere in the data flow? Additionally, if there are best practices for managing reconnections, I would greatly appreciate your insights.
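To make the question concrete, this is roughly the pattern I have in mind (a sketch; buildPipeline() is a placeholder, and I'm assuming Device::isClosed() and the DataOutputQueue::addCallback overload taking a std::shared_ptr<dai::ADatatype> are the right hooks):

#include <atomic>
#include <chrono>
#include <memory>
#include <thread>
#include <depthai/depthai.hpp>

dai::Pipeline buildPipeline();  // placeholder for our pipeline setup

int main() {
    using Clock = std::chrono::steady_clock;

    while(true) {
        // Timestamp of the last frame delivered by the callback (declared before the
        // device so it outlives the callback during device teardown).
        std::atomic<Clock::rep> lastFrame{Clock::now().time_since_epoch().count()};
        try {
            dai::Device device(buildPipeline(), dai::DeviceInfo("10.95.76.10"));
            auto queue = device.getOutputQueue("left", 8, false);
            queue->addCallback([&lastFrame](std::shared_ptr<dai::ADatatype> data) {
                lastFrame = Clock::now().time_since_epoch().count();
                // ... hand the frame off to the gRPC streaming code ...
            });

            // Supervision loop: leave when the device reports closed or frames stop arriving.
            while(!device.isClosed()) {
                std::this_thread::sleep_for(std::chrono::seconds(1));
                auto last = Clock::time_point(Clock::duration(lastFrame.load()));
                if(Clock::now() - last > std::chrono::seconds(5)) {
                    break;  // no frames for 5 s, treat the device as dropped
                }
            }
        } catch(const std::exception& ex) {
            // the device failed to boot or the connection died; fall through and retry
        }
        // Leaving the try block destroys the device (its destructor closes it); retry after a pause.
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}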
Thank you for your continued support.
cc @themarpe @jakaskerl
Hi @themarpe and @jakaskerl,
Just following up on my previous message. Any guidance on how to implement a robust reconnection logic using the callback approach?
Describe the bug
I'm experiencing intermittent issues with streaming data from the left camera of OAK-D W PoE cameras. The left camera data is not always present, and the device sometimes fails to be recognized upon service restart. This is the log I get from my service:
Failed to resolve stream: Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Some other times, I get this log:
[1844301051BEC41200] [10.95.76.11] [1718734907.447] [host] [warning] There was a fatal error. Crash dump saved to /tmp/depthai_LR3TVl/1844301051BEC41200-depthai_crash_dump.json
I'm attaching the crash reports here, but they all seem like code bugs.
Setup Details:
Streaming Configuration:
Issues Encountered:
Intermittent Left Camera Data:
Occasionally, after rebooting the PC, the left camera stream works without any issues.
Minimal Reproducible Example
It's hard to add everything here: I have a gRPC server that streams the oak topics and a Python gRPC client that subscribes to them. Here's how I created my dai pipeline:
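(A simplified sketch of the relevant part, not the verbatim code: a left MonoCamera at 800p feeding a FeatureTracker, with both outputs sent over XLinkOut.)

#include <depthai/depthai.hpp>

dai::Pipeline createPipeline() {
    dai::Pipeline pipeline;

    auto monoLeft = pipeline.create<dai::node::MonoCamera>();
    monoLeft->setBoardSocket(dai::CameraBoardSocket::CAM_B);
    monoLeft->setResolution(dai::MonoCameraProperties::SensorResolution::THE_800_P);
    // monoLeft->setIsp3aFps(5);  // tried with and without this

    auto featureTracker = pipeline.create<dai::node::FeatureTracker>();
    monoLeft->out.link(featureTracker->inputImage);

    auto xoutLeft = pipeline.create<dai::node::XLinkOut>();
    xoutLeft->setStreamName("left");
    featureTracker->passthroughInputImage.link(xoutLeft->input);

    auto xoutFeatures = pipeline.create<dai::node::XLinkOut>();
    xoutFeatures->setStreamName("features");
    featureTracker->outputFeatures.link(xoutFeatures->input);

    return pipeline;
}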
I've tried both with and without cam->setIsp3aFps(5).

Expected behavior
I expect the left camera data to be consistently available and the device to be reliably recognized upon service restart.
Attach system log
depthai_DX4bJu_1844301051BEC41200-depthai_crash_dump.json
depthai_tYFDEE_18443010E147C31200-depthai_crash_dump.json
depthai_Zgzmrx_1844301051BEC41200-depthai_crash_dump.json

Additional context
Described above...