luxonis / depthai-core

DepthAI C++ Library
MIT License

[BUG] Intermittent Issues with OAK-D W PoE Cameras Streaming Left Camera Data #1042

Open guilhermedemouraa opened 5 months ago

guilhermedemouraa commented 5 months ago

Describe the bug

I'm experiencing intermittent issues with streaming data from the left camera of OAK-D W PoE cameras. The left camera data is not always present, and the device sometimes fails to be recognized upon service restart. This is the log I get from my service:

Failed to resolve stream: Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND

Other times, I get this log:

[1844301051BEC41200] [10.95.76.11] [1718734907.447] [host] [warning] There was a fatal error. Crash dump saved to /tmp/depthai_LR3TVl/1844301051BEC41200-depthai_crash_dump.json

I'm attaching the crash reports here, but they all seem like code bugs.

Setup Details:

Streaming Configuration:

Issues Encountered:

Intermittent Left Camera Data:

Occasionally, after rebooting the PC, the left camera stream works without any issues.

Minimal Reproducible Example

It's hard to include everything here: I have a gRPC server that streams the OAK topics and a Python gRPC client that subscribes to them. Here's how I created my DepthAI pipeline:

void add_to_pipeline_mono(
    std::shared_ptr<dai::Pipeline> pipeline,
    std::shared_ptr<dai::node::MonoCamera> cam,
    const MonoStreamOptions &options,
    std::shared_ptr<dai::node::XLinkIn> xin_tracked_features_config) {

  auto xout_img = pipeline->create<dai::node::XLinkOut>();
  xout_img->setStreamName(std::string(options.queue_name));

  // As per depthai docs, reducing Isp3a rate (it defaults to capture fps)
  // reduces CPU load and seems to help with performance, especially when
  // performing feature tracking.
  cam->setIsp3aFps(5);
  cam->setBoardSocket(to_camera_board_socket(options.camera_board_socket));
  cam->setResolution(to_sensor_resolution(options.resolution));
  cam->setFps(options.fps);

  cam->out.link(xout_img->input); // Link raw output directly

  if (options.enabled && xin_tracked_features_config) {
    auto xout_tracked_features = pipeline->create<dai::node::XLinkOut>();
    xout_tracked_features->setStreamName(std::string(options.tracked_features_queue_name));

    // Set number of shaves and number of memory slices to maximum as per
    // depthai documentation to ensure good performance.
    auto feature_tracker = pipeline->create<dai::node::FeatureTracker>();
    feature_tracker->setHardwareResources(2, 2);

    cam->out.link(feature_tracker->inputImage);
    feature_tracker->outputFeatures.link(xout_tracked_features->input);

    xin_tracked_features_config->out.link(feature_tracker->inputConfig);
  }
}

I've tried both with and without cam->setIsp3aFps(5);.

Expected behavior

I expect the left camera data to be consistently available and the device to be reliably recognized upon service restart.

Attach system log

depthai_DX4bJu_1844301051BEC41200-depthai_crash_dump.json
depthai_tYFDEE_18443010E147C31200-depthai_crash_dump.json
depthai_Zgzmrx_1844301051BEC41200-depthai_crash_dump.json

Additional context

Described above...

moratom commented 4 months ago

@jakaskerl would you mind checking if we can reproduce the issue?

I think this will likely come down to a hardware issue.

guilhermedemouraa commented 4 months ago

Sorry, I forgot to mention that I have two OAKs connected to my laptop (oak0 and oak1). I have seen the same problem with both of them. Sometimes on boot I get oak0/left but not oak1/left; sometimes it's the other way around (oak1/left but not oak0/left).

The fact that it happens with both cameras makes me wonder whether it's really a hardware issue. Please let me know if there's any further information I can share to help you better understand the issue.

guilhermedemouraa commented 4 months ago

Can anyone please provide me with an update? @moratom @jakaskerl

jakaskerl commented 4 months ago

Hi @guilhermedemouraa, sorry for the late response. It's hard to say what the actual issue is, since the code looks OK. I'd suggest updating to the latest depthai (2.27) and removing cam->setIsp3aFps(5), since it introduces more issues than it solves.

Failed to resolve stream: Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND

This indicates a power issue. Could you recheck the power source (injector/switch), and change it if you can?

If the problem is pipeline-related, it is most likely caused by the feature tracker. I'm not sure what configuration you are using, but we have had issues with it before. The docs page states that the supported resolutions are 480p and 720p, whereas you are using 800p.

Thanks, Jaka

guilhermedemouraa commented 4 months ago

Thanks for getting back to me, @jakaskerl. After some further testing, it seems that the real issue occurs when I "drop the camera" and then try to open it again. I wonder if there are recommendations/best practices for gracefully shutting down the device.

For more context, here's what I did:

I believe that somewhere in this process, the device crashed. In fact, I cannot even ping it.

Here are some logs from my service:

[184430103163C41200] [10.95.76.10] [1719965092.987] [host] [warning] Device crashed, but no crash dump could be extracted.
[WARN  farm_ng_stream::events::topic_manager] Failed to resolve stream: Device already closed or disconnected: Input/output error
[INFO  farm_ng_stream::service::oak_manager] Opening camera 10.95.76.10
[ERROR farm_ng_stream::service::event_grpc] No matching topics: "oak/0/left"
[DEBUG hyper::proto::h2::server] send response error: user error: unexpected frame type
[DEBUG hyper::proto::h2::server] stream error: http2 error: user error: unexpected frame type
[WARN  farm_ng_stream::events::topic_manager] Failed to resolve stream: Cannot find any device with given deviceInfo
[INFO  farm_ng_stream::service::oak_manager] Opening camera 10.95.76.10
[ERROR farm_ng_stream::service::event_grpc] No matching topics: "oak/0/left"
[DEBUG hyper::proto::h2::server] send response error: user error: unexpected frame type

The message "Device crashed, but no crash dump could be extracted" is odd, to say the least.

To be clear, I'm not powering it down. The power source is never touched. However, the code that creates the pipeline and actively subscribes to the camera stream goes out of scope, so to "reopen" the camera I need to start the pipeline all over again.

I would appreciate your fast communication on this issue. We have more than 400 oak cameras at farm-ng and can't afford to have them not working properly.

guilhermedemouraa commented 4 months ago

I am also happy to set up an offline meeting to explain in greater detail any questions you may have.

jakaskerl commented 4 months ago

Hi @guilhermedemouraa, close it explicitly using Device::close().

Perhaps the destructor is not properly called in the service.

/**
 * Explicitly closes connection to device.
 * @note This function does not need to be explicitly called
 * as destructor closes the device automatically
 */
void close();
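
One way the destructor can fail to run in a long-lived service is shared ownership: if a callback or cache still holds a std::shared_ptr to the device, leaving the scope does not destroy it. The sketch below is only an illustration of calling close() explicitly from a scope guard. FakeDevice is a hypothetical stand-in so the example compiles without depthai; it assumes the real device type offers close() (quoted above) and an isClosed() query, and DeviceCloser is a name invented here.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real device class (assumption: the real type
// exposes close() and isClosed() with these shapes).
struct FakeDevice {
    bool closed = false;
    void close() { closed = true; }
    bool isClosed() const { return closed; }
};

// Scope guard (hypothetical helper): calls close() when it leaves scope,
// even if other shared_ptr owners keep the device object alive.
template <typename Device>
class DeviceCloser {
   public:
    explicit DeviceCloser(std::shared_ptr<Device> device) : device_(std::move(device)) {}
    ~DeviceCloser() {
        if(device_ && !device_->isClosed()) device_->close();
    }
    DeviceCloser(const DeviceCloser&) = delete;
    DeviceCloser& operator=(const DeviceCloser&) = delete;

   private:
    std::shared_ptr<Device> device_;
};

// Simulates a streaming session that ends while another component
// still owns the device handle.
inline bool sessionLeavesDeviceClosed() {
    auto device = std::make_shared<FakeDevice>();
    auto lingering = device;  // e.g. a callback that captured the handle
    {
        DeviceCloser<FakeDevice> guard(device);
        // ... stream frames here ...
    }  // guard runs close() here; the destructor of FakeDevice has not run yet
    return lingering->isClosed();
}
```

With a guard like this, the connection is closed at a known point in the service's lifecycle rather than whenever the last owner happens to release the handle.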

Thanks, Jaka

guilhermedemouraa commented 4 months ago

Update

I upgraded both the depthai SDK version (from v2.25.1 to v2.26.0) and the bootloader version (from 0.0.18 to 0.0.28) on both oak cameras connected to my device. The cameras seem to behave way better now!

However, from time to time I get the following error message:

Monitor thread (device: 18443010E147C31200 [10.95.76.10]) - ping was missed, closing the device connection

Despite this error, I can still ping the device from my system, and after restarting my service, I can re-establish the connection without needing to power cycle the device.

After some digging, I found that the error message comes from depthai's watchdog timeout. Here's the relevant code snippet:

// Example code snippet from DeviceBase::init2
if(watchdogTimeout > std::chrono::milliseconds(0)) {
    // Watchdog thread setup
    watchdogThread = std::thread([this, watchdogTimeout]() {
        try {
            XLinkStream stream(connection, device::XLINK_CHANNEL_WATCHDOG, 128);
            std::vector<uint8_t> watchdogKeepalive = {0, 0, 0, 0};
            while(watchdogRunning) {
                stream.write(watchdogKeepalive);
                {
                    std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
                    lastWatchdogPingTime = std::chrono::steady_clock::now();
                }
                // Ping with a period half of that of the watchdog timeout
                std::this_thread::sleep_for(watchdogTimeout / 2);
            }
        } catch(const std::exception& ex) {
            // ignore
            pimpl->logger.debug("Watchdog thread exception caught: {}", ex.what());
        }

        // Watchdog ended. Useful for checking disconnects
        watchdogRunning = false;
    });

    // Monitor thread setup
    monitorThread = std::thread([this, watchdogTimeout]() {
        while(watchdogRunning) {
            // Ping with a period half of that of the watchdog timeout
            std::this_thread::sleep_for(watchdogTimeout);
            // Check if wd was pinged in the specified watchdogTimeout time.
            decltype(lastWatchdogPingTime) prevPingTime;
            {
                std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
                prevPingTime = lastWatchdogPingTime;
            }
            // Recheck if watchdogRunning wasn't already closed and close if more than twice of WD passed
            if(watchdogRunning && std::chrono::steady_clock::now() - prevPingTime > watchdogTimeout * 2) {
                pimpl->logger.warn("Monitor thread (device: {} [{}]) - ping was missed, closing the device connection", deviceInfo.mxid, deviceInfo.name);
                // ping was missed, reset the device
                watchdogRunning = false;
                // close the underlying connection
                connection->close();
            }
        }
    });
}
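
To make the timing in the snippet above concrete, here is a small stand-alone sketch of the two rules it applies: keepalives are written every watchdogTimeout / 2, and the monitor closes the connection only when the last successful write is older than watchdogTimeout * 2. The function names are invented for illustration, and the 4000 ms value in the usage note is taken from the "4 s" figure mentioned later in this thread.

```cpp
#include <cassert>
#include <chrono>

using WdClock = std::chrono::steady_clock;

// Period between keepalive writes, as in the watchdog thread above.
inline std::chrono::milliseconds keepalivePeriod(std::chrono::milliseconds watchdogTimeout) {
    return watchdogTimeout / 2;
}

// Mirrors the monitor-thread condition above:
//   now - lastWatchdogPingTime > watchdogTimeout * 2
inline bool pingMissed(WdClock::time_point lastPing,
                       WdClock::time_point now,
                       std::chrono::milliseconds watchdogTimeout) {
    return now - lastPing > watchdogTimeout * 2;
}
```

With a 4000 ms timeout, keepalives go out every 2000 ms and pingMissed only becomes true once more than 8000 ms have passed since the last write returned. So the warning suggests that stream.write() was blocked for several seconds, not that a single packet was lost.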

Assistance Requested

I am looking for guidance on how to:

  1. Prevent the device from shutting down due to missed pings.
  2. Programmatically restart the connection if it gets dropped.
  3. Any potential configurations or environmental adjustments that could help mitigate this issue.

Thanks again @jakaskerl

guilhermedemouraa commented 4 months ago

On a different note, I wanted to clarify that I can't upgrade my depthai SDK version to v2.27.0 because this release relies on the latest version of hunter, which is not compatible with aarch64-linux-gnu-gcc.

Could you provide any information on when the new hunter will be compatible with aarch64-linux-gnu-gcc, or if DepthAI can create a patch for me with the latest stable version of hunter that supports aarch64-linux-gnu-gcc?

Thank you for your assistance!

themarpe commented 4 months ago

@guilhermedemouraa

WRT the watchdog, you may try increasing it from 4 to 4.5s, or disabling it, with the following env variables:

Programmatically restart the connection if it gets dropped.

This can be done manually by catching any exceptions and restarting the pipeline/program flow to again boot the device.
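
A minimal sketch of that catch-and-restart flow, assuming the whole "boot device, build pipeline, stream" sequence is wrapped in one callable that throws when the link drops. runWithReconnect, its parameters, and the backoff policy are hypothetical and not part of depthai.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>

// Runs `session` until it returns normally, restarting it after any exception.
// `session` is a hypothetical callable that boots the device, builds the
// pipeline and blocks while streaming; it is expected to throw on a drop.
// Returns the number of restarts that were needed, or -1 if maxRestarts
// was exceeded.
inline int runWithReconnect(const std::function<void()>& session,
                            int maxRestarts,
                            std::chrono::milliseconds initialBackoff,
                            std::chrono::milliseconds maxBackoff) {
    auto backoff = initialBackoff;
    for(int restarts = 0; restarts <= maxRestarts; ++restarts) {
        try {
            session();
            return restarts;
        } catch(const std::exception&) {
            // Device dropped or failed to boot: wait, then retry with a
            // longer delay so a rebooting device is not hammered.
            std::this_thread::sleep_for(backoff);
            backoff = std::min(backoff * 2, maxBackoff);
        }
    }
    return -1;
}
```

Constructing the device inside the callable means each retry starts from a fresh connection, and the previous one is released when the exception unwinds the scope.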

Any potential configurations or environmental adjustments that could help mitigate this issue.

If it's not crash-induced, then this likely comes from high network congestion. See if it is possible to remove other traffic, make sure ETH is 1Gbit through the whole network, etc.


Could you provide any information on when the new hunter will be compatible with aarch64-linux-gnu-gcc, or if DepthAI can create a patch for me with the latest stable version of hunter that supports aarch64-linux-gnu-gcc?

We'd gladly take a look at this. Do you already have repro steps that make it fail?

guilhermedemouraa commented 1 month ago

Hello DepthAI team,

Thank you for the initial suggestions regarding the watchdog and network settings. I wanted to follow up with a bit more detail on how our system is structured and request further guidance on handling dropped device scenarios.

In our application, we register a callback for new frames using DepthAI's API. This callback is triggered whenever a new packet/frame is received. However, this design leaves me uncertain about how to catch exceptions or handle a dropped device event.

Specifically, I'm looking for recommendations on how to implement reconnection logic in a situation where the device is dropped, but no new frames are being received (i.e., the callback is not triggered). Since I rely heavily on the callback to process data, I am unsure where or how to catch any exceptions that might arise due to a device being dropped or encountering other issues. I found this resource online, but I don't think this could work for the callback approach.

Could you provide any examples or guidance on how to detect and handle these cases within the callback system or elsewhere in the data flow? Additionally, if there are best practices for managing reconnections, I would greatly appreciate your insights.

Thank you for your continued support.

cc @themarpe @jakaskerl
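
One possible approach to the question above, sketched under assumptions rather than taken from depthai: since a dropped device simply stops triggering the callback, the callback can record the time of the last frame, and a separate supervisor thread can treat a long gap as a drop and rebuild the pipeline. FrameStallDetector and its methods are hypothetical names. If the device object exposes an isClosed() query (an assumption here), the supervisor could poll that as well.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Records when the last frame arrived so that a thread other than the
// callback can notice that frames have stopped.
class FrameStallDetector {
   public:
    using Clock = std::chrono::steady_clock;

    explicit FrameStallDetector(std::chrono::milliseconds maxGap,
                                Clock::time_point start = Clock::now())
        : maxGapNs_(std::chrono::duration_cast<std::chrono::nanoseconds>(maxGap).count()),
          lastFrameNs_(toNs(start)) {}

    // Call from the frame callback; a single atomic store, safe across threads.
    void onFrame(Clock::time_point now = Clock::now()) {
        lastFrameNs_.store(toNs(now), std::memory_order_relaxed);
    }

    // Call periodically from a supervisor thread, outside the callback.
    bool stalled(Clock::time_point now = Clock::now()) const {
        return toNs(now) - lastFrameNs_.load(std::memory_order_relaxed) > maxGapNs_;
    }

   private:
    static long long toNs(Clock::time_point t) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
    }

    long long maxGapNs_;
    std::atomic<long long> lastFrameNs_;
};
```

The supervisor loop would call stalled() every few hundred milliseconds and, when it returns true, close the old device and run the normal startup path again. The gap should be several frame periods long so that ordinary jitter is not mistaken for a drop.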

guilhermedemouraa commented 1 month ago

Hi @themarpe and @jakaskerl,

Just following up on my previous message. Is there any guidance on how to implement robust reconnection logic using the callback approach?