PoE device intolerant to repeated connections, drops offline until power cycle #547

Open gberardinelli opened 2 years ago

gberardinelli commented 2 years ago

This is possibly related to #415 but I've been unable to run @diablodale's repro code so far.

I use depthai-python to demonstrate below, but similar issues manifest when using c++ from depthai-ros-examples.


Our PoE device frequently goes offline entirely, becoming unresponsive to pings and unable to be found by depthai. This generally seems to occur when restarting depthai processes on the host machine. Killing the process less-than-gracefully increases the frequency of the issue, so perhaps it's related to whatever teardown mechanism is built in.

Additionally, it often seems to hang after device discovery until it's tried again. It remains ping-able in these cases.



Run a pipeline such as this, wait for it to spin up, send it SIGKILL, and repeat.

After a number of times, the device will no longer be found and will not respond to pings.

Here's some code to make that process slightly more scientific. For me this will trigger the issue after about 5-10 iterations.

#!/usr/bin/env python3

import multiprocessing as mp
import os
import time

import signal

os.environ['DEPTHAI_LEVEL'] = 'debug'

import depthai as dai

def run_pipeline():
    pipeline = dai.Pipeline()

    # borrowed from depthai-python/examples/IMU/
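    imu = pipeline.create(dai.node.IMU)
    xlink_out = pipeline.create(dai.node.XLinkOut)
    xlink_out.setStreamName("imu")  # must match the name passed to getOutputQueue below
    imu.enableIMUSensor(dai.IMUSensor.ACCELEROMETER_RAW, 500)
    imu.enableIMUSensor(dai.IMUSensor.GYROSCOPE_RAW, 400)
    imu.out.link(xlink_out.input)  # send IMU packets to the host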

    print('Connecting to device...')
    with dai.Device(pipeline) as device:
        print('Connected, reading IMU data')
        while True:
            q = device.getOutputQueue(name="imu", maxSize=50, blocking=False)
            _ = q.get()

def main():
    for i in range(1, 100):
        print(f'Running process for 30 seconds ({i})')

        p = mp.Process(target=run_pipeline)  # run the pipeline in a separate process
        p.start()
        p.join(timeout=30)  # wait for up to 30 seconds or until the process dies

        if p.exitcode is None:
            # it lasted >30 seconds, was either running successfully or hung up
            print('Process still running, sending SIGKILL')
            os.kill(p.pid, signal.SIGKILL)  # hard-kill the pipeline and try again
        else:
            # it threw an error in <30 seconds, it's probably offline now
            print('Process died on its own, exiting')
            break


if __name__ == '__main__':
    main()

Running the above most recently resulted in:

  1. A hang after Resources - Archive 'depthai-device-fwp-1cfa51d59471f57c43cfef2b69205c227c272958.tar.xz' open
  2. Successful pipeline execution
  3. A hang after Searching for device: DeviceInfo(name=, mxid=18443010017EF50800, X_LINK_BOOTED, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS)
  4. A hang after Searching for device: " "
  5. Successful pipeline execution
  6. A hang after Searching for device: " "
  7. A hang after Searching for device: " "
  8. RuntimeError: No available devices. Can no longer ping the device.

Full output attached: oakd_poe_test_killed_pipeline.txt

I get similar behavior and ultimately the same outcome every time I run the above.

Luxonis-Brandon commented 2 years ago

Sorry about the trouble. We think we have some fixes here for this, we're triaging internally.

And thanks for the reproducible example.

gberardinelli commented 2 years ago

No problem.

Small update: still seeing the same issue when plugged into my laptop via PoE injector (as opposed to the automotive/industrial-style PC with integrated PoE used above).

And I believe this is actually one of the pre-production models if there's any difference there.

gberardinelli commented 2 years ago

Another update: we just received a unit from the regular production run. It experiences the same types of hang-ups during init which require re-trying, but it never goes offline entirely and can always be pinged.

Luxonis-Brandon commented 2 years ago

Thanks for the data and sorry again about the issue here. We think we have solved this, but the solution introduces an entirely new (and worse) bug, so we're digging into that now.

themarpe commented 2 years ago

Hi @gberardinelli

Sorry for late response on this.

Just to confirm - using a regular production unit you do not require powercycling the device in any case? (eg getting it into that state?)

As for the BL/SW version, the initial data points seem to be using the version with these issues addressed. One other thing to try would be the latest library (v2.17.3).

gberardinelli commented 2 years ago

No worries.

That's correct, our two regular production units have been very reliable and have never required a power cycle. Once in a while they take a couple of attempts at starting a pipeline before they get picked up, but that's not a big deal.

Regarding the pre-prod model, I just installed the latest depthai-python (which seems to be 2.17.4) and re-ran the above code. It still falls offline in the same way as before-- not pingable and needs a power cycle.

themarpe commented 2 years ago

Thanks @gberardinelli

Do you mind letting me know the MXID of the pre-production unit? Also, we'd love to check this specific case out - could you reach out to mentioning this issue and we'd swap the pre-production unit with a new one, and we'll analyze the behavior of your unit locally.

As an additional datapoint - the same host & networking equipment was used for both tests of pre and production units?

gberardinelli commented 2 years ago

Sounds good to me, thank you. I'll be in touch via email.

MXID: 18443010017EF50800

And yes indeed, same network hardware. I actually tested both pre and prod devices on two different network setups, described in this thread. Only the pre-production unit had issues on both setups.

hrfuller commented 1 year ago

I'm using an OAK-D W PoE camera and experiencing some similar sounding issues. I can open a new ticket if you think it would be more useful @Luxonis-Brandon.

The symptoms appear after a handful of pipeline runs (the exact number seems to vary): creating a new dai.Device() fails with RuntimeError: No available devices. I can no longer ping the device after that.

My setup is:

>>> d.getBootloaderVersion().toStringSemver()
>>> dai.__version__

hrfuller commented 1 year ago

I think I figured out a workaround. I was using a context manager to close the device, but also a signal handler to close the same device on SIGINT or SIGTERM. These two device closing methods seem not to play nicely, so removing the context manager in favor of the signal handler seems to close the device in a way where I can re-attach to it in subsequent runs without power cycling.
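
A minimal sketch of the single-close-path idea described above, assuming the device object exposes a `close()` method (as `dai.Device` does). `DeviceGuard` is a hypothetical helper name, not part of depthai; the point is only that the signal handler and any normal exit path share one idempotent close, so the device is never closed twice.

```python
import signal
import threading


class DeviceGuard:
    """Closes a device exactly once, from a signal handler or a normal exit."""

    def __init__(self, device):
        self._device = device
        self._closed = False
        self.stop_requested = threading.Event()

    def close(self):
        # Idempotent: a second call is a no-op.
        if not self._closed:
            self._closed = True
            self._device.close()

    def _handle_signal(self, signum, frame):
        # Tell the main loop to stop, then release the device.
        self.stop_requested.set()
        self.close()

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # signal.signal() may only be called from the main thread.
        for sig in signals:
            signal.signal(sig, self._handle_signal)
```

With depthai this would be used roughly as `guard = DeviceGuard(dai.Device(pipeline)); guard.install()`, followed by a read loop that exits once `guard.stop_requested.is_set()`. Whether this avoids the lock-up on a given unit is not verified here.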

RupeshJey commented 1 year ago

Hello! I have been having this issue as well when working with two OAK-D PoE's. Not sure if there's any resolution I might have missed? Happy to provide information if needed.

For now, I have been hacking a solution by doing:

device = None
try:
    device = self.stack.enter_context(dai.Device(device_info))
except RuntimeError:
    print(f"Retrying connection to {device_name}... ")
    device = self.stack.enter_context(dai.Device(device_info))

This fails in the try clause about 50% of the time but seemingly always works on the second try in the except clause. Obviously not an ideal solution, but would appreciate any guidance to resolve this issue.
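
The same retry idea can be written as a small reusable helper. This is only a sketch: `connect_with_retry` and its parameters are hypothetical names, and the assumption that a failed connection surfaces as `RuntimeError` is taken from the error messages quoted in this thread.

```python
import time


def connect_with_retry(connect, attempts=3, delay_s=1.0, backoff=2.0,
                       retry_on=(RuntimeError,), sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_s:.1f}s")
            sleep(delay_s)
            delay_s *= backoff  # wait longer before each further attempt
```

In the snippet above it would wrap the connection call, for example `device = connect_with_retry(lambda: self.stack.enter_context(dai.Device(device_info)))`. It masks the symptom rather than fixing the underlying problem.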

TobyUllrich commented 11 months ago

Hello, we have 3 OAK D PoE cameras; some are PoE W Pro and some just PoE W. We are experiencing this issue on some of them, both Pro and non-Pro. Here are the IDs:

| MXID | Model | State |
| --- | --- | --- |
| 1844301091EB9C0F00 | OAK D PoE W | Recovers |
| 1944301041407E1300 | OAK D PoE W Pro | Recovers |
| 1944301051407E1300 | OAK D PoE W | Recovers |
| 18443010C1385D0F00 | OAK D PoE W Pro | Does not Recover |
| 1844301021969E0F00 | OAK D PoE W Pro | Does not Recover |
| 1844301071229D0F00 | OAK D PoE W Pro | Recovers |

As we are using the cameras on autonomous mobile robots without a human onsite we need them to work reliably when the software starts. Due to OS compatibility, we are running the code for the cameras in a Docker container but the issue also persists when running the code directly on the system. The PoE router we are using is the Netgear GS308EP PoE Switch. Is it possible to fix this issue in software or do we need to get our broken hardware exchanged? If the latter applies who do I contact?

The code we are using looks like this, but we also tested different versions and they lead to the same result.

from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter, BooleanOptionalAction
import subprocess as sp
import depthai as dai

# # Parse command line arguments
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("-rip", "--rtsp_host", default="localhost", type=str,   help="Host of the RTSP Server")
parser.add_argument("-rp",  "--rtsp_port", default=8554,        type=int,   help="Port of the RTSP Server")
parser.add_argument("-qa",  "--quality",   default=100,         type=int,   help="Video quality, from 1 to 100")
parser.add_argument("-ip",  "--ip_camera", default=None,                    help="set the IP of the used Camera")
parser.add_argument("-c",   "--color",     default="color",                 help="set the color mode of the camera")
parser.add_argument("-fps", "--frame_rate", default=30, type=int, help="Set Framerate of the Stream")
parser.add_argument("-n",   "--name",     default="main",                 help="set stream endpoint")
parser.add_argument("-ss",   "--stream_select",     default="main",                 help="select the stream to publish: main or sub")
bool_args = parser.parse_args()
args = vars(parser.parse_args())

CAMERA_IP = args["ip_camera"]
FPS = args["frame_rate"]
QUALITY   = bool_args.quality
RTSP_HOST = args["rtsp_host"]
RTSP_PORT = args["rtsp_port"]
NAME = args["name"]
COLOR = args["color"]
STREAM_SELECT = args["stream_select"]

def setupPipeline():
    pipeline = dai.Pipeline()

    # Color Pipeline
    if COLOR == "color":
        colorCam = pipeline.create(dai.node.ColorCamera)
        colorCam.setPreviewSize(1920, 1080)
        colorCam.setVideoSize(1920, 1080)

        # Main Stream
        if STREAM_SELECT == "main":
            colorVidEnc = pipeline.create(dai.node.VideoEncoder)
            colorVidEnc.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            colorVidEnc.setKeyframeFrequency(FPS * 4)

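            colorVeOut = pipeline.create(dai.node.XLinkOut)
            colorVeOut.setStreamName("colorEncodedMain")  # name read back via getOutputQueue

            # Camera video -> H.264 encoder -> XLink output
            colorCam.video.link(colorVidEnc.input)
            colorVidEnc.bitstream.link(colorVeOut.input)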

        # Sub Stream 360p
        if STREAM_SELECT == "sub":
            imgScale = pipeline.create(dai.node.ImageManip)

            colorVidEnc1 = pipeline.create(dai.node.VideoEncoder)
            colorVidEnc1.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            colorVidEnc1.setKeyframeFrequency(FPS * 4)

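            colorVeOut1 = pipeline.create(dai.node.XLinkOut)
            colorVeOut1.setStreamName("colorEncodedSub")  # name read back via getOutputQueue

            # Assumed wiring: camera video -> ImageManip -> H.264 encoder -> XLink output
            colorCam.video.link(imgScale.inputImage)
            imgScale.out.link(colorVidEnc1.input)
            colorVidEnc1.bitstream.link(colorVeOut1.input)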

    # Mono Pipeline
    if COLOR == "mono":
        # Main Stream
        monoCam = pipeline.create(dai.node.MonoCamera)

        if STREAM_SELECT == "main":
            monoVidenc = pipeline.create(dai.node.VideoEncoder)
            monoVidenc.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            monoVidenc.setKeyframeFrequency(FPS * 4)

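            monoVeOut = pipeline.create(dai.node.XLinkOut)
            monoVeOut.setStreamName("monoEncodedMain")  # name read back via getOutputQueue

            # Mono camera -> H.264 encoder -> XLink output
            monoCam.out.link(monoVidenc.input)
            monoVidenc.bitstream.link(monoVeOut.input)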

        if STREAM_SELECT == "sub":
            # Sub Stream
            monoImageManip = pipeline.create(dai.node.ImageManip)

            monoVidencSub = pipeline.create(dai.node.VideoEncoder)
            monoVidencSub.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            monoVidencSub.setKeyframeFrequency(FPS * 4)

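            monoVeOutSub = pipeline.create(dai.node.XLinkOut)
            monoVeOutSub.setStreamName("monoEncodedSub")  # name read back via getOutputQueue

            # Assumed wiring: mono camera -> ImageManip -> H.264 encoder -> XLink output
            monoCam.out.link(monoImageManip.inputImage)
            monoImageManip.out.link(monoVidencSub.input)
            monoVidencSub.bitstream.link(monoVeOutSub.input)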

    return pipeline

def getCameraInfo(cam_ip):
    device_infos = dai.Device.getAllAvailableDevices()
    if len(device_infos) == 0:
        raise RuntimeError("No DepthAI device found!")
    print("Available devices:")
    for i, info in enumerate(device_infos):
        print(f"[{i}] {info.getMxId()} [{info.state.name}]")
    try:
        # Build a DeviceInfo for the camera at the requested IP address
        device = dai.DeviceInfo(cam_ip)
    except Exception:
        raise ValueError("Incorrect value supplied: {}".format(cam_ip))
    return device

if __name__ == "__main__":

    pipeline = setupPipeline()
    device_info = getCameraInfo(CAMERA_IP)

    # ffmpeg reads the encoded stream from stdin and publishes it to the RTSP server
    if COLOR == "color" and STREAM_SELECT == "main":
        color_main_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_main",
        ]

    if COLOR == "color" and STREAM_SELECT == "sub":
        color_sub_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_sub",
        ]

    if COLOR == "mono" and STREAM_SELECT == "main":
        mono_main_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_main",
        ]

    if COLOR == "mono" and STREAM_SELECT == "sub":
        mono_sub_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_sub",
        ]

    try:
        if COLOR == "color" and STREAM_SELECT == "main":
            color_main_rtsp_proc = sp.Popen(color_main_stream_cmd, stdin=sp.PIPE)  # Start the ffmpeg process
        if COLOR == "color" and STREAM_SELECT == "sub":
            color_sub_rtsp_proc = sp.Popen(color_sub_stream_cmd, stdin=sp.PIPE)  # Start the ffmpeg process
        if COLOR == "mono" and STREAM_SELECT == "main":
            mono_main_rtsp_proc = sp.Popen(mono_main_stream_cmd, stdin=sp.PIPE)  # Start the ffmpeg process
        if COLOR == "mono" and STREAM_SELECT == "sub":
            mono_sub_rtsp_proc = sp.Popen(mono_sub_stream_cmd, stdin=sp.PIPE)  # Start the ffmpeg process
    except OSError:
        exit("Error: cannot run ffmpeg!\nTry running: sudo apt install ffmpeg")

    with dai.Device(pipeline, device_info) as device:
        if COLOR == "color" and STREAM_SELECT == "main":
            colorMainEncoded = device.getOutputQueue("colorEncodedMain", maxSize=40, blocking=True)
        if COLOR == "color" and STREAM_SELECT == "sub":
            colorSubEncoded = device.getOutputQueue("colorEncodedSub", maxSize=40, blocking=True)
        if COLOR == "mono" and STREAM_SELECT == "main":
            monoMainEncoded = device.getOutputQueue("monoEncodedMain", maxSize=40, blocking=True)
        if COLOR == "mono" and STREAM_SELECT == "sub":
            monoSubEncoded = device.getOutputQueue("monoEncodedSub", maxSize=40, blocking=True)

        if COLOR == "color" and STREAM_SELECT == "main":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_main\"")
        if COLOR == "color" and STREAM_SELECT == "sub":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_sub\"")
        if COLOR == "mono" and STREAM_SELECT == "main":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_main\"")
        if COLOR == "mono" and STREAM_SELECT == "sub":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_sub\"")

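        try:
            while True:
                if COLOR == "color" and STREAM_SELECT == "main":
                    data1 = colorMainEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    color_main_rtsp_proc.stdin.write(data1)
                if COLOR == "color" and STREAM_SELECT == "sub":
                    data2 = colorSubEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    color_sub_rtsp_proc.stdin.write(data2)
                if COLOR == "mono" and STREAM_SELECT == "main":
                    data3 = monoMainEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    mono_main_rtsp_proc.stdin.write(data3)
                if COLOR == "mono" and STREAM_SELECT == "sub":
                    data4 = monoSubEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    mono_sub_rtsp_proc.stdin.write(data4)
        except (KeyboardInterrupt, BrokenPipeError):
            pass  # stop on Ctrl+C or when ffmpeg has exited

        # Close ffmpeg's stdin so it can flush and exit
        if COLOR == "color" and STREAM_SELECT == "main":
            color_main_rtsp_proc.stdin.close()
        if COLOR == "color" and STREAM_SELECT == "sub":
            color_sub_rtsp_proc.stdin.close()
        if COLOR == "mono" and STREAM_SELECT == "main":
            mono_main_rtsp_proc.stdin.close()
        if COLOR == "mono" and STREAM_SELECT == "sub":
            mono_sub_rtsp_proc.stdin.close()

Erol444 commented 11 months ago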

Hi @TobyUllrich , What's the depthai version you are using and bootloader version that's flashed on the device? The script you sent isn't very minimal, so it doesn't help with the debugging process.

TobyUllrich commented 11 months ago

Depthai version:
Bootloader: 0.0.22 and 0.0.21. The cameras that don't work are on 0.0.22, but some cameras with 0.0.22 work as well.

If you can supply me with a more minimal pipeline I should test, then I can do so.

What we are trying to do with the cameras is to get the encoded stream of a selected camera as an RTSP stream. This works fine with all cameras, but when stopping the Python program some of the cameras are no longer reachable over the network and need a power cycle to work again.

Erol444 commented 11 months ago

@TobyUllrich could you try flashing the latest bootloader version (0.0.26) and using the latest depthai (2.23)?

themarpe commented 11 months ago

@TobyUllrich do you mind also sending over the calibration_dump results (examples/calibration), that'd be great.

saching13 commented 11 months ago

@TobyUllrich can I know when these were purchased ?

TobyUllrich commented 10 months ago

@Erol444 I flashed one of the cameras experiencing the issue with bootloader 0.0.26 and updated our code to use version 2.23 of the DepthAI SDK; the update was done by using the latest depthai Docker image. But I am seeing the same behavior: the camera does not come back on the network after a few software restarts and needs to be power cycled to be pingable.

@themarpe I think this is what you want? cali_dump.txt

@saching13 The units were bought in April 2023 from this reseller: wimood

TobyUllrich commented 10 months ago

Has there been any updates on fixing this issue or replacing the non-performing cameras?

themarpe commented 10 months ago

@TobyUllrich we'll prepare a branch with a potential fix in a bit. Will get back to you once ready

themarpe commented 10 months ago

@TobyUllrich Please give the following branch a test: powercycle_fix_imu and let us know if that resolves the issue of cameras not coming back online

TobyUllrich commented 10 months ago

@themarpe I tried the branch with depthai version and bootloader 0.0.26, but I observed the same error behavior: the camera is not pingable after a few restarts of the software.

themarpe commented 10 months ago

@TobyUllrich interesting - we might be hitting a different bug here. Do you mind creating an MRE that you can share and that reproduces the issue? (The one you've mentioned above looks close, but isn't the full one, correct?)

TobyUllrich commented 10 months ago

@themarpe I am using the above example with the parameters -ip . Our system has an RTSP server waiting for the stream, but it is not needed for testing. There might be a better way to set the desired stream settings, but this option sufficed for now. If you have any code I should try, feel free to send it to me and I can see if it causes the same issues.

If you want to test with the RTSP server, bluenviron/mediamtx, it runs in docker with network mode host

themarpe commented 10 months ago

@TobyUllrich so the above code running as is crashes after some time, and the device does not come online afterwards? Thanks, we'll try to reproduce this on our end as well

TobyUllrich commented 10 months ago


We use the above code inside a Docker image based on the DepthAI Docker image (tests were also done on a bare-metal system). Once it starts, it runs without crashing, but when we stop the program and try to restart it, the device is not pingable (this happens after 10 to 50 cycles, sometimes fewer). So we start the program -> run it for some time -> stop the program -> try to start again, but the device is not pingable (not recoverable).

The reason we need to restart the program is because the cameras are used on remote mobile inspection robots and we stop the software to reduce compute power.

themarpe commented 10 months ago

@TobyUllrich I see - how do you stop the program? If you kill it, it can cause issues. Try sending a signal (SIGINT) to shut down the Python script gracefully, so that the device destructor runs as well.
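
As a rough sketch of that suggestion from the supervising side (POSIX only): send SIGINT first so the child can unwind and close the device, and fall back to SIGKILL only if it does not exit in time. `stop_gracefully` and its timeout are hypothetical, chosen for illustration.

```python
import signal
import subprocess


def stop_gracefully(proc, timeout_s=10.0):
    """Ask a child process to exit via SIGINT; SIGKILL it if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.send_signal(signal.SIGINT)  # normally raises KeyboardInterrupt in a Python child
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort: no cleanup code runs in the child
        return proc.wait()
```

Here `proc` would be the `subprocess.Popen` handle of the camera script. The SIGKILL fallback is exactly the ungraceful case discussed in this thread, so it should stay the exception.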

diablodale commented 10 months ago

Stepping in for reality check. @themarpe, the method of the host program stopping is irrelevant.

The host program may stop or be lost through a graceful exit, a data center going offline, or an asteroid hitting the host server.

In all lost host scenarios 🧚☄️, the remote OAK PoE device should itself be reliable: it should notice the loss of the host program, do its own internal cleanup, and ready itself for a new connection. You want a fleet of OAK devices? Then your fleet needs to support a customer's entire datacenter suddenly going offline (e.g. AWS us-east-1), the customer failing over to another geo-located datacenter (e.g. AWS us-west-1), and the fleet of OAKs being ready for the new incoming host connections.

Sudden, random, and repeated killing of host programs should already be part of regular CI testing of OAK devices. I have reported multiple bugs (some still open) where OAK devices fail on repeated start/stop/death. This is a weak area that needs some focus.

themarpe commented 10 months ago

Totally agreed @diablodale - it was meant more as a "bandaid" for the time being. We are investigating this further and trying to replicate it, as it seems to be different from the "IMU induced powercycle" issue of past devices (which was only partially resolved, given that the improved fix above didn't resolve matters).

themarpe commented 10 months ago

@TobyUllrich and just to be sure, do you mind running the program on that temporary branch with DEPTHAI_LEVEL=debug env var set to confirm the library version is indeed: 8b640ae041679d261043e37d4aa22b4c4c38b2df? To really pin down this being a separate issue

TobyUllrich commented 10 months ago

@themarpe Yes I can Confirm I am running version 8b640ae041679d261043e37d4aa22b4c4c38b2df.

TobyUllrich commented 10 months ago

@themarpe Do you have any new branch we should test?

TobyUllrich commented 9 months ago

@themarpe Has there been any progress? If not I think we need to start replacing the defective devices

themarpe commented 9 months ago

@TobyUllrich - we've been a tad busy this week, so haven't pinned this down yet. We were able to reproduce however. We are targeting on having much more information end of Monday.

We can also arrange for a replacement: reach out to support @ CC: @Erol444 to have those devices replaced from the office, so that we can also analyze them locally.

themarpe commented 4 months ago

@TobyUllrich et al

We've identified a likely root cause for the powercycle issues on PoE devices:

The PR is WIP but it boils down to updating the (main/factory) Bootloader.

The release is intended to go out EOW

TobyUllrich commented 4 months ago

@themarpe Thank you very much. Do you have any instructions to update the Bootloader? What will be the Release version number?

themarpe commented 4 months ago


The release will be part of DepthAI v2.26 - if you want to test it now, you can do so on that branch and flash the bootloader as such:

depthai needs to be updated first, using either of the following (the commands below use the depthai-python repo for ease of use):

git fetch
git checkout bootup_stability_fixes
python3 examples/

or install directly with:

python3 -m pip install --extra-index-url depthai==

And then the factory bootloader needs to be upgraded (importantly, upgrading only the user BL is not enough):

python3 examples/bootloader/ network

It will flash version 0.0.27+5fb331f993adceeeda72202c233a9e3939ab3dab, which is currently a development version; v2.26 will contain a full BL v0.0.28 release.
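
To check whether a unit already carries a new enough bootloader, the reported version string can be compared numerically. This is a sketch with hypothetical helper names (`parse_version`, `is_at_least`); it assumes the MAJOR.MINOR.PATCH+build format shown above, and the minimum to compare against is the caller's choice.

```python
def parse_version(text):
    """Turn '0.0.27+5fb331f' into ((0, 0, 27), '5fb331f')."""
    core, _, build = text.strip().lstrip("v").partition("+")
    parts = tuple(int(p) for p in core.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    return parts, build


def is_at_least(reported, minimum):
    """True if the numeric part of `reported` is >= that of `minimum`."""
    return parse_version(reported)[0] >= parse_version(minimum)[0]
```

For example, `is_at_least("0.0.22", "0.0.27")` is false, so that unit would still need the factory bootloader flashed. The build suffix after `+` is ignored in the comparison.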

diablodale commented 4 months ago

Thanks for the update.

I caution Luxonis that my PoE customers are choosing Orbbec's Femto Mega instead of OAK because it has delivered reliable PoE, excellent ToF depth perception, and high frame rates. The RVC2 PoE hardware design flaws documented in previous issues I opened, which Brandon acknowledged, have to-date prevented my customers from choosing RVC2 PoE OAK.

themarpe commented 4 months ago

Thanks for the insights @diablodale - we are continuing to improve the PoE lineup and fix the issues, especially on the stability front, so thanks for bringing this up.

For the ToF applications, we are completing the following model:, which will provide both ToF & Stereo capabilities

diablodale commented 4 months ago

Release mentions PoE work. However, the release doc doesn't mention a factory firmware burn needs to occur as mentioned above. What is the procedure for v2.26?

themarpe commented 4 months ago


  • Improved PoE stability on reboots - eliminate the case where powercycle of the device was sometimes needed

It is mentioned in the release notes, under the above segment. One can utilize either the device manager or the flash_bootloader script to flash the updated factory bootloader

diablodale commented 4 months ago

I was wrong. 🤦‍♂️Uff, sorry for the spam.