gberardinelli opened this issue 2 years ago
Sorry about the trouble. We think we have some fixes for this; we're triaging internally.
And thanks for the reproducible example.
No problem.
Small update: still seeing the same issue when plugged into my laptop via a PoE injector (as opposed to the automotive/industrial-style PC with integrated PoE used above).
And I believe this is actually one of the pre-production models if there's any difference there.
Another update: we just received a unit from the regular production run. It experiences the same types of hang-ups during init which require re-trying, but it never goes offline entirely and can always be pinged.
Thanks for the data, and sorry again about the issue here. We think we have solved this, but the solution introduces an entirely new (and worse) bug, so we're digging into that now.
Hi @gberardinelli
Sorry for late response on this.
Just to confirm: using a regular production unit, you never need to power-cycle the device (i.e., it never gets into that state)?
As far as the bootloader/firmware version goes, the initial data points seem to be from a version with these issues already addressed. One other thing to try would be the latest library (v2.17.3).
No worries.
That's correct, our two regular production units have been very reliable and have never required a power cycle. Once in a while they take a couple of attempts at starting a pipeline before they get picked up, but that's not a big deal.
Regarding the pre-prod model, I just installed the latest depthai-python (which seems to be 2.17.4) and re-ran the above code. It still falls offline in the same way as before: not pingable and needs a power cycle.
Thanks @gberardinelli
Do you mind letting me know the MXID of the pre-production unit? Also, we'd love to check this specific case out. Could you reach out to support@luxonis.com mentioning this issue? We'd swap the pre-production unit for a new one and analyze its behavior locally.
As an additional data point: was the same host & networking equipment used for the tests of both the pre-production and production units?
Sounds good to me, thank you. I'll be in touch via email.
MXID: 18443010017EF50800
And yes indeed, same network hardware. I actually tested both the pre-production and production devices on two different network setups, described in this thread. Only the pre-production unit had issues on both setups.
I'm using an OAK-D W PoE camera and experiencing some similar-sounding issues. I can open a new ticket if you think it would be more useful, @Luxonis-Brandon.
The symptoms come after a handful of pipeline runs (the exact number seems to vary): creating a new dai.Device() fails with RuntimeError: No available devices, and I can no longer ping the device after that.
My setup is:

```
>>> d.getBootloaderVersion().toStringSemver()
'0.0.22'
>>> dai.__version__
'2.21.2.0'
```
I think I figured out a workaround. I was using a context manager to close the device, but also a signal handler to close the same device on SIGINT or SIGTERM. These two device-closing methods seem not to play nicely together, so removing the context manager in favor of the signal handler seems to close the device in a way that lets me re-attach to it in subsequent runs without power cycling.
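For anyone else trying this, a minimal sketch of that signal-handler pattern (the empty pipeline and sleep loop are placeholders for the real application):

```python
import signal
import sys
import time
import depthai as dai

pipeline = dai.Pipeline()  # placeholder: build the real pipeline here

# No context manager -- the signal handler is the single owner of cleanup
device = dai.Device(pipeline)

def shutdown(signum, frame):
    # Close the device explicitly so it can be re-attached to later
    device.close()
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)
signal.signal(signal.SIGTERM, shutdown)

while True:
    time.sleep(1)  # real code would drain the device's output queues here
```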
Hello! I have been having this issue as well when working with two OAK-D PoEs. Not sure if there's a resolution I might have missed? Happy to provide information if needed.
For now, I have been hacking around it by doing:
```python
device = None
try:
    device = self.stack.enter_context(dai.Device(device_info))
except RuntimeError:
    print(f"Retrying connection to {device_name}... ")
    time.sleep(5)
    device = self.stack.enter_context(dai.Device(device_info))
```
This fails in the try clause about 50% of the time but seemingly always works on the second attempt in the except clause. It's obviously not an ideal solution, but I would appreciate any guidance on resolving this issue.
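If it helps, the one-shot retry above generalizes into a small helper; `connect_oak` and its parameters are illustrative names, not part of the depthai API:

```python
import time
import depthai as dai

def connect_oak(device_info, max_attempts=5, delay_s=5.0):
    # Retry dai.Device creation, since the first attempt sometimes fails
    for attempt in range(1, max_attempts + 1):
        try:
            return dai.Device(device_info)
        except RuntimeError as e:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({e}), retrying in {delay_s}s...")
            time.sleep(delay_s)
```

The returned device can still be handed to the ExitStack (`self.stack.enter_context(...)`) as in the snippet above.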
Hello, we have 3 OAK-D PoE cameras, some PoE W Pro and some just PoE W, and we are experiencing this issue on some of them, both Pros and non-Pros. Here are the IDs:

| MXID | Model | State |
|---|---|---|
| 1844301091EB9C0F00 | OAK D PoE W | Recovers |
| 1944301041407E1300 | OAK D PoE W Pro | Recovers |
| 1944301051407E1300 | OAK D PoE W | Recovers |
| 18443010C1385D0F00 | OAK D PoE W Pro | Does not recover |
| 1844301021969E0F00 | OAK D PoE W Pro | Does not recover |
| 1844301071229D0F00 | OAK D PoE W Pro | Recovers |
As we are using the cameras on autonomous mobile robots without a human on site, we need them to work reliably when the software starts. Due to OS compatibility we run the camera code in a Docker container, but the issue also persists when running the code directly on the system. The PoE switch we are using is a Netgear GS308EP. Is it possible to fix this issue in software, or do we need to get our broken hardware exchanged? If the latter, whom do I contact?
The code we are using looks like this; we also tested different versions, and they lead to the same result.
```python
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import subprocess as sp
import depthai as dai

# Parse command line arguments
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("-rip", "--rtsp_host", default="localhost", type=str, help="Host of the RTSP server")
parser.add_argument("-rp", "--rtsp_port", default=8554, type=int, help="Port of the RTSP server")
parser.add_argument("-qa", "--quality", default=100, type=int, help="Video quality, from 1 to 100")
parser.add_argument("-ip", "--ip_camera", default=None, help="Set the IP of the camera to use")
parser.add_argument("-c", "--color", default="color", help="Set the color mode of the camera")
parser.add_argument("-fps", "--frame_rate", default=30, type=int, help="Set the framerate of the stream")
parser.add_argument("-n", "--name", default="main", help="Set the stream endpoint name")
parser.add_argument("-ss", "--stream_select", default="main", help="Select the main or sub stream")
args = vars(parser.parse_args())

CAMERA_IP = args["ip_camera"]
FPS = args["frame_rate"]
QUALITY = args["quality"]
RTSP_HOST = args["rtsp_host"]
RTSP_PORT = args["rtsp_port"]
NAME = args["name"]
COLOR = args["color"]
STREAM_SELECT = args["stream_select"]


def setupPipeline():
    pipeline = dai.Pipeline()

    # Color pipeline
    if COLOR == "color":
        colorCam = pipeline.create(dai.node.ColorCamera)
        colorCam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
        colorCam.setInterleaved(False)
        colorCam.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)
        colorCam.setFps(FPS)
        colorCam.setPreviewSize(1920, 1080)
        colorCam.setVideoSize(1920, 1080)

        # Main stream
        if STREAM_SELECT == "main":
            colorVidEnc = pipeline.create(dai.node.VideoEncoder)
            colorVidEnc.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            colorVidEnc.setKeyframeFrequency(FPS * 4)
            colorVidEnc.setQuality(QUALITY)
            colorCam.video.link(colorVidEnc.input)
            colorVeOut = pipeline.create(dai.node.XLinkOut)
            colorVeOut.setStreamName("colorEncodedMain")
            colorVidEnc.bitstream.link(colorVeOut.input)

        # Sub stream, 360p
        if STREAM_SELECT == "sub":
            imgScale = pipeline.create(dai.node.ImageManip)
            imgScale.initialConfig.setResize(640, 360)
            imgScale.initialConfig.setFrameType(dai.ImgFrame.Type.NV12)
            colorCam.video.link(imgScale.inputImage)
            colorVidEnc1 = pipeline.create(dai.node.VideoEncoder)
            colorVidEnc1.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            colorVidEnc1.setKeyframeFrequency(FPS * 4)
            colorVidEnc1.setQuality(QUALITY)
            imgScale.out.link(colorVidEnc1.input)
            colorVeOut1 = pipeline.create(dai.node.XLinkOut)
            colorVeOut1.setStreamName("colorEncodedSub")
            colorVidEnc1.bitstream.link(colorVeOut1.input)

    # Mono pipeline
    if COLOR == "mono":
        monoCam = pipeline.create(dai.node.MonoCamera)
        monoCam.setBoardSocket(dai.CameraBoardSocket.LEFT)
        monoCam.setResolution(dai.MonoCameraProperties.SensorResolution.THE_800_P)
        monoCam.setFps(FPS)

        # Main stream
        if STREAM_SELECT == "main":
            monoVidenc = pipeline.create(dai.node.VideoEncoder)
            monoVidenc.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            monoVidenc.setKeyframeFrequency(FPS * 4)
            monoVidenc.setQuality(QUALITY)
            monoCam.out.link(monoVidenc.input)
            monoVeOut = pipeline.create(dai.node.XLinkOut)
            monoVeOut.setStreamName("monoEncodedMain")
            monoVidenc.bitstream.link(monoVeOut.input)

        # Sub stream
        if STREAM_SELECT == "sub":
            monoImageManip = pipeline.create(dai.node.ImageManip)
            monoImageManip.initialConfig.setResize(320, 200)
            monoImageManip.initialConfig.setFrameType(dai.ImgFrame.Type.NV12)
            monoCam.out.link(monoImageManip.inputImage)
            monoVidencSub = pipeline.create(dai.node.VideoEncoder)
            monoVidencSub.setDefaultProfilePreset(FPS, dai.VideoEncoderProperties.Profile.H264_MAIN)
            monoVidencSub.setKeyframeFrequency(FPS * 4)
            monoVidencSub.setQuality(QUALITY)
            monoImageManip.out.link(monoVidencSub.input)
            monoVeOutSub = pipeline.create(dai.node.XLinkOut)
            monoVeOutSub.setStreamName("monoEncodedSub")
            monoVidencSub.bitstream.link(monoVeOutSub.input)

    return pipeline


def getCameraInfo(cam_ip):
    device_infos = dai.Device.getAllAvailableDevices()
    if len(device_infos) == 0:
        raise RuntimeError("No DepthAI device found!")
    print("Available devices:")
    for i, info in enumerate(device_infos):
        print(f"[{i}] {info.getMxId()} [{info.state.name}]")
    try:
        device = dai.DeviceInfo(cam_ip)
    except Exception:
        raise ValueError(f"Incorrect value supplied: {cam_ip}")
    return device


if __name__ == "__main__":
    pipeline = setupPipeline()
    device_info = getCameraInfo(CAMERA_IP)

    if COLOR == "color" and STREAM_SELECT == "main":
        color_main_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_main"
        ]
    if COLOR == "color" and STREAM_SELECT == "sub":
        color_sub_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_sub"
        ]
    if COLOR == "mono" and STREAM_SELECT == "main":
        mono_main_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_main"
        ]
    if COLOR == "mono" and STREAM_SELECT == "sub":
        mono_sub_stream_cmd = [
            "ffmpeg",
            "-probesize", "100M",
            "-i", "-",
            "-f", "rtsp",
            "-rtsp_transport", "udp",
            "-framerate", str(FPS),
            "-vcodec", "copy",
            "-v", "error",
            f"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_sub"
        ]

    try:
        # Start the ffmpeg process for the selected stream
        if COLOR == "color" and STREAM_SELECT == "main":
            color_main_rtsp_proc = sp.Popen(color_main_stream_cmd, stdin=sp.PIPE)
        if COLOR == "color" and STREAM_SELECT == "sub":
            color_sub_rtsp_proc = sp.Popen(color_sub_stream_cmd, stdin=sp.PIPE)
        if COLOR == "mono" and STREAM_SELECT == "main":
            mono_main_rtsp_proc = sp.Popen(mono_main_stream_cmd, stdin=sp.PIPE)
        if COLOR == "mono" and STREAM_SELECT == "sub":
            mono_sub_rtsp_proc = sp.Popen(mono_sub_stream_cmd, stdin=sp.PIPE)
    except OSError:
        exit("Error: cannot run ffmpeg!\nTry running: sudo apt install ffmpeg")

    with dai.Device(pipeline, device_info) as device:
        if COLOR == "color" and STREAM_SELECT == "main":
            colorMainEncoded = device.getOutputQueue("colorEncodedMain", maxSize=40, blocking=True)
        if COLOR == "color" and STREAM_SELECT == "sub":
            colorSubEncoded = device.getOutputQueue("colorEncodedSub", maxSize=40, blocking=True)
        if COLOR == "mono" and STREAM_SELECT == "main":
            monoMainEncoded = device.getOutputQueue("monoEncodedMain", maxSize=40, blocking=True)
        if COLOR == "mono" and STREAM_SELECT == "sub":
            monoSubEncoded = device.getOutputQueue("monoEncodedSub", maxSize=40, blocking=True)

        if COLOR == "color" and STREAM_SELECT == "main":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_main\"")
        if COLOR == "color" and STREAM_SELECT == "sub":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/color_sub\"")
        if COLOR == "mono" and STREAM_SELECT == "main":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_main\"")
        if COLOR == "mono" and STREAM_SELECT == "sub":
            print(f"Setup finished, RTSP stream available under \"rtsp://{RTSP_HOST}:{RTSP_PORT}/{NAME}/mono_sub\"")

        try:
            while True:
                if COLOR == "color" and STREAM_SELECT == "main":
                    data1 = colorMainEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    color_main_rtsp_proc.stdin.write(data1)
                if COLOR == "color" and STREAM_SELECT == "sub":
                    data2 = colorSubEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    color_sub_rtsp_proc.stdin.write(data2)
                if COLOR == "mono" and STREAM_SELECT == "main":
                    data3 = monoMainEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    mono_main_rtsp_proc.stdin.write(data3)
                if COLOR == "mono" and STREAM_SELECT == "sub":
                    data4 = monoSubEncoded.get().getData()  # Blocking call, will wait until new data has arrived
                    mono_sub_rtsp_proc.stdin.write(data4)
        except (KeyboardInterrupt, BrokenPipeError):
            pass  # stop streaming on Ctrl+C or when ffmpeg exits

        if COLOR == "color" and STREAM_SELECT == "main":
            color_main_rtsp_proc.stdin.close()
        if COLOR == "color" and STREAM_SELECT == "sub":
            color_sub_rtsp_proc.stdin.close()
        if COLOR == "mono" and STREAM_SELECT == "main":
            mono_main_rtsp_proc.stdin.close()
        if COLOR == "mono" and STREAM_SELECT == "sub":
            mono_sub_rtsp_proc.stdin.close()
```
Hi @TobyUllrich, what depthai version are you using, and which bootloader version is flashed on the device? The script you sent isn't very minimal, so it doesn't help much with the debugging process.
DepthAI version: 2.22.0.0.dev+dev. Bootloader: 0.0.22 and 0.0.21. The cameras that don't work are on 0.0.22, but some cameras with 0.0.22 work as well.
If you can supply me with a more minimal pipeline to test, I can do so.
What we are trying to do with the cameras is to get the encoded stream of a selected camera as an RTSP stream. This works fine with all cameras, but when stopping the Python program, some of the cameras are no longer reachable over the network and thus need a power cycle to work again.
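For reference, a more minimal reproduction could look something like this (a sketch assuming the same color/H.264 pipeline as the full script, with the device IP passed as the first CLI argument):

```python
import sys
import depthai as dai

pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
cam.setFps(30)

enc = pipeline.create(dai.node.VideoEncoder)
enc.setDefaultProfilePreset(30, dai.VideoEncoderProperties.Profile.H264_MAIN)
cam.video.link(enc.input)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("enc")
enc.bitstream.link(xout.input)

# Connect to a specific PoE device by IP, read a few seconds of frames, exit
device_info = dai.DeviceInfo(sys.argv[1])
with dai.Device(pipeline, device_info) as device:
    q = device.getOutputQueue("enc", maxSize=30, blocking=True)
    for _ in range(300):
        q.get()
```

Starting and stopping this repeatedly should show whether the device stays reachable between runs.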
@TobyUllrich could you try flashing the latest bootloader version (0.26) and using the latest depthai (2.23)?
@TobyUllrich if you could also send over the calibration_dump results (examples/calibration), that'd be great.
@TobyUllrich can I know when these were purchased?
@Erol444 I flashed one of the cameras experiencing the issue with bootloader 0.26 and updated our code to use version 2.23 of the DepthAI SDK; the update was done using the latest depthai Docker image. But I am seeing the same behavior, as in: the camera will not come back onto the network after a few software restarts and needs to be power-cycled to be pingable.
@themarpe I think this is what you want? cali_dump.txt
@saching13 The units were bought in April 2023 from the reseller wimood.
Have there been any updates on fixing this issue or replacing the non-performing cameras?
@TobyUllrich we'll prepare a branch with a potential fix in a bit. Will get back to you once ready
@TobyUllrich Please give the following branch a test: powercycle_fix_imu, and let us know if it resolves the issue of cameras not coming back online.
@themarpe I tried the branch with depthai version 2.23.0.0.dev0+8b640ae041679d261043e37d4aa22b4c4c38b2df and bootloader 0.0.26, but I observed the same error behavior: the camera is not pingable after a few restarts of the software.
@TobyUllrich interesting - we might be hitting some different bug here. Do you mind creating an MRE that you can share which reproduces the issue? (The one you mentioned above looks close, but isn't the full one, correct?)
@themarpe I am using the above example with the -ip parameter. If you want to test with the RTSP server, bluenviron/mediamtx, it runs in Docker with network mode host.
@TobyUllrich so the above code, running as-is, crashes after some time, and the device does not come back online afterwards? Thanks, we'll try to reproduce this on our end as well.
@themarpe We run the above code inside a Docker container based on the DepthAI Docker image (tests were also done on a bare-metal system). Once it starts, it runs without crashing, but when we stop the program and try to restart it, the device is not pingable (this happens after somewhere between 10 and 50 cycles, sometimes fewer). So we start the program -> run it for some time -> stop the program -> try to start it again, but the device is not pingable (not recoverable).
The reason we need to restart the program is that the cameras are used on remote mobile inspection robots, and we stop the software to reduce compute load.
@TobyUllrich I see - how do you stop the program? If you kill it, that can cause issues. Try sending a signal (SIGINT) to shut down the Python script gracefully, so that the device destructor runs as well.
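A sketch of that graceful-shutdown suggestion, where the handler only sets a flag and the `with` block performs the actual teardown (the flag name and sleep loop are illustrative placeholders):

```python
import signal
import time
import depthai as dai

stop_requested = False

def request_stop(signum, frame):
    # Only set a flag; let the main loop and the `with` block do the teardown
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, request_stop)
signal.signal(signal.SIGTERM, request_stop)

pipeline = dai.Pipeline()  # placeholder: build the real pipeline here

with dai.Device(pipeline) as device:
    while not stop_requested:
        # real code would use queue.tryGet() here so the loop stays responsive
        time.sleep(0.1)
# leaving the `with` block closes the device, as the destructor would
```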
Stepping in for a reality check. @themarpe, the method by which the host program stops is irrelevant.
The host program may stop or be lost through a graceful exit, a data center going offline, or an asteroid hitting the host server.
In all lost-host scenarios 🧚☄️, the remote OAK PoE device should itself be reliable: it should notice the loss of the host program, do its own internal cleanup, and ready itself for a new connection. You want a fleet of OAK devices? Then your fleet needs to survive a customer's entire datacenter suddenly going offline (e.g. AWS us-east-1), the customer failing over to another geo-located datacenter (e.g. AWS us-west-1), and the fleet of OAKs being ready for the new incoming host connections.
Sudden, random, and repeated killing of host programs should already be part of regular CI testing of OAK devices. I have reported multiple bugs (some still open) where OAK devices fail on repeated start/stop/death. This is a weak area that needs some focus.
Totally agreed @diablodale - it was meant more as a band-aid for the time being. We are investigating this further and trying to replicate it, as it seems to be different from the "IMU induced powercycle" issue of past devices (which was only partially resolved, given that the improved fix above didn't resolve matters).
@TobyUllrich and just to be sure, do you mind running the program on that temporary branch with the DEPTHAI_LEVEL=debug env var set, to confirm that the library version is indeed 8b640ae041679d261043e37d4aa22b4c4c38b2df? That would really pin down this being a separate issue.
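For a quick sanity check independent of the debug logs, the installed wheel's version string (which embeds the commit hash for dev builds) can also be printed directly:

```python
import depthai as dai

# Dev builds report e.g. `2.23.0.0.dev0+8b640ae...`; the part after `+` is the commit
print(dai.__version__)
```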
@themarpe Yes, I can confirm I am running version 8b640ae041679d261043e37d4aa22b4c4c38b2df.
@themarpe Do you have any new branch we should test?
@themarpe Has there been any progress? If not, I think we need to start replacing the defective devices.
@TobyUllrich - we've been a tad busy this week, so we haven't pinned this down yet. We were able to reproduce it, however. We are targeting having much more information by end of Monday.
We can also arrange for a replacement: reach out to support@luxonis.com (CC @Erol444) to have those devices replaced from the office, so that we can also analyze them locally.
@TobyUllrich et al
We've identified a likely root cause for the powercycle issues on PoE devices: https://github.com/luxonis/depthai-core/pull/1020
The PR is WIP, but it boils down to updating the (main/factory) bootloader. The release is intended to go out EOW.
@themarpe Thank you very much. Do you have any instructions for updating the bootloader? And what will the release version number be?
@TobyUllrich
The release will be part of DepthAI v2.26. If you want to test it now, you can do so on that branch and flash the bootloader as follows.
depthai needs to be updated to 2.25.1.0.dev0+8c66c534eba51e8d1c1701bb9a9c2fef73261423, either via (note: the below is in the depthai-python repo for ease of use):

```
git fetch
git checkout bootup_stability_fixes
python3 examples/install_requirements.py
```

or by installing directly with:

```
python3 -m pip install --extra-index-url https://artifacts.luxonis.com/artifactory/luxonis-python-snapshot-local/ depthai==2.25.1.0.dev0+8c66c534eba51e8d1c1701bb9a9c2fef73261423
```

Then the factory bootloader needs to be upgraded (importantly, flashing the user bootloader only is not enough):

```
python3 examples/bootloader/flash_bootloader.py network
```

This will flash version 0.0.27+5fb331f993adceeeda72202c233a9e3939ab3dab - currently a development version; v2.26 will contain a full bootloader v0.0.28 release.
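To double-check what ended up on a device after flashing, the bootloader version can be read back; a small sketch along the lines of the bundled bootloader examples:

```python
import depthai as dai

# Find the first device visible to the bootloader tooling
found, info = dai.DeviceBootloader.getFirstAvailableDevice()
if not found:
    raise RuntimeError("No device found")

bl = dai.DeviceBootloader(info)
print(f"Bootloader version: {bl.getVersion()}")
```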
Thanks for the update.
I caution Luxonis that my PoE customers are choosing Orbbec's Femto Mega instead of OAK because it has delivered reliable PoE, excellent ToF depth perception, and high frame rates. The RVC2 PoE hardware design flaws documented in previous issues I opened, and which Brandon acknowledged, have to date prevented my customers from choosing RVC2 PoE OAK.
Thanks for the insights @diablodale - we are continuing to improve the PoE lineup and fix the issues, especially on the stability front, so thanks for bringing this up.
For ToF applications, we are completing the following model, which will provide both ToF and stereo capabilities: https://shop.luxonis.com/products/oak-d-sr-poe
Release https://github.com/luxonis/depthai-core/releases/tag/v2.26.0 mentions PoE work. However, the release doc doesn't mention that a factory firmware burn needs to occur, as described above. What is the procedure for v2.26?
@diablodale

> - Improved PoE stability on reboots - eliminate the case where powercycle of the device was sometimes needed
>   - NOTE: This requires flashing the factory bootloader - by running the flash script or using the device manager

It is mentioned in the release notes, under the above segment. One can use either the device manager or the flash_bootloader script to flash the updated factory bootloader.
I was wrong. 🤦‍♂️ Uff, sorry for the spam.
This is possibly related to #415, but I've been unable to run @diablodale's repro code so far.
I use depthai-python to demonstrate below, but similar issues manifest when using C++ from depthai-ros-examples.

Problem

Our PoE device frequently goes offline entirely, becoming unresponsive to pings and unable to be found by depthai. This generally seems to occur when restarting depthai processes on the host machine. Killing the process less-than-gracefully increases the frequency of the issue, so perhaps it's related to whatever teardown mechanism is built in.
Additionally, it often seems to hang after device discovery until the attempt is retried. It remains pingable in these cases.
Config
depthai-python/examples/IMU/imu_firmware_update.py
Reproducing
Run a pipeline such as this one, wait for it to spin up, send it SIGKILL, and repeat.
After a number of iterations, the device will no longer be found and will not respond to pings.
Here's some code to make that process slightly more scientific. For me, it triggers the issue after about 5-10 iterations.
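(The snippet itself didn't survive the formatting here; below is a hypothetical reconstruction of such a kill-and-restart loop, with `pipeline_script.py` standing in for the pipeline under test.)

```python
import subprocess
import time
import depthai as dai

for i in range(50):
    # Start the pipeline script, let it spin up, then kill it ungracefully
    proc = subprocess.Popen(["python3", "pipeline_script.py"])
    time.sleep(10)
    proc.kill()  # SIGKILL on POSIX -- deliberately not graceful
    proc.wait()

    # Check whether the device is still discoverable on the network
    devices = dai.Device.getAllAvailableDevices()
    print(f"Iteration {i}: {len(devices)} device(s) found")
    if not devices:
        print("Device gone -- power cycle required")
        break
```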
Running the above most recently resulted in:

```
Resources - Archive 'depthai-device-fwp-1cfa51d59471f57c43cfef2b69205c227c272958.tar.xz' open
Searching for device: DeviceInfo(name=, mxid=18443010017EF50800, X_LINK_BOOTED, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS)
Searching for device: " "
Searching for device: " "
Searching for device: " "
RuntimeError: No available devices
```

I can no longer ping the device. Full output attached: oakd_poe_test_killed_pipeline.txt
I get similar behavior and ultimately the same outcome every time I run the above.