dhartness commented 1 year ago

After running for several days in a script that utilizes six OAK POE devices, five OAK-1 and one OAK-D, one of them threw the following error and crashed the entire script:

` Stack trace (most recent call last) in thread 8936:

6 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

5 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f92638c1b, in

4 Object "/lib/aarch64-linux-gnu/libpthread.so.0", at 0x7f92817647, in

3 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887e104b, in

2 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f886c587b, in

1 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887282c7, in dai::StreamMessageParser::parseMessageToADatatype(streamPacketDesc_t*)

0 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f925eb64c, in

Segmentation fault (Invalid permissions for mapped object [0x7f7489e000]) Segmentation fault hviz@hvizpi18:~/HVision $ sudo screendump 1 > term001.log`

At this point it has happened twice. We've normally observed the system, with a mix of up to ten cameras, running for over a week. Because of the infrequency we are unable to reproduce and it doesn't indicate a line number so we're unable to verify its not happening outside our try/except block.

I don't see any thrown exceptions on screen from the time the script was running and none were caught in the exception block.

Even if a camera has crashed, normally we catch it in an exception block and then reload the camera, but this bug has killed the entire script, dumping us back to the CLI, and I'm not sure how to determine which camera has caused it.

dthompson_mre_03202023.zip log_system_information.zip

This is the run() loop the camera uses: `while not self.stopped: singletimecall = time.perf_counter() if((singletimecall - checkforvaluechange) > 14): if self.rm.getmyneedtoreload(self.camindex): checkforvaluechange = singletimecall self.reloadcam() #############################

###################### self.trackletsData = None try: while self.needtoreinitialize: self.reloadcam() ################################################### imgFrame = self.preview.get() inDet = self.qDet.get() if imgFrame is not None: frame = imgFrame.getCvFrame() if(len(frametimestamps) > 1): dasfps = round((len(frametimestamps)/(frametimestamps[-1]-frametimestamps[0])),1) if (self.icountitems == 2) and (track is not None): self.trackletsData = track.tracklets if inDet is not None: detections = inDet.detections self.crop_queue.append((detections, frame, self.trackletsData, dasfps)) frametimestamps.append(time.perf_counter()) ################################################### self.recoveryattempts = 0 self.myhealth[self.camindex-1] = True self.myhealth[self.camindex-1] = self.healthychild[0] except Exception as exception: self.myhealth[self.camindex-1] = False exc_type, exc_obj, exc_tb = sys.exc_info() self.oaklogging.error(""+self.camlogident+"An error has occurred around the Oak Device while active: "+str(exc_type)+" "+str(exc_tb.tb_lineno)+"") self.oaklogging.error(""+self.camlogident+"Exception: {}".format(type(exception).name)+"") self.oaklogging.error(""+self.camlogident+"Exception message: {}".format(exception)+"") self.device.close() self.recoveryattempts += 1 self.needtoreinitialize = True`

themarpe commented 1 year ago

Hi @dhartness

Thanks for the report - I have some suspicions as to what might cause this, but nothing stands out as "the obvious" thing.

The issue stems from the native library and is therefore "uncatchable" from Python side.

I think it'd be best if you set the creating of coredump files & share that after this crash.

echo 'core' | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
# Then execute the script

If you see the above crash you should get a core.[id] file - we'll be able to analyze that, to see into what state the system got for it to crash in such manner.

dhartness commented 1 year ago

Will do. Thanks for the directive.

dhartness commented 1 year ago

I'm going to add that command to the crontab. It finally occurred again but apparently someone had undone my changes.

diablodale commented 1 year ago

@themarpe, I'm looking around in the Queue code area due to another issue.

parseMessageToADatatype() does do some pointer things without validating them. It assumes the earlier readMove() in the Queue loop throws when XLink has an internal error. However, if XLink is "X_LINK_SUCCESS" yet XLink didn't populate a valid StreamPacketDesc...then parseMessageToADatatype() blindly uses them and could segfault. I could imagine a scenario of an OAK device creating a StreamPacketDesc_t with unexpected/bad values and then that invalid struct is sent to the host and then passed to parseMessageToADatatype() that blindly uses it.

I think validating a StreamPacketDesc is limited. Code could check for data != null and length > 8. But its not possible to validate that the memory block to which data points is actually length size--that needs to be done/checked within the XLink code itself. I don't (yet) see in code any checks that the serializedObjectSize is actually the size needed for each of the parseDatatype<T> in the switch/case. They usually call utility::deserialize() and then nlohmann::json::parse(raw pointers). So if those pointers are invalid/out-of-bounds, it could segfault.

Google search suggests that the specific segfault of Invalid permissions for mapped object is often associated with trying to manipulate const values (i.e. trying to write to read-only memory). But perhaps that segfault message is misleading. 🤷‍♀️

Here's example of a StreamPacketDesc with ::data=null and ::length=0. It causes pointer wraparound and then a segfault. https://godbolt.org/z/eYhncd4Gq

themarpe commented 1 year ago

I agree with the analysis - if packet was sent through but is not validly constructed (or somehow corrupted in flight), the this could occur.

I'll dive into it and make it more robust + add some tests to go along.

Was checking this a little when it was first reported, but i assume it must fall down to a packet being garbled but still successfully coming through stack.

diablodale commented 1 year ago

Loop me in if you have ideas/code on which you want me to provide feedback.

I think you said to me that the XLink for PoE devices uses TCP. If that's true, then it isn't realistically possible to corrupt in the OS/hardware networking section of the stack. And I don't think a cosmic ray bit-flip would happen twice to the same customer. I'm suspicious of device-side code for both xlink and the layers on top of it. Perhaps, something rare, an unexpected camera/data value which the firmware's json/pack code then doesn't notice or transform correctly which leads to an invalid packet.

If packet validation tests can be put in depthai-core code, I see at least two approaches:

throw: Leads to shut down of the Queue loop and eventually disconnect. Aggressive yet safe. Customers are responsible for handling; which the OP writes they do.
drop: Is this valid? Can any/all packets of an ADatatype derived type be dropped? Looking through the list of the derived in the switch/case...it seems like they can be dropped. If yes, tests could be put in parseMessageToADatatype(), test fails return an empty shared_ptr, and in the Queue loop if packet.empty() { log; continue; }.

A fuzz test would be good here. Send random values within a StreamPacketDesc from a device. Let device-side xlink package it and send to host and wait for the crashes. And then keep adding as many checks as possible.

themarpe commented 1 year ago

@diablodale

I think its highly likely this event is tied with a device side crash or some sort of error.

Due to this, I think sanest way would be to throw as do other functions when there is a comms exception. (much better than crashing host side)

A fuzz test would be good here. Send random values within a StreamPacketDesc from a device. Let device-side xlink package it and send to host and wait for the crashes. And then keep adding as many checks as possible.

I agree - was thinking of adding something with Catch. Will open a PR if/when time permits

dhartness commented 1 year ago

Hi,

I've had this issue occur again but I don't see a core.[id] file having been created.

I have the following two lines in my active crontab file:

# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
@reboot echo 'core' | sudo tee /proc/sys/kernel/core_pattern
@reboot ulimit -c unlimited

Is there something I'm missing that should cause this to capture the data from the crash?

The system was last rebooted on the 5th before I left for the day.

hviz@hvizpi18:~ $ cd HVision/
hviz@hvizpi18:~/HVision $ sudo crontab -e
No modification made
hviz@hvizpi18:~/HVision $ ls -l /proc/sys/kernel/c*
-rw------- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/cad_pid
-r--r--r-- 1 root root 0 Jun  5 16:56 /proc/sys/kernel/cap_last_cap
-rw-r--r-- 1 root root 0 Jun  5 16:56 /proc/sys/kernel/core_pattern
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_pipe_limit
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_uses_pid
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/ctrl-alt-del
hviz@hvizpi18:~/HVision $

I tried the following 'find' command.

hviz@hvizpi18:~/HVision $ sudo find / -name core.*
/usr/share/man/man5/core.5.gz
/usr/lib/python3/dist-packages/numpy/ma/core.py
/usr/lib/python3/dist-packages/numpy/ma/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/numpy/distutils/core.py
/usr/lib/python3/dist-packages/numpy/distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/setuptools/_distutils/core.py
/usr/lib/python3/dist-packages/setuptools/_distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/pidng/core.py
/usr/lib/python3/dist-packages/pidng/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/certifi/core.py
/usr/lib/python3/dist-packages/certifi/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/idna/core.py
/usr/lib/python3/dist-packages/idna/__pycache__/core.cpython-39.pyc
/usr/lib/python3.9/distutils/core.py
/usr/lib/python3.9/distutils/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/__pycache__/core.cpython-39.pyc
hviz@hvizpi18:~/HVision $

Is there another log this information may be contained in? I can see from my application's log that it stopped just after Jun 10th a 01:11. I'm going through the var/log files currently but nothing currently stands out.

themarpe commented 1 year ago

@dhartness I suggest you check latest develop or switch to v2.22.0 release once its out. From there we can check if it resolves the issue

luxonis / depthai-core

[BUG] StreamMessageParser::parseMessageToADatatype avoids exception and crashes script #779

6 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

5 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f92638c1b, in

4 Object "/lib/aarch64-linux-gnu/libpthread.so.0", at 0x7f92817647, in

3 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887e104b, in

2 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f886c587b, in

1 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887282c7, in dai::StreamMessageParser::parseMessageToADatatype(streamPacketDesc_t*)

0 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f925eb64c, in