dhartness opened 1 year ago
Hi @dhartness
Thanks for the report - I have some suspicions as to what might cause this, but nothing stands out as "the obvious" thing.
The issue stems from the native library and is therefore "uncatchable" from the Python side.
I think it'd be best if you enable the creation of core dump files and share one after the next crash:
echo 'core' | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
# Then execute the script
If you see the above crash again you should get a `core.[id]` file - we'll be able to analyze it to see what state the system got into for it to crash in this manner.
Will do. Thanks for the direction.
I'm going to add those commands to the crontab. It finally occurred again, but apparently someone had undone my changes.
@themarpe, I'm looking around in the Queue code area due to another issue. `parseMessageToADatatype()` does do some pointer things without validating them. It assumes the earlier `readMove()` in the Queue loop throws when XLink has an internal error. However, if XLink returns `X_LINK_SUCCESS` yet didn't populate a valid `StreamPacketDesc`, then `parseMessageToADatatype()` blindly uses it and could segfault. I can imagine a scenario where an OAK device creates a `StreamPacketDesc_t` with unexpected/bad values, that invalid struct is sent to the host, and it is then passed to `parseMessageToADatatype()`, which blindly uses it.
I think validating a `StreamPacketDesc` is limited. Code could check for `data != null` and `length > 8`. But it's not possible to validate that the memory block to which `data` points is actually `length` bytes--that needs to be done/checked within the XLink code itself. I don't (yet) see in the code any checks that `serializedObjectSize` is actually the size needed for each of the `parseDatatype<T>` calls in the switch/case. They usually call `utility::deserialize()` and then `nlohmann::json::parse(raw pointers)`. So if those pointers are invalid/out-of-bounds, it could segfault.
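To make the cheap host-side checks above concrete, here is a minimal sketch. The struct layout and all names are my assumptions for illustration, not depthai-core's actual definitions; and as noted, no host-side check can verify that `data` really points at `length` readable bytes.

```cpp
#include <cstdint>

// Hypothetical stand-in for XLink's streamPacketDesc_t; field names assumed.
struct StreamPacketDesc {
    std::uint8_t* data;   // pointer to the packet payload
    std::uint32_t length; // claimed payload size in bytes
};

// The cheap checks a parser could run before dereferencing anything:
// non-null data and a length large enough to hold the trailing
// serialized-object-size field.
inline bool packetLooksPlausible(const StreamPacketDesc* packet) {
    if (packet == nullptr || packet->data == nullptr) return false;
    if (packet->length <= 8) return false;  // too short to carry metadata
    return true;
}
```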
A Google search suggests that the specific segfault `Invalid permissions for mapped object` is often associated with trying to write to read-only memory (e.g. trying to modify const values). But perhaps that segfault message is misleading. 🤷♀️
Here's an example of a `StreamPacketDesc` with `::data = null` and `::length = 0`. It causes pointer wraparound and then a segfault:
https://godbolt.org/z/eYhncd4Gq
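For readers who don't follow the godbolt link, the failure mode is plain unsigned arithmetic. This is a sketch of the shape of the problem; the exact offset math is my assumption, not the actual depthai-core code:

```cpp
#include <cstddef>
#include <cstdint>

// The parser reads a serialized-object size from the tail of the packet,
// so it computes offsets like `length - fieldSize`. With length == 0 the
// unsigned subtraction wraps around to a huge value, and any pointer built
// from it lands far outside the buffer.
inline std::size_t metadataOffset(std::uint32_t length) {
    return static_cast<std::size_t>(length) - 8;  // unchecked: wraps when length < 8
}
```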
I agree with the analysis - if a packet was sent through but was not validly constructed (or was somehow corrupted in flight), then this could occur.
I'll dive into it and make it more robust + add some tests to go along.
I was checking this a little when it was first reported, but I assume it comes down to a packet being garbled yet still successfully coming through the stack.
Loop me in if you have ideas/code on which you want me to provide feedback.
I think you said to me that XLink for PoE devices uses TCP. If that's true, then corruption in the OS/hardware networking section of the stack isn't realistically possible. And I don't think a cosmic-ray bit-flip would happen twice to the same customer. I'm suspicious of device-side code, both XLink and the layers on top of it. Perhaps something rare: an unexpected camera/data value which the firmware's json/pack code doesn't notice or transform correctly, leading to an invalid packet.
If packet validation tests can be put in the depthai-core code, I see at least two approaches:
1. Can an invalid `ADatatype`-derived packet be dropped? Looking through the list of the derived types in the switch/case...it seems like they can be dropped. If yes, tests could be put in `parseMessageToADatatype()`, failed tests return an empty `shared_ptr`, and in the Queue loop `if packet.empty() { log; continue; }`.
2. A fuzz test would be good here. Send random values within a `StreamPacketDesc` from a device. Let device-side XLink package it and send it to the host, and wait for the crashes. Then keep adding as many checks as possible.
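The drop-an-invalid-packet idea could look something like the following sketch. Every name here is an assumption for illustration (depthai-core's real `parseMessageToADatatype` signature and Queue loop differ); it only shows the "empty shared_ptr means log and continue" pattern:

```cpp
#include <cstdio>
#include <memory>

struct ADatatype {};  // stand-in for the real message base class

// Hypothetical parser: returns an empty shared_ptr when validation fails
// instead of dereferencing a suspect packet.
std::shared_ptr<ADatatype> parseOrDrop(bool packetIsValid) {
    if (!packetIsValid) return nullptr;    // validation failed: drop
    return std::make_shared<ADatatype>();  // normal parse path
}

// Sketch of the Queue-loop guard: log and skip rather than crash.
int drainQueue(const bool* validity, int count) {
    int delivered = 0;
    for (int i = 0; i < count; ++i) {
        auto msg = parseOrDrop(validity[i]);
        if (!msg) {
            std::fprintf(stderr, "dropping malformed packet %d\n", i);
            continue;  // skip instead of segfaulting
        }
        ++delivered;  // hand msg to consumers here
    }
    return delivered;
}
```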
@diablodale
I think it's highly likely this event is tied to a device-side crash or some sort of error.
Because of this, I think the sanest way would be to throw, as other functions do when there is a comms exception (much better than crashing host side).
> A fuzz test would be good here. Send random values within a StreamPacketDesc from a device. Let device-side xlink package it and send to host and wait for the crashes. And then keep adding as many checks as possible.
I agree - I was thinking of adding something with Catch. Will open a PR if/when time permits.
Hi,
I've had this issue occur again, but I don't see that a `core.[id]` file was created.
I have the following two lines in my active crontab file:
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
@reboot echo 'core' | sudo tee /proc/sys/kernel/core_pattern
@reboot ulimit -c unlimited
Is there something I'm missing that should cause this to capture the data from the crash?
The system was last rebooted on the 5th before I left for the day.
hviz@hvizpi18:~ $ cd HVision/
hviz@hvizpi18:~/HVision $ sudo crontab -e
No modification made
hviz@hvizpi18:~/HVision $ ls -l /proc/sys/kernel/c*
-rw------- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/cad_pid
-r--r--r-- 1 root root 0 Jun 5 16:56 /proc/sys/kernel/cap_last_cap
-rw-r--r-- 1 root root 0 Jun 5 16:56 /proc/sys/kernel/core_pattern
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_pipe_limit
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_uses_pid
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/ctrl-alt-del
hviz@hvizpi18:~/HVision $
I tried the following 'find' command.
hviz@hvizpi18:~/HVision $ sudo find / -name core.*
/usr/share/man/man5/core.5.gz
/usr/lib/python3/dist-packages/numpy/ma/core.py
/usr/lib/python3/dist-packages/numpy/ma/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/numpy/distutils/core.py
/usr/lib/python3/dist-packages/numpy/distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/setuptools/_distutils/core.py
/usr/lib/python3/dist-packages/setuptools/_distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/pidng/core.py
/usr/lib/python3/dist-packages/pidng/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/certifi/core.py
/usr/lib/python3/dist-packages/certifi/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/idna/core.py
/usr/lib/python3/dist-packages/idna/__pycache__/core.cpython-39.pyc
/usr/lib/python3.9/distutils/core.py
/usr/lib/python3.9/distutils/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/__pycache__/core.cpython-39.pyc
hviz@hvizpi18:~/HVision $
Is there another log this information may be contained in? I can see from my application's log that it stopped just after 01:11 on Jun 10th. I'm going through the /var/log files currently, but nothing stands out so far.
@dhartness I suggest you check the latest `develop` or switch to the v2.22.0 release once it's out. From there we can check whether it resolves the issue.
After running for several days in a script that utilizes six OAK PoE devices (five OAK-1 and one OAK-D), one of them threw the following error and crashed the entire script:
```
Stack trace (most recent call last) in thread 8936:
6   Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
5   Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f92638c1b, in
4   Object "/lib/aarch64-linux-gnu/libpthread.so.0", at 0x7f92817647, in
3   Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887e104b, in
2   Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f886c587b, in
1   Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887282c7, in dai::StreamMessageParser::parseMessageToADatatype(streamPacketDesc_t*)
0   Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f925eb64c, in
Segmentation fault (Invalid permissions for mapped object [0x7f7489e000])
Segmentation fault
hviz@hvizpi18:~/HVision $ sudo screendump 1 > term001.log
```
At this point it has happened twice. We've normally observed the system, with a mix of up to ten cameras, running for over a week. Because of the infrequency we are unable to reproduce it, and it doesn't indicate a line number, so we're unable to verify it's not happening outside our try/except block.
I don't see any thrown exceptions on screen from the time the script was running and none were caught in the exception block.
Even if a camera has crashed, normally we catch it in an exception block and then reload the camera, but this bug has killed the entire script, dumping us back to the CLI, and I'm not sure how to determine which camera caused it.
dthompson_mre_03202023.zip log_system_information.zip
This is the run() loop the camera uses:
```python
while not self.stopped:
    singletimecall = time.perf_counter()
    if((singletimecall - checkforvaluechange) > 14):
        if self.rm.getmyneedtoreload(self.camindex):
            checkforvaluechange = singletimecall
            self.reloadcam()
    #############################
    ######################
    self.trackletsData = None
    try:
        while self.needtoreinitialize:
            self.reloadcam()
        ###################################################
        imgFrame = self.preview.get()
        inDet = self.qDet.get()
        if imgFrame is not None:
            frame = imgFrame.getCvFrame()
            if(len(frametimestamps) > 1):
                dasfps = round((len(frametimestamps)/(frametimestamps[-1]-frametimestamps[0])),1)
            if (self.icountitems == 2) and (track is not None):
                self.trackletsData = track.tracklets
            if inDet is not None:
                detections = inDet.detections
            self.crop_queue.append((detections, frame, self.trackletsData, dasfps))
            frametimestamps.append(time.perf_counter())
        ###################################################
        self.recoveryattempts = 0
        self.myhealth[self.camindex-1] = True
        self.myhealth[self.camindex-1] = self.healthychild[0]
    except Exception as exception:
        self.myhealth[self.camindex-1] = False
        exc_type, exc_obj, exc_tb = sys.exc_info()
        self.oaklogging.error(""+self.camlogident+"An error has occurred around the Oak Device while active: "+str(exc_type)+" "+str(exc_tb.tb_lineno)+"")
        self.oaklogging.error(""+self.camlogident+"Exception: {}".format(type(exception).__name__)+"")
        self.oaklogging.error(""+self.camlogident+"Exception message: {}".format(exception)+"")
        self.device.close()
        self.recoveryattempts += 1
        self.needtoreinitialize = True
```