marieai / marie-ai

Integrate AI-powered Document Analysis Pipelines
MIT License

segfault cv2.abi3.so #92

Open gregbugaj opened 11 months ago

gregbugaj commented 11 months ago

Describe the bug

Getting a segfault while running the marie server. This is problematic because it leaves behind a defunct (zombie) process: the kernel leaves a task stuck in the uninterruptible "D" state, and a task/process in that state cannot be killed, even with kill -9.
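To confirm which processes are stuck, the state field of /proc/<pid>/stat can be scanned for "D". The following is a minimal sketch, not part of marie; the function names are hypothetical, and it only inspects thread-group leaders under /proc, not individual threads under /proc/<pid>/task.

```python
import os


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def uninterruptible_tasks(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for processes whose state is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A short-lived "D" state is normal during disk or device I/O; only a task that stays in that state across repeated runs matches the hung-task report.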

log output from dmesg

[227392.162828] perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[240010.209731] marie[111619]: segfault at 7f2987800b00 ip 00007f30a0f7895b sp 00007f2a655f71d0 error 4 in cv2.abi3.so[7f30a0745000+2f4f000] likely on CPU 10 (core 20, socket 0)
[240010.209740] Code: 48 63 4d 00 48 8b 7c 24 08 89 da 44 8d 43 01 49 03 7f 28 48 8b 47 18 48 8b b7 d0 00 00 00 85 c9 0f 8e a1 16 00 00 4d 8b 4f 18 <45> 8b 34 89 48 8b 4c 24 20 80 3c 19 00 0f 85 52 fe ff ff c7 45 00
[240215.295317] INFO: task marie:110689 blocked for more than 120 seconds.
[240215.295324]       Tainted: P           OE      6.2.0-37-generic #38~22.04.1-Ubuntu
[240215.295326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[240215.295328] task:marie           state:D stack:0     pid:110689 ppid:110658 flags:0x00000002
[240215.295331] Call Trace:
[240215.295332]  <TASK>
[240215.295335]  __schedule+0x2b7/0x5f0
[240215.295339]  schedule+0x68/0x110
[240215.295341]  do_exit+0xf3/0x6c0
[240215.295343]  do_group_exit+0x35/0x90
[240215.295347]  get_signal+0x8a5/0x8d0
[240215.295349]  ? __f_unlock_pos+0x12/0x20
[240215.295352]  arch_do_signal_or_restart+0x2a/0x120
[240215.295355]  ? exit_to_user_mode_prepare+0x3b/0xd0
[240215.295357]  exit_to_user_mode_loop+0xaf/0x140
[240215.295358]  exit_to_user_mode_prepare+0xb9/0xd0
[240215.295359]  irqentry_exit_to_user_mode+0x9/0x20
[240215.295361]  irqentry_exit+0x43/0x50
[240215.295363]  sysvec_reschedule_ipi+0x7b/0x120
[240215.295365]  asm_sysvec_reschedule_ipi+0x1b/0x20
[240215.295367] RIP: 0033:0x5634e4bfe3d3
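The segfault line in the dmesg output above can be decoded a little further. On x86, the error value is a bit mask (bit 0: protection violation rather than a not-present page, bit 1: write access, bit 2: user mode), and the instruction pointer minus the mapping start gives an offset inside cv2.abi3.so. The sketch below assumes the x86 kernel message format shown in this log; `decode_segfault` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   marie[111619]: segfault at <addr> ip <ip> sp <sp> error <err> in <lib>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line: str) -> dict:
    """Extract the faulting library offset and decode the x86 error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    ip, base = int(m["ip"], 16), int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "lib": m["lib"],
        "fault_addr": int(m["addr"], 16),
        "offset_in_mapping": ip - base,  # relative to the reported mapping start
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    line = (
        "[240010.209731] marie[111619]: segfault at 7f2987800b00 "
        "ip 00007f30a0f7895b sp 00007f2a655f71d0 error 4 in "
        "cv2.abi3.so[7f30a0745000+2f4f000] likely on CPU 10 (core 20, socket 0)"
    )
    info = decode_segfault(line)
    print(hex(info["offset_in_mapping"]), info["access"], info["mode"])
```

For the line in this report, that is a user-mode read of an unmapped address, at offset 0x83395b from the start of the reported mapping. Turning that into a file offset or a symbol would additionally need the mapping's file offset from /proc/<pid>/maps and a build of the library that has symbols.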

Describe how you solve it


Environment

pip versions of OpenCV

marie# pip list | grep opencv
opencv-python                                4.8.1.78
opencv-python-headless                       4.8.1.78
root@asp-gpu032:/marie# marie --version-full
UserWarning: multiprocessing start method is set to `fork` (raised from /opt/venv/lib/python3.10/site-packages/marie/__init__.py:75)
- marie 3.0.22
- docarray 0.39.1
- jcloud not-available
- jina-hubble-sdk v0.0.0
- marie-proto 0.1.27
- protobuf 3.20.2
- proto-backend cpp
- grpcio 1.47.5
- pyyaml 6.0.1
- python 3.10.12
- platform Linux
- platform-release 6.2.0-37-generic
- platform-version #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2
- architecture x86_64
- processor x86_64
- uid 88974383001391
- session-id 7017dca0-8e09-11ee-a1e0-50ebf67e2f2f
- uptime 2023-11-28T16:16:08.016883
- ci-vendor (unset)
- internal False
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL DEBUG
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD fork
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_LOCKS_ROOT (unset)
* JINA_K8S_ACCESS_MODES (unset)
* JINA_K8S_STORAGE_CLASS_NAME (unset)
* JINA_K8S_STORAGE_CAPACITY (unset)
* JINA_STREAMER_ARGS (unset)
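The pip list output above shows both opencv-python and opencv-python-headless installed. Both distributions provide the same cv2 module, and the opencv-python project advises installing only one of its variants per environment. Whether that is related to this segfault is not established here; the sketch below only detects the situation, using a hypothetical helper name.

```python
from importlib import metadata

# The four OpenCV wheel variants published on PyPI; all ship the cv2 module.
OPENCV_DISTS = (
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "opencv-contrib-python-headless",
)


def installed_opencv_dists(names=OPENCV_DISTS, version_of=metadata.version) -> dict:
    """Return {distribution name: version} for every OpenCV variant installed."""
    found = {}
    for name in names:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            continue  # this variant is not installed
    return found


if __name__ == "__main__":
    dists = installed_opencv_dists()
    print(dists)
    if len(dists) > 1:
        print("more than one OpenCV distribution provides cv2 in this environment")
```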

This could possibly be related to an error seen in the logs:

gbugaj@asp-gpu032:~$ docker logs marieai-dev-server-corr  | grep 'Exception' -A 10 | head
Exception ignored when trying to write to the signal wakeup fd:
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 115, in _read_from_self
    data = self._ssock.recv(4096)
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 115, in _read_from_self
    data = self._ssock.recv(4096)
BlockingIOError: [Errno 11] Resource temporarily unavailable
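To tie the native crash in cv2.abi3.so back to the Python call that triggered it, the standard-library faulthandler module can dump the Python stack of every thread when a SIGSEGV arrives. This is a minimal sketch of one way to turn it on; `enable_crash_tracebacks` and the log path are hypothetical, and where to call it during marie startup is an assumption left to the reader.

```python
import faulthandler


def enable_crash_tracebacks(log_path: str):
    """Dump Python tracebacks for all threads to log_path on a fatal signal."""
    fh = open(log_path, "a")
    faulthandler.enable(file=fh, all_threads=True)
    return fh  # keep this file object open for as long as faulthandler is enabled


if __name__ == "__main__":
    handle = enable_crash_tracebacks("/tmp/marie-faulthandler.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Setting the environment variable PYTHONFAULTHANDLER=1 before starting the interpreter enables the same handler without code changes, writing to stderr.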