status-im / status-desktop

Status Desktop client made in Nim & QML
https://status.app
Mozilla Public License 2.0

App segmentation faults on CI hosts #12816

Open jakubgs opened 10 months ago

jakubgs commented 10 months ago

Bug Report

Description

Currently all CI hosts frequently show segmentation faults for two processes: nim_status_client and QQmlThread.

admin@linux-04.he-eu-hel1.ci.devel:~ % sudo dmesg | grep segfault | tail
[2804131.159498] QQmlThread[638530]: segfault at 15 ip 00007fd3a3e92c30 sp 00007fd1727f6f20 error 4 in libQt5Qml.so.5[7fd3a3c00000+45e000]
[2804472.224156] QQmlThread[641553]: segfault at 15 ip 00007ff13c692c30 sp 00007fef0a7f6f20 error 4 in libQt5Qml.so.5[7ff13c400000+45e000]
[2805466.093454] nim_status_clie[687484]: segfault at 8 ip 00007f554fa62250 sp 00007ffebb1ee038 error 4 in libQt5Qml.so.5[7f554f800000+45e000]
[2813930.448108] nim_status_clie[779010]: segfault at 8 ip 00007f7befa62250 sp 00007fff36e49458 error 4 in libQt5Qml.so.5[7f7bef800000+45e000]
[2835137.051823] QQmlThread[1403420]: segfault at 13 ip 00007febc9892c30 sp 00007fe9a5ff5f20 error 4 in libQt5Qml.so.5[7febc9600000+45e000]
[2842776.720349] nim_status_clie[1694567]: segfault at 8 ip 00007f6ed2c62250 sp 00007ffff3456488 error 4 in libQt5Qml.so.5[7f6ed2a00000+45e000]
[2845712.707703] nim_status_clie[1734920]: segfault at 8 ip 00007fd7cf662250 sp 00007ffc1b58b808 error 4 in libQt5Qml.so.5[7fd7cf400000+45e000]
[2848523.822515] nim_status_clie[2088903]: segfault at 8 ip 00007f878e462250 sp 00007ffda6290988 error 4 in libQt5Qml.so.5[7f878e200000+45e000]
[2850619.291393] nim_status_clie[2372416]: segfault at 8 ip 00007f5883a62250 sp 00007ffcc84bfd78 error 4 in libQt5Qml.so.5[7f5883800000+45e000]
[2851205.983205] QQmlThread[2411192]: segfault at 15 ip 00007f2680892c30 sp 00007f24fdff5f20 error 4 in libQt5Qml.so.5[7f2680600000+45e000]

Steps to reproduce

Unknown, but my best guess is that this is a result of E2E test runs.
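Until the source is confirmed, the kernel log can at least be aggregated to see which binaries crash in which libraries. A minimal sketch (only the segfault line format shown above is handled; `count_segfaults` is a hypothetical helper, not part of any existing tooling):

```python
import re
from collections import Counter

# Matches kernel log lines like:
#   [2804131.159498] QQmlThread[638530]: segfault at 15 ip ... error 4 in libQt5Qml.so.5[...]
# Also works for `dmesg -T` lines, where the first bracket holds a date.
SEGFAULT_RE = re.compile(
    r"\]\s+(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at \S+ .*? in (?P<lib>[^\s\[]+)\["
)

def count_segfaults(dmesg_output: str) -> Counter:
    """Aggregate segfault counts per (process, library) pair."""
    counts: Counter = Counter()
    for line in dmesg_output.splitlines():
        m = SEGFAULT_RE.search(line)
        if m:
            counts[(m.group("comm"), m.group("lib"))] += 1
    return counts
```

Feeding it the output of `dmesg` would show at a glance that virtually all crashes land in libQt5Qml.so.5.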

jakubgs commented 10 months ago

This has been going on for a while, at least since June; I've seen these while debugging Linux builds being killed.

jakubgs commented 10 months ago

There are also general protection fault errors, though less frequent:

admin@linux-01.he-eu-hel1.ci.devel:~ % sudo dmesg | grep 'general protection fault'
[11549645.775740] traps: QQmlThread[1141503] general protection fault ip:7f7fbe492c30 sp:7f7fb3ffde90 error:0 in libQt5Qml.so.5.15.2[7f7fbe200000+45e000]
[11601779.978722] traps: QQmlThread[3166153] general protection fault ip:7fb21fa92c30 sp:7fb0337f8f20 error:0 in libQt5Qml.so.5.15.2[7fb21f800000+45e000]
jakubgs commented 10 months ago

I will attempt to audit the processes to find exactly what spawns them to confirm it's E2E tests.

Most probably using auditd or bpfcc-tools.

jakubgs commented 10 months ago

Looks like using execsnoop from bpfcc-tools doesn't cut it. For some reason I'm getting no results:

TIME     TIME(s) UID   PCOMM            PID    PPID   RET ARGS
Possibly lost 55 samples
Possibly lost 2 samples
Possibly lost 12 samples
Possibly lost 14 samples
Possibly lost 5 samples
Possibly lost 165 samples
...

I started these commands on linux-01 and linux-05 in a tmux session yesterday:

execsnoop-bpfcc -x -t -T -U -u jenkins -n nim_status_client

But I got nothing. Probably using auditd will work out better.

jakubgs commented 10 months ago

I added an audit filter using:

sudo auditctl -a task,always -F uid=jenkins

Let's see what we can find.

jakubgs commented 10 months ago

Looks like I used the wrong filter, because I got nothing (a `task,always` rule only fires when a task is created, so it never generates the execve syscall records that `ausearch -sc execve` looks for):

admin@linux-01.he-eu-hel1.ci.devel:~ % sudo ausearch -i -sc execve
<no matches>
jakubgs commented 10 months ago

These segfaults are not that frequent, one or two per day:

admin@linux-02.he-eu-hel1.ci.devel:~ % sudo dmesg -T | grep segfault
[Fri Nov 17 13:15:39 2023] QQmlThread[2932696]: segfault at 15 ip 00007f16ab692c30 sp 00007f14dd3f9f20 error 4 in libQt5Qml.so.5.15.2[7f16ab400000+45e000]
[Fri Nov 17 17:13:14 2023] nim_status_clie[772385]: segfault at 20 ip 00007f056d86dc1e sp 00007ffcd71cab80 error 4 in libgcc_s.so.1[7f056d85b000+17000]
[Mon Nov 20 15:38:25 2023] QQmlThread[1402142]: segfault at 640065038f ip 00007f37a16c9d93 sp 00007f35ba3f9ef0 error 4 in libQt5Qml.so.5.15.2[7f37a1600000+45e000]
[Tue Nov 21 14:15:34 2023] QQmlThread[49389]: segfault at 7f4a3044b214 ip 00007f4dd6092c30 sp 00007f4dcbbfde90 error 4 in libQt5Qml.so.5.15.2[7f4dd5e00000+45e000]
[Tue Nov 21 18:18:22 2023] QQmlThread[1415890]: segfault at 0 ip 0000000000000000 sp 00007f826a3f9f38 error 14
[Wed Nov 22 11:19:09 2023] QQmlThread[2686601]: segfault at 44 ip 00007f4697492c30 sp 00007f44c13f9f20 error 4 in libQt5Qml.so.5.15.2[7f4697200000+45e000]
jakubgs commented 10 months ago

All Linux CI hosts now have an audit rule for new processes:

 > ansible ci-slave-linux -o -a 'auditctl -l'     
linux-01.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) -a always,task
linux-02.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) -a always,task
linux-03.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) -a always,task
linux-04.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) -a always,task
linux-05.he-eu-hel1.ci.devel | CHANGED | rc=0 | (stdout) -a always,task
jakubgs commented 10 months ago

It looks like I can't use auditd to catch what processes are segfaulting because it absolutely murdered I/O on CI hosts:

(screenshot: disk I/O metrics spiking on the CI hosts)

So I uninstalled it.

jakubgs commented 9 months ago

Managed to catch some examples:

[Mon Nov 27 10:33:24 2023] nim_status_clie[1445035]: segfault at 8 ip 00007f923fc62250 sp 00007ffddf46f9b8 error 4 in libQt5Qml.so.5[7f923fa00000+45e000]
admin@linux-04.he-eu-hel1.ci.devel:~ % grep 1445035 snoop.log
10:33:20 76378.2541001  startaut         1445035 1445034   0 /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103320_056118 --LOG_LEVEL=DEBUG
10:33:20 76378.2551001  StatusIm-Deskto  1445035 1445034   0 tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103320_056118 --LOG_LEVEL=DEBUG
10:33:20 76378.2621001  AppRun           1445035 1445034   0 /tmp/.mount_StatushwHnNZ/AppRun -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103320_056118 --LOG_LEVEL=DEBUG
10:33:20 76378.2671001  dirname          1445041 1445035   0 /usr/bin/dirname /tmp/.mount_StatushwHnNZ/AppRun
10:33:20 76378.2711001  nim_status_clie  1445035 1445034   0 /tmp/.mount_StatushwHnNZ/usr/bin/nim_status_client -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103320_056118 --LOG_LEVEL=DEBUG

It appears this one came from the /home/jenkins/workspace/status-desktop/e2e/prs@3/ workspace.
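Cross-referencing the segfault PIDs from dmesg with the execsnoop log can be scripted instead of grepping by hand. A rough sketch, assuming the snoop.log column layout shown above (PID in the fourth whitespace-separated field); the helper names are hypothetical:

```python
import re

# PIDs of crashed processes, from lines like:
#   [...] nim_status_clie[1445035]: segfault at 8 ...
SEGFAULT_PID_RE = re.compile(r"(\S+)\[(\d+)\]: segfault")

def pids_from_dmesg(dmesg_text: str) -> set:
    """Collect PIDs of all segfaulting processes."""
    return {m.group(2) for m in SEGFAULT_PID_RE.finditer(dmesg_text)}

def matching_spawns(snoop_text: str, pids: set) -> list:
    """Return execsnoop lines whose PID column matches a crashed PID."""
    hits = []
    for line in snoop_text.splitlines():
        fields = line.split()
        # PID is the 4th column in the execsnoop output shown earlier
        if len(fields) > 3 and fields[3] in pids:
            hits.append(line)
    return hits
```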

jakubgs commented 9 months ago

It appears to be this job: https://ci.status.im/job/status-desktop/job/e2e/job/prs/634/, which failed with:

hair pipe affair cage vote estate gloom lamp robust like-user_account0] _
11:36:37  
11:36:37  aut_id = 'AUT_095742', timeout_sec = 10
11:36:37  
11:36:37      @allure.step('Attaching to "{0}"')
11:36:37      def attach(aut_id: str, timeout_sec: int = configs.timeouts.PROCESS_TIMEOUT_SEC):
11:36:37          started_at = time.monotonic()
11:36:37          _logger.debug(f'Attaching to {aut_id}')
11:36:37          while True:
11:36:37              try:
11:36:37  >               context = squish.attachToApplication(aut_id, SquishServer().host, SquishServer().port)
11:36:37  E               RuntimeError: Application 'AUT_095742' could not be started. Received invalid application id.
11:36:37  
11:36:37  driver/context.py:19: RuntimeError

Also:

11:36:37  >               context = squish.attachToApplication(aut_id, SquishServer().host, SquishServer().port)
11:36:37  E               RuntimeError: Timeout expired during handshake

The app crashed at 10:33:24 and the tests started at 10:32:20:

10:32:20  + python3 -m pytest --disable-warnings --alluredir=./allure-results

Though the app's actual start time was 10:32:59:

10:32:59  09:32:58.874    INFO scripts.utils.local_system  Execute: "/opt/squish-runner-7.1-20230222-1555/bin/squishserver" --configfile /home/jenkins/workspace/status-desktop/e2e/prs/squish_server.ini --config addAttachableAUT AUT_093221 localhost:62134
10:32:59  09:32:58.882    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62134 "tmp/StatusIm-Desktop-231127-091406-64a0bc-pr12866-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs/tmp/results/run_27112023_093220/status/app_093221_426178 --LOG_LEVEL=DEBUG 
10:33:05  09:33:04.450    INFO driver.context  AUT: AUT_093221 attache

That fits very closely with the segmentation fault above.

jakubgs commented 9 months ago

Although I'm not 100% convinced, as the workspace path doesn't match in this job:

Running on linux-04 in /home/jenkins/workspace/status-desktop/e2e/prs
jakubgs commented 9 months ago

This one seems to fit better: https://ci.status.im/job/status-desktop/job/e2e/job/prs/638/console

Running on linux-04 in /home/jenkins/workspace/status-desktop/e2e/prs@3

And so does the startup time:

11:28:52  + python3 -m pytest --disable-warnings --alluredir=./allure-results
...
12:58:49  Cancelling nested steps due to timeout
12:58:49  Sending interrupt signal to process
12:59:09  After 20s process did not stop

Looks like this is one of our favorite timeouts.

jakubgs commented 9 months ago

And this run matches very well with the multiple segfaults in a row:

[Mon Nov 27 10:28:58 2023] nim_status_clie[1439693]: segfault at 8 ip 00007ff6bfa62250 sp 00007ffd15d74558 error 4 in libQt5Qml.so.5[7ff6bf800000+45e000]
[Mon Nov 27 10:29:09 2023] nim_status_clie[1439981]: segfault at 8 ip 00007f7a2e862250 sp 00007ffcf58ff048 error 4 in libQt5Qml.so.5[7f7a2e600000+45e000]
[Mon Nov 27 10:29:20 2023] nim_status_clie[1440154]: segfault at 8 ip 00007f92acc62250 sp 00007ffd7e91e668 error 4 in libQt5Qml.so.5[7f92aca00000+45e000]
[Mon Nov 27 10:29:32 2023] nim_status_clie[1440322]: segfault at 8 ip 00007fe6de062250 sp 00007fff62f32558 error 4 in libQt5Qml.so.5[7fe6dde00000+45e000]
[Mon Nov 27 10:29:43 2023] nim_status_clie[1440616]: segfault at 8 ip 00007f6193262250 sp 00007ffce667cfe8 error 4 in libQt5Qml.so.5[7f6193000000+45e000]
[Mon Nov 27 10:29:54 2023] nim_status_clie[1440790]: segfault at 8 ip 00007f3e03462250 sp 00007ffcbd04df28 error 4 in libQt5Qml.so.5[7f3e03200000+45e000]
[Mon Nov 27 10:30:05 2023] nim_status_clie[1440951]: segfault at 8 ip 00007f5949e62250 sp 00007ffed150fee8 error 4 in libQt5Qml.so.5[7f5949c00000+45e000]
[Mon Nov 27 10:30:16 2023] nim_status_clie[1441353]: segfault at 8 ip 00007fab86862250 sp 00007ffcd6a1edb8 error 4 in libQt5Qml.so.5[7fab86600000+45e000]
[Mon Nov 27 10:30:27 2023] nim_status_clie[1441561]: segfault at 8 ip 00007ffaba062250 sp 00007ffe1d186ae8 error 4 in libQt5Qml.so.5[7ffab9e00000+45e000]
[Mon Nov 27 10:30:48 2023] nim_status_clie[1441895]: segfault at 8 ip 00007fc06a862250 sp 00007ffdcffb1ac8 error 4 in libQt5Qml.so.5[7fc06a600000+45e000]
[Mon Nov 27 10:31:00 2023] nim_status_clie[1442187]: segfault at 8 ip 00007f9355c62250 sp 00007ffe46a1e558 error 4 in libQt5Qml.so.5[7f9355a00000+45e000]
[Mon Nov 27 10:31:11 2023] nim_status_clie[1442358]: segfault at 8 ip 00007fc187662250 sp 00007fffd7537ce8 error 4 in libQt5Qml.so.5[7fc187400000+45e000]
[Mon Nov 27 10:31:22 2023] nim_status_clie[1442525]: segfault at 8 ip 00007ff77fe62250 sp 00007ffd49c48c38 error 4 in libQt5Qml.so.5[7ff77fc00000+45e000]
[Mon Nov 27 10:31:33 2023] nim_status_clie[1442820]: segfault at 8 ip 00007f1443462250 sp 00007ffe6f03d988 error 4 in libQt5Qml.so.5[7f1443200000+45e000]
[Mon Nov 27 10:31:44 2023] nim_status_clie[1442969]: segfault at 8 ip 00007efec8a62250 sp 00007ffe6a3c0758 error 4 in libQt5Qml.so.5[7efec8800000+45e000]
[Mon Nov 27 10:31:55 2023] nim_status_clie[1443158]: segfault at 8 ip 00007f06f2662250 sp 00007fff44170d08 error 4 in libQt5Qml.so.5[7f06f2400000+45e000]
[Mon Nov 27 10:32:06 2023] nim_status_clie[1443432]: segfault at 8 ip 00007f97e1062250 sp 00007ffc19cc5e48 error 4 in libQt5Qml.so.5[7f97e0e00000+45e000]
[Mon Nov 27 10:32:17 2023] nim_status_clie[1443624]: segfault at 8 ip 00007f0fd8662250 sp 00007ffde7388eb8 error 4 in libQt5Qml.so.5[7f0fd8400000+45e000]
[Mon Nov 27 10:32:29 2023] nim_status_clie[1443932]: segfault at 8 ip 00007fc905662250 sp 00007ffd84d76bb8 error 4 in libQt5Qml.so.5[7fc905400000+45e000]
[Mon Nov 27 10:32:40 2023] nim_status_clie[1444078]: segfault at 8 ip 00007f7f26862250 sp 00007fffbc489108 error 4 in libQt5Qml.so.5[7f7f26600000+45e000]
[Mon Nov 27 10:32:51 2023] nim_status_clie[1444384]: segfault at 8 ip 00007ffb71462250 sp 00007ffe2e490958 error 4 in libQt5Qml.so.5[7ffb71200000+45e000]
[Mon Nov 27 10:33:02 2023] nim_status_clie[1444569]: segfault at 8 ip 00007f45b2c62250 sp 00007ffccc7df4d8 error 4 in libQt5Qml.so.5[7f45b2a00000+45e000]
[Mon Nov 27 10:33:13 2023] nim_status_clie[1444781]: segfault at 8 ip 00007fe55b662250 sp 00007ffc9f567978 error 4 in libQt5Qml.so.5[7fe55b400000+45e000]
[Mon Nov 27 10:33:24 2023] nim_status_clie[1445035]: segfault at 8 ip 00007f923fc62250 sp 00007ffddf46f9b8 error 4 in libQt5Qml.so.5[7f923fa00000+45e000]
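The roughly 11-second spacing between these crashes (the harness re-launching the app over and over) can be confirmed by diffing the `dmesg -T` timestamps. A small sketch, assuming the exact timestamp format above:

```python
import re
from datetime import datetime

# Captures the human-readable timestamp from `dmesg -T` lines,
# e.g. "[Mon Nov 27 10:28:58 2023] nim_status_clie[...]: segfault ..."
TS_RE = re.compile(r"\[(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\]")

def crash_intervals(dmesg_text: str) -> list:
    """Seconds between consecutive crash timestamps in `dmesg -T` output."""
    times = [
        datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        for m in TS_RE.finditer(dmesg_text)
    ]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```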

Here we go:

10:28:53.964    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102853_845657 --LOG_LEVEL=DEBUG 
10:29:05.065    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102853_845657 --LOG_LEVEL=DEBUG 
10:29:16.309    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102853_845657 --LOG_LEVEL=DEBUG 
10:29:27.450    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102927_320435 --LOG_LEVEL=DEBUG 
10:29:38.559    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102927_320435 --LOG_LEVEL=DEBUG 
10:29:49.744    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_102927_320435 --LOG_LEVEL=DEBUG 
10:30:00.843    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
10:30:11.799    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
10:30:22.916    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
10:30:33.979    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103033_844503 --LOG_LEVEL=DEBUG 
10:30:44.371    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103033_844503 --LOG_LEVEL=DEBUG 
10:30:55.357    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103033_844503 --LOG_LEVEL=DEBUG 
10:31:06.793    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103106_675530 --LOG_LEVEL=DEBUG 
10:31:17.841    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103106_675530 --LOG_LEVEL=DEBUG 
10:31:28.805    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103106_675530 --LOG_LEVEL=DEBUG 
10:31:40.097    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103139_973454 --LOG_LEVEL=DEBUG 
10:31:51.146    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103139_973454 --LOG_LEVEL=DEBUG 
10:32:02.166    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103139_973454 --LOG_LEVEL=DEBUG 
10:32:13.388    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103213_266041 --LOG_LEVEL=DEBUG 
10:32:24.423    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103213_266041 --LOG_LEVEL=DEBUG 
10:32:35.578    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103213_266041 --LOG_LEVEL=DEBUG 
10:32:46.675    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103246_551419 --LOG_LEVEL=DEBUG 
10:32:57.906    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103246_551419 --LOG_LEVEL=DEBUG 
10:33:08.919    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103246_551419 --LOG_LEVEL=DEBUG 
10:33:20.177    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103320_056118 --LOG_LEVEL=DEBUG 

This is the app being tested: https://ci.status.im/job/status-desktop/job/prs/job/linux/job/x86_64/job/package/job/PR-12856/7/

https://status-im-desktop-prs.ams3.digitaloceanspaces.com/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.tar.gz

jakubgs commented 9 months ago

I attempted to run this AppImage locally and indeed it segfaults:

Traceback (most recent call last)
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/src/nim_status_client.nim(208) nim_status_client
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/src/nim_status_client.nim(181) mainProc
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/src/app/boot/app_controller.nim(298) delete
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/src/app/modules/startup/io_interface.nim(19) delete
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/src/app/modules/startup/module.nim(87) delete
/home/jenkins/workspace/rs_linux_x86_64_package_PR-12856@2/vendor/nimqml/src/nimqml/private/qqmlapplicationengine.nim(29) setRootContextProperty
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Commented on the PR: https://github.com/status-im/status-desktop/pull/12856

jakubgs commented 9 months ago

The fact that we do not log ANYTHING about how the app crashed when being started is a major issue:

11:30:01  -------------------------------- live log setup --------------------------------
11:30:01  10:30:00.836    INFO scripts.utils.local_system  Execute: "/opt/squish-runner-7.1-20230222-1555/bin/squishserver" --configfile /home/jenkins/workspace/status-desktop/e2e/prs@3/squish_server.ini --config addAttachableAUT AUT_103000 localhost:62138
11:30:01  10:30:00.843    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
11:30:12  10:30:11.794    INFO scripts.utils.local_system  Execute: "/opt/squish-runner-7.1-20230222-1555/bin/squishserver" --configfile /home/jenkins/workspace/status-desktop/e2e/prs@3/squish_server.ini --config addAttachableAUT AUT_103000 localhost:62138
11:30:12  10:30:11.799    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
11:30:24  10:30:22.906    INFO scripts.utils.local_system  Execute: "/opt/squish-runner-7.1-20230222-1555/bin/squishserver" --configfile /home/jenkins/workspace/status-desktop/e2e/prs@3/squish_server.ini --config addAttachableAUT AUT_103000 localhost:62138
11:30:24  10:30:22.916    INFO scripts.utils.local_system  Execute: /opt/squish-runner-7.1-20230222-1555/bin/startaut --port=62138 "tmp/StatusIm-Desktop-231127-100907-44682e-pr12856-x86_64.AppImage" -d=/home/jenkins/workspace/status-desktop/e2e/prs@3/tmp/results/run_27112023_102853/status/app_103000_719513 --LOG_LEVEL=DEBUG 
11:30:35  ERROR 

All we get are 3 retries and then just ERROR, which is useless.
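One way out of that dead end: have the launcher tee the AUT's output to a file and report its exit status whenever attaching fails. A minimal sketch; `launch_aut` and `describe_exit` are hypothetical names, not the actual driver API:

```python
from __future__ import annotations

import subprocess

def launch_aut(cmd: list, log_path: str) -> subprocess.Popen:
    """Start the AUT with stdout/stderr teed to a log file for post-mortem."""
    log_file = open(log_path, "wb")
    return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)

def describe_exit(proc: subprocess.Popen):
    """If the process already exited, say how. A negative returncode
    means it died from a signal (e.g. -11 is SIGSEGV)."""
    code = proc.poll()
    if code is None:
        return None  # still running
    if code < 0:
        return f"AUT killed by signal {-code}"
    return f"AUT exited with code {code}"
```

With something like this, a retry loop could log "AUT killed by signal 11" plus the captured output instead of a bare ERROR.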

jakubgs commented 9 months ago

The fact that none of this was logged in the E2E run means there are severe issues in this file: https://github.com/status-im/desktop-qa-automation/blob/master/driver/aut.py

It needs major refactoring to address several issues.

jakubgs commented 7 months ago

This is still an issue, we can still find segfaults in logs for Linux CI hosts:

jakubgs@linux-02.he-eu-hel1.ci.devel:~ % sudo dmesg -T | grep segfault
[Thu Feb  8 08:56:42 2024] QQmlThread[1332334]: segfault at 15 ip 00007f8b0e492c30 sp 00007f88cbff9680 error 4 in libQt5Qml.so.5[7f8b0e200000+45e000]
[Thu Feb  8 14:47:26 2024] QQmlThread[3500731]: segfault at 15 ip 00007ff409092c30 sp 00007ff1d1ff5680 error 4 in libQt5Qml.so.5[7ff408e00000+45e000]
[Thu Feb  8 15:23:20 2024] QQmlThread[3610479]: segfault at 2c ip 00007fdea1892c30 sp 00007fdc667f6ea0 error 4 in libQt5Qml.so.5[7fdea1600000+45e000]
[Fri Feb  9 07:22:35 2024] QQmlThread[1026495]: segfault at 400000015 ip 00007f8549692c30 sp 00007f83167f6680 error 4 in libQt5Qml.so.5[7f8549400000+45e000]
[Sat Feb 10 18:40:03 2024] QQmlThread[1312983]: segfault at 15 ip 00007ff744692c30 sp 00007ff5117f4680 error 4 in libQt5Qml.so.5[7ff744400000+45e000]
[Mon Feb 12 08:31:02 2024] QQmlThread[3011691]: segfault at 15 ip 00007f7f1b292c30 sp 00007f7cecff3ea0 error 4 in libQt5Qml.so.5[7f7f1b000000+45e000]
anastasiyaig commented 4 months ago

@jakubgs did you observe any such segfaults recently on the Linux hosts? I think they're all gone after the test refactoring, but I want to double check whether this is still happening.

jakubgs commented 3 months ago

Plenty:

jakubgs@linux-01.he-eu-hel1.ci.devel:~ % sudo dmesg | grep segfault | tail
[8242524.181436] QQmlThread[3799208]: segfault at 14 ip 00007fa5a6692c30 sp 00007fa40effc2a0 error 4 in libQt5Qml.so.5.15.2[7fa5a6400000+45e000]
[8418999.075082] QQmlThread[2846267]: segfault at 15 ip 00007f0f79492c30 sp 00007f0de2778e90 error 4 in libQt5Qml.so.5.15.2[7f0f79200000+45e000]
[9399060.430396] QQmlThread[3374280]: segfault at 15 ip 00007f1959892c30 sp 00007f185a7fde90 error 4 in libQt5Qml.so.5.15.2[7f1959600000+45e000]
[9540351.989685] QQmlThread[1606813]: segfault at 2c ip 00007f9ef0892c30 sp 00007f9df97fe2a0 error 4 in libQt5Qml.so.5.15.2[7f9ef0600000+45e000]
[9540357.390422] QQmlThread[1606978]: segfault at 14 ip 00007fada4892c30 sp 00007fac1a7fb2a0 error 4 in libQt5Qml.so.5.15.2[7fada4600000+45e000]
[9658341.995155] QQmlThread[935229]: segfault at 15 ip 00007f5eb9692c30 sp 00007f5c9cff3640 error 4 in libQt5Qml.so.5[7f5eb9400000+45e000]
[10080334.493555] QmlTests[1019106]: segfault at 7fffd82cefe8 ip 00007f0f51c103a7 sp 00007fffd82cefd0 error 6 in libQt5QuickTest.so.5.15.2[7f0f51c00000+1e000]
[10091569.163046] QQmlThread[1728226]: segfault at 200000028 ip 00007f89824c9d93 sp 00007f887f3fde60 error 4 in libQt5Qml.so.5.15.2[7f8982400000+45e000]
[10091573.377684] QQmlThread[1731887]: segfault at 7f515ecd1368 ip 00007f52da2c9d93 sp 00007f5129ffa270 error 4 in libQt5Qml.so.5.15.2[7f52da200000+45e000]
[10167706.124206] QmlTests[1703447]: segfault at 7ffd33b9cff8 ip 00007fceeaa10392 sp 00007ffd33b9d000 error 6 in libQt5QuickTest.so.5.15.2[7fceeaa00000+1e000]
anastasiyaig commented 3 months ago

Alright, I know what these are: indeed, sometimes Linux crashes on startup. Our hope is switching to Qt 5.15.8.