sonyxperiadev / bug_tracker

Empty repository that is used as a bugtracker for Open Devices project

[Yoshino][Lilac] Intermittent crashes / reboots #399

Closed stefanhh0 closed 4 years ago

stefanhh0 commented 5 years ago

Platform: Yoshino
Device: Lilac
Kernel version: 4.9.170-g0227b68deb75
Android version: android-9.0.0_r35
OEM Image: SW_binaries_for_Xperia_Android_9.0_2.3.2_v8_yoshino.img
Build: aosp_g8441-userdebug 9 PQ2A.190405.003 eng.stefan.20190502.183239
Version baseband: 1308-8921_47.1.A.16.20

Description
With the latest kernel 0227b68deb75, AOSP got significantly less stable. Unfortunately I don't know exactly which commit I was on when it was more stable; I just remember that it must have been a commit from May 7th or 8th.

The crash and reboot happens relatively often when waking up the phone (I don't know if the phone is sleeping or in deep sleep; I can only say that the screen is off). The phone then fails to activate the screen, which remains dark. After some time the vibrator is activated twice, with a short pause between the first and second activation.

I am not sure if that means a double crash - would be nice if someone could say if my assumption of a double crash is right. Additionally if so, the original crash is then not available since the information in /sys/fs/pstore is overwritten with the information from the second crash, correct?

It is not a double crash but just normal, see also comment here: https://github.com/sonyxperiadev/bug_tracker/issues/580#issuecomment-631097204:

However when it crashes, it always activates the vibrator twice, not sure if that means a double crash, no one ever confirmed it is double crashing as far as I remember, maybe not easy to say what it means?

Not that I remember, afaik it always double-buzzes for me. A double crash would be where the display at least goes past the bootloader, into the kernel (seeing the "your device is unlocked" warning) before restarting again.

After the phone is back /sys/fs/pstore contains some files: pstore.tar.gz

Symptoms
Phone screen remains black when trying to activate -> double crash / reboot.

How to reproduce
Happens intermittently; no solid recipe is available, but it happens several times a day under normal usage.

oshmoun commented 5 years ago

the logs don't explicitly reference a crash unfortunately. however, there are some weird messages from somc_panel that shouldn't be there i think. @kholk @MarijnS95 looks like something is up with color calibration?

stefanhh0 commented 5 years ago

Had another two different crashes.

First one is like above and was a double crash where the vibrator was activated twice. Again, I guess no crash is referenced directly in the files, and they are quite similar to the file I posted above. By the way, the vibrator is activated the second time (presumably the 2nd crash of the double crash) before the white screen with the black Sony logo appears, so it happens really early in the start-up phase: pstore-1.tar.gz

It also contains the color calibration messages:

[ 3599.213047] somc_panel_color_manager: somc_panel_inject_crtc_overrides (751): Override: Already have original funcs! Is setup called twice??
[ 3599.213193] somc_panel_color_manager: somc_panel_pcc_setup (855): u,v is flashed 0.
[ 3599.213297] somc_panel_color_manager: somc_panel_colormgr_apply_calibrations: Couldn't apply PCC calibration
[ 3599.213428] somc_panel_color_manager: somc_panel_colormgr_apply_calibrations: Cannot send HSIC calibration
[ 3637.044419] somc_panel_color_manager: somc_panel_inject_crtc_overrides (751): Override: Already have original funcs! Is setup called twice??
[ 3637.044757] somc_panel_color_manager: somc_panel_pcc_setup (855): u,v is flashed 0.
[ 3637.045119] somc_panel_color_manager: somc_panel_colormgr_apply_calibrations: Couldn't apply PCC calibration
[ 3637.045584] somc_panel_color_manager: somc_panel_colormgr_apply_calibrations: Cannot send HSIC calibration

I have also extracted other messages that may or may not indicate a problem:

[ 3853.426166] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64MMFR0_EL1. Boot CPU: 0x00000000001122, CPU4: 0x00000000101122

This message appears often, but only for CPUs 4 to 7, not for 0 to 3.
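As a rough aid for reading such lines, the following Python sketch parses a SANITY CHECK message and reports which 4-bit fields differ between the boot CPU and the other CPU. The helper names `parse_sanity_line` and `differing_fields` are made up for this illustration, and the 4-bit field width is an assumption about how the register is laid out. For the line above it reports bits 20 to 23, which the Arm register documentation appears to describe as the TGran16 (16 KB granule) field of ID_AA64MMFR0_EL1; that mapping should be verified against the register reference before drawing conclusions.

```python
import re

# Pattern for the "SANITY CHECK" line quoted in this comment.
SANITY_RE = re.compile(
    r"SANITY CHECK: Unexpected variation in (?P<reg>\w+)\. "
    r"Boot CPU: (?P<boot>0x[0-9a-fA-F]+), CPU(?P<cpu>\d+): (?P<other>0x[0-9a-fA-F]+)"
)


def differing_fields(boot, other, width=4):
    """Return (low_bit, high_bit) for every width-bit field that differs.

    The default width of 4 is an assumption about the register layout.
    """
    diff = boot ^ other
    fields = []
    shift = 0
    while diff:
        if diff & ((1 << width) - 1):
            fields.append((shift, shift + width - 1))
        diff >>= width
        shift += width
    return fields


def parse_sanity_line(line):
    """Parse one dmesg line; return None if it is not a SANITY CHECK line."""
    m = SANITY_RE.search(line)
    if m is None:
        return None
    boot = int(m["boot"], 16)
    other = int(m["other"], 16)
    return {
        "reg": m["reg"],
        "cpu": int(m["cpu"]),
        "fields": differing_fields(boot, other),
    }


line = (
    "[ 3853.426166] CPU features: SANITY CHECK: Unexpected variation in "
    "SYS_ID_AA64MMFR0_EL1. Boot CPU: 0x00000000001122, CPU4: 0x00000000101122"
)
print(parse_sanity_line(line))
```

If the differing field is indeed a feature that only the big cores implement, the message would be expected on a big.LITTLE SoC and unrelated to the crashes, but nothing in this thread confirms that.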


[ 3771.593884] kgsl kgsl-3d0: |counter_delta| Abnormal value:0x101b85b (0x1026c0d) from perf counter : 0x3b0

[ 3636.506097]  cache: parent cpu3 should not be sleeping

This appears for CPUs 1 to 6, not for 0 and 7.


[ 3739.948268] CHRDEV "qcwlanstate" major number 220 goes below the dynamic allocation range

[ 3739.951842] ipa ipa3_uc_reg_rdyCB:1774 bad parm. inout=          (null)
[ 3739.980534] ipa ipa3_uc_reg_rdyCB:1774 bad parm. inout=          (null)
[ 3739.982445] send_filled_buffers_to_user: Send Failed -22 drop_count = 1
[ 3739.987369] ipa ipa3_uc_reg_rdyCB:1774 bad parm. inout=          (null)
[ 3740.044557] IPC_RTR: process_new_server_msg: Server 00001003 create rejected, version = 0

[ 3739.938030] wlan: Loading driver v5.1.1.69T ()
[ 3740.200959] cnss_utils: WLAN MAC address is not set, type 0

After the double crash described above, the phone booted, and shortly afterwards it crashed and rebooted again, this time with the vibrator being activated only once. The pstore files contain more information this time: a kernel call stack and also buffer underflow errors. However, I think these two kinds of crashes are completely different: pstore-2.tar.gz

stefanhh0 commented 5 years ago

On a fresh build I got just two more double crashes (vibrator was activated twice):

Kernel version: 4.9.174-gd7ab313501f1
Android version: android-9.0.0_r37
OEM Image: SW_binaries_for_Xperia_Android_9.0_2.3.2_v8_yoshino.img
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190514.014237
Version baseband: 1308-8921_47.1.A.16.20

pstore-1.tar.gz pstore-2.tar.gz

This time the files contain kernel exceptions. I hope they are more helpful for finding the root cause of the crashes.

stefanhh0 commented 5 years ago

Just got two more of those double crashes. As of the time of writing this comment I am on the latest commits:
Kernel version: 4.9.174-g6f8c28697397
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190514.185903

pstore-1.tar.gz pstore-2.tar.gz

stefanhh0 commented 5 years ago

Kernel version: 4.9.174-gfc821b0441e9
Android version: android-9.0.0_r37
OEM Image: SW_binaries_for_Xperia_Android_9.0_2.3.2_v8_yoshino.img
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190516.231020
Version baseband: 1308-8921_47.1.A.16.20

With a fresh up-to-date build from yesterday late evening it is still occurring; it has already happened several times shortly after I flashed and booted this morning. The overall stability of Yoshino is currently poor: over the last two days the double crashes/reboots occurred several times a day, I guess around 8 to 10 times, though I haven't counted exactly. Can someone confirm my observations? Are the pstore contents useful, or is there something else I could provide that would help you find the root cause(s) of the problem(s)?

Here are two pstores from that build; their contents look different. pstore.tar.gz pstore-2.tar.gz

Just in case, I have also saved a dmesg file from a freshly booted system after it managed to start up normally: dmesg.log

MarijnS95 commented 5 years ago

@oshmoun @stefanhh0 The color calibrations are nothing to worry about, though annoying (spammy) and wasting CPU cycles. As far as I understand, retrieving a 0, 0 calibration from the display indicates an error has occurred (following the code), but after some discussion it seems this is a valid case where the display shows "ground truth" without needing any extra adjustment.

I propose to check every device that exhibits this behaviour, and decide:

  1. An issue in somc,mdss-dsi-uv-command, conv_uv_data, or another piece of code (specific to Yoshino);
  2. Enable somc,mdss-dsi-pcc-force-cal for Yoshino;
  3. Remove the check altogether, when 0, 0 is a valid response for every display and platform.

I do have a Mermaid here that prints the same result, but didn't manage to check whether this is normal.

In the end, even just making the code continue without doing the setup every time the PCC "changes" (happens when the display turns on) will save on noise and cycles.

stefanhh0 commented 5 years ago

With the build:
Kernel version: 4.9.174-ga85976871290
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190522.213614
AOSP is a lot more stable than it used to be. Not a single crash in the last 14 hours (uptime: 13:55). That is really a huge improvement for the overall stability.

With the older builds, including the build before the current one:
Kernel version: 4.9.174-g7f4c5dfbbd84
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190521.193309
AOSP used to crash several times a day, so the changes between those two builds have improved the stability. The timestamp in Build: reflects the time when the repos were synced, since I always build shortly after syncing.

I would like to keep the ticket open, since I don't know whether the double crash issue is fixed as well, and to see if the coming builds confirm the improvement in overall stability. I will get back during the next week with my findings.

Good job and thank you all for bringing back some stability to AOSP!

stefanhh0 commented 5 years ago

Platform: Yoshino
Device: Lilac
Kernel version: 4.9.174-ga85976871290
Android version: android-9.0.0_r37
OEM Image: SW_binaries_for_Xperia_Android_9.0_2.3.2_v8_yoshino.img
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190522.213614
Version baseband: 1308-8921_47.1.A.16.20

That said, the phone had another double crash. This time a dmesg file was written as well: pstore.tar.gz

To my surprise the dmesg file references the previous kernel:

<6>[27876.006251] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.9.174-g7f4c5dfbbd84 #1

And not the current one, 4.9.174-ga85976871290 (which then again shows up when I boot the phone and type uname -a). Is that due to the A/B mechanics? How can I make sure that slots A and B hold the same images? I think it would make sense to no longer have the unstable version with kernel 4.9.174-g7f4c5dfbbd84 on my phone at all, so that I can get a glimpse of the original logs that were written when the phone crashed the first time.

stefanhh0 commented 5 years ago

No, totally wrong. I realized that the timestamp of the dmesg file is from the early morning, so it is an old file that has nothing to do with the other files in the pstore.tar.gz or with the latest double crash.

oshmoun commented 5 years ago

console-ramoops-0 should be the file containing dmesg of the directly preceding boot.
dmesg-ramoops-x are of previous boots, so that explains the old kernel version
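A quick way to spot a leftover dump like this is to compare the kernel release printed in the dump header with the release of the running kernel. The sketch below is only an illustration: the function names are hypothetical, and the regular expression assumes the header format of the line quoted earlier in this thread, which may differ on other kernels.

```python
import re

# Assumed header format, taken from the line quoted in this thread:
# "CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.9.174-g7f4c5dfbbd84 #1"
RELEASE_RE = re.compile(
    r"CPU: \d+ PID: \d+ Comm: \S+ .*?(\d+\.\d+\.\d+\S*) #\d+"
)


def kernel_release_in_dump(text):
    """Return the first kernel release found in a dump, or None."""
    m = RELEASE_RE.search(text)
    return m.group(1) if m else None


def is_from_older_kernel(text, running_release):
    """True if the dump names a release other than the running one."""
    found = kernel_release_in_dump(text)
    return found is not None and found != running_release


sample = (
    "<6>[27876.006251] CPU: 4 PID: 0 Comm: swapper/4 "
    "Tainted: G S W 4.9.174-g7f4c5dfbbd84 #1"
)
print(kernel_release_in_dump(sample))
print(is_from_older_kernel(sample, "4.9.174-ga85976871290"))
```

On the device, the running release could be taken from the output of uname -r and the dump text read from a dmesg-ramoops-x file; a mismatch suggests the file predates the current build.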

stefanhh0 commented 5 years ago

Kernel version: 4.9.177-g69cc32601555
Android version: android-9.0.0_r37
OEM Image: SW_binaries_for_Xperia_Android_9.0_2.3.2_v8_yoshino.img
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190523.192026
Version baseband: 1308-8921_47.1.A.16.20

The double crash/reboot also occurs on 4.9.177: it happened when I booted the device for the very first time directly after flashing the new build. After the double crash/reboot the device started successfully.

This time several exceptions are logged: pstore.tar.gz

stefanhh0 commented 5 years ago

Device: Lilac
Platform: Yoshino
Kernel version: 4.9.182-gfa7fb2c467d4-dirty
Android version: android-9.0.0_r37
Software binaries version: SW_binaries_for_Xperia_Android_9.0_2.3.2_v9_yoshino.zip
Version baseband: 1308-8921_47.1.A.16.20
Build: aosp_g8441-userdebug 9 PQ3A.190505.002 eng.stefan.20190619.192046

Description
Just an update: it is still happening on a fresh clean build. The phone was connected to USB and the screen was off. I tried to wake the phone via the fingerprint sensor. The phone did not activate the screen; instead, after some seconds, the already reported double crash and reboot occurred.

Symptoms
In various situations: hang -> double crash -> reboot.

How to reproduce No reliable recipe found yet, however it happens from time to time when using the phone (every several hours)

Additional context Again exceptions and other failed and error messages can be found in console-ramoops-0, however I can't say whether or not those messages are somehow helpful in identifying the root cause of the problem. pstore.tar.gz

stefanhh0 commented 5 years ago

Kernel version: 4.9.182-g6593c13acef8
Android version: android-9.0.0_r44
Software binaries version: SW_binaries_for_Xperia_Android_9.0_2.3.2_v9_yoshino.zip
Version baseband: 1308-8921_47.1.A.16.20
Build: aosp_g8441-userdebug 9 PQ3A.190705.003 eng.stefan.20190705.182120

Just another update, it is still happening on latest android/kernel: pstore.tar.gz

kholk commented 5 years ago

I know about the crashes during charging, that's due to the charger thermal zone going nuts after multiple suspend-resume cycles. It's safe, because we are monitoring lots of zones and the one that goes nuts is a duplicate of what we already check.... But we cannot remove it, otherwise the charger stops working.....

This kind of crash cannot be resolved on kernel 4.9 due to the fact that the entire RPM framework is royally f*****. Or at least I have never found a way to.

Regarding the other kind of crashes, these may be due to a clock being stuck and failing gracefully, producing the apparently-all-ok crash behavior.

I have been able to solve some issues on other platforms on kernel 4.9 during the 4.14 porting (because I've had to examine the entire thing again).... I've reached Yoshino 2 days ago, let's see if I can spot anything on there!

P.S.: I think you deserve this info. On kernel 4.14 the RPM was finally migrated to the upstream RPMSG API, which is solving most of the big issues that the old crapped one on 4.9 currently has. It's not an excuse or something but, in case we can't do anything good here, there's a good hope for the future, I think.

stefanhh0 commented 5 years ago

Thanks a lot for the info. I am happy to get some feedback and am looking forward to kernel 4.14. Despite the occasional crashes, the phone is all in all usable with AOSP.

stefanhh0 commented 4 years ago

Platform: Yoshino
Device: Lilac
Kernel version: 4.14.176-gf0356fa3bcac
Android version: android-10.0.0_r36
Software binaries version: SW_binaries_for_Xperia_Android_10.0.7.1_r1_v6_yoshino.img
Version baseband: 1307-7511_47.2.A.11.228
Build: aosp_g8441-userdebug 10 QQ2A.200501.001.B3 eng.stefan.20200510.170705

Retested; the phone was connected via USB. Before the reboot I removed all files in /sys/fs/pstore.

After the system was up again I found the following files in /sys/fs/pstore:

console-ramoops-0.log dmesg-ramoops-0.log dmesg-ramoops-1.log pmsg-ramoops-0.log
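Since an old dmesg file was mistaken for a fresh one earlier in this thread, it can help to list the pstore files together with their modification times before archiving them. The following is a minimal sketch, assuming the files are readable at the given path (on a device this usually requires root); `list_pstore` and `format_row` are hypothetical helper names, not part of any existing tool.

```python
import os
import time


def list_pstore(path="/sys/fs/pstore"):
    """Return (mtime, name, size) tuples for regular files, oldest first."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                st = entry.stat()
                entries.append((st.st_mtime, entry.name, st.st_size))
    return sorted(entries)


def format_row(mtime, name, size):
    """Render one entry as 'YYYY-MM-DD HH:MM:SS  <size>  <name>'."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
    return f"{stamp}  {size:>8}  {name}"


# Example usage (left as a comment so nothing touches the filesystem on import):
#   for row in list_pstore():
#       print(format_row(*row))
```

Files whose timestamps are far older than the most recent reboot most likely belong to an earlier crash and can be set aside when reporting.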

stefanhh0 commented 4 years ago

Well, it is currently just a reboot issue and not the original problem. Just let me know if I should open a new clean issue, but the basic info is anyway in my previous comment.

stefanhh0 commented 4 years ago

I am closing this issue in favor of #580. This one is very old, and now that my misinterpretation of experiencing a double crash has been clarified, there is no need to keep it open.