sonyxperiadev / bug_tracker

Empty repository that is used as a bugtracker for Open Devices project
52 stars 13 forks source link

"Sensor is null" on kugo and Android 8.1 #515

Closed yshui closed 3 years ago

yshui commented 4 years ago

Device: kugo Kernel version: 4.4 Android version: 8.1 Software binaries version: v17

Description The phone app crashes after boot, similar to #265

yshui commented 4 years ago

Looks like starting sensors on sys.qcom.devup=1 does solve this problem

ix5 commented 4 years ago

Can you prepare a PR against o-mr1 if that indeed reliable fixes the issue?

yshui commented 4 years ago

@ix5 I was hoping someone with more knowledge than me could investigate the problem and figure out the root cause. I just did revert some changes without understanding them.

chris42 commented 4 years ago

@yshui Do you have some information on how to reproduce the error?

I have a Kugo running on Android 10 and checked logs for the error. I only found it once in multiple boots. Is there anything else that I need to look for or do?

yshui commented 4 years ago

@chris42 it happens to me consistently on Android 8.1

chris42 commented 4 years ago

This would go in PRODUCT_PROPERTY_OVERRIDES, right?

MarijnS95 commented 4 years ago

@yshui Would you mind sharing some logs? I'm curious if/why sensors does not come up unless manually started in property:sys.qcom.devup=1. It is either started too early or not at all, apparently.

Unfortunately 8.1 is really far ago, is there any reason you're still stuck there apart using a specific ROM like SFOS, or kernel 4.4 for stability? We would be better able to help you on Android 9, or Android 10 even as soon as it stabilizes on legacy platforms.

(Btw thanks for maintaining picom :wink:)

@chris42 Even though it is a property, no. Properties added to that variable end up in /vendor/build.prop. @yshui is referring to a property trigger, like this one to start sensors in Android 9/10:

(EDIT: Github doesn't seem to expand this permalink outside of the repository)

on property:vendor.qcom.slpiup=1
    enable vendor.sensors
chris42 commented 4 years ago

@MarijnS95 Thanks for answering even my noob questions.

Do you need regular logs or something special? Otherwise I can lookout in my current builds, if it comes up again. Although it is Android 10.

yshui commented 4 years ago

@MarijnS95 (this might be OT, but) I was stuck on Oreo because kernel 4.9 is quite unstable for me, with frequent freeze+restart and wifi connection problems. If Android 10 is good now, I totally will switch over.

As for the log, I don't have relevant logs currently, but from memory, I didn't see anything out of ordinary.

yshui commented 4 years ago

The fix I used is from here: https://github.com/sonyxperiadev/device-sony-common/pull/522, I believe.

This change is later reverted for some reason.

MarijnS95 commented 4 years ago

@yshui Understandable, though I don't remember the issues on 4.9 to be that bad back when I used Suzu. Unfortunately the issues are all hardware/kernel, nothing an Android version can change. Hopefully 4.14 or mainline brings more luck somewhere in the future.

I still want to see the logs, especially dmesg. What may not seem relevant (in terms of crashes/errors) to you will still be relevant to us; I'd like to see if the properties are even triggered in the first place.

The "fix" is not necessarily reverted as much as it has been moved over to different properties. It doesn't seem unlikely that a specific combination of encryption params is left out, hence never enabling/starting the service.

This raises a bunch of questions:

Combining that together, how about this patch? I'm not aware of any property representing post-fs-data, and something like on post-fs-data && property:sys.qcom.devup=1 will only execute when sys.qcom.devup happens to be equal to 1 while post-fs-data triggers are evaluated (and is completely missed otherwise, if the documentation is to be believed).

yshui commented 4 years ago

@MarijnS95 Thanks, I will try your patch when I get the chance to rebuild my rom.

MarijnS95 commented 4 years ago

@yshui FYI if rebuiling the rom is a major burden you can always copy the file to the device directly (vendor/etc/init/sensors.rc) or apply the patch on-device. Keep in mind that you'll have to undo the changes to init.common.rc if the previously linked patch is applied.

yshui commented 4 years ago

@MarijnS95 That patch doesn't seem to solve the problem.

dmesg.txt

MarijnS95 commented 4 years ago

@yshui on post-fs-data triggers well before the partition is mounted, apparently. This would cause the service to be enabled well before it should if it weren't for late_start starting services after the partition is mounted decrypted.

The main difference with https://github.com/sonyxperiadev/device-sony-common/pull/522 is that that patch starts the service as soon as devup=1 which happens pretty much instantaneously after boot. In turn leading me to believe that it is not depending on /data at all :thinking: The current code is anyway inconsistent, mixing enable and start.

Let's reiterate the problem:

  1. Is something actually broken?
  2. Are you merely seeing these "sensor is null" crashes from TransPower (loaded into com.android.phone) until the device is unlocked?

What I'm interested in now is the following:

The difference with https://github.com/sonyxperiadev/device-sony-common/pull/522 being that this enables the service to be started with class core rather than started instantaneously, and also restarted in the event of a crash.

yshui commented 4 years ago

Is something actually broken?

Yes, the phone app keeps crashing.

I will try the other things.

MarijnS95 commented 4 years ago

Is something actually broken?

Yes, the phone app keeps crashing.

Keeps crashing... All the time, or until the phone is unlocked and sensors become available?

If it keeps crashing that means sensors does not come up when started too late, even though it "needs to be started late" for /data to be available. Unfortunately these changes have been made before I joined this project; I don't fully know why that /data dependency suddenly got introduced, and why it's gone now.

yshui commented 4 years ago

@MarijnS95 All the time. BTW I am on FDE, not FBE.

What I'm interested in now is the following: [snip]

This seems to work fine. I don't see any warning related to sensors in logcat.

MarijnS95 commented 4 years ago

@MarijnS95 All the time. BTW I am on FDE, not FBE.

Yes, legacy devices "only" support FDE. On newer platforms with FBE (where I do not see /data/{misc,vendor}/sensors being used either) it is possible that these folders are encrypted with the hardware key only and thus available before the user enters their pincode.

This seems to work fine. I don't see any warning related to sensors in logcat.

Great, thanks for testing. I'll discuss whether it is safe to remove all the (bloated) encryption handling from Oreo. The problem here is that I don't want to break other platforms that may oddly use this /data partition...

yshui commented 4 years ago

Is something actually broken?

Ah, I forgot to mention, adaptive brightness also didn't work.

MarijnS95 commented 4 years ago

Ah, I forgot to mention, adaptive brightness also didn't work.

I wanted to ask that, but com.android.phone crashing is a clear indicator that it restarts and probes sensors again, rather than being stuck with missing sensors because it itself starts too early.

chris42 commented 4 years ago

@MarijnS95 So while trying to recreate the error with adsprpcd, I got some "sensor is null" errors... The "q-mr1-legacy" branch seems to be the only one to not have the change you suggested? Is it worthwhile to test the change? Error Log: 4th.txt

MarijnS95 commented 4 years ago

got some "sensor is null" errors

@chris42 Did they disappear after a while, or did the lack of sensors persist until the device was rebooted?

Q already has very similar changes, tailored to the multi-start system where adsp/cdsp/slpi are started separately. The created /data/vendor/sensors seems useless but is also harmless. So no, this patch for o-mr1 is "based off of" Q rather than the other way around: it cannot be applied on Q.

Allow me anyway to pick some interesting log lines:

1070:08-24 04:38:30.584     0     0 I init    : Parsing file /vendor/etc/init/android.hardware.sensors@1.0-service.rc...
1096:08-24 04:38:30.594     0     0 I init    : Parsing file /vendor/etc/init/sensors.rc...
1097:08-24 04:38:30.595     0     0 I init    : Parsing file /vendor/etc/init/slpistart.rc...
1574:08-24 04:38:31.660     0     0 I init    : processing action (post-fs-data) from (/vendor/etc/init/sensors.rc:1)
1638:08-24 04:38:31.787     0     0 I init    : starting service 'vendor.sensors-hal-1-0'...
1701:08-24 04:38:31.897   550   550 E libsensor1: check_sensors_enabled: Sensors enabled = true

1971:08-24 04:38:33.127     0     0 I init    : processing action (vendor.qcom.adspup=1) from (/vendor/etc/init/slpistart.rc:8)
1972:08-24 04:38:33.128     0     0 I init    : starting service 'slpistart_sh'...
1984:08-24 04:38:33.162     0     0 E sensors-ssc soc: qcom,msm-ssc-sensors: slpi_load_fw: SLPI image loading failed
1987:08-24 04:38:33.203     0     0 I init    : processing action (vendor.qcom.slpiup=1) from (/vendor/etc/init/sensors.rc:27)
1988:08-24 04:38:33.204     0     0 I init    : starting service 'vendor.sensors'...
1989:08-24 04:38:33.208     0     0 I init    : Service 'slpistart_sh' (pid 641) exited with status 0
2002:08-24 04:38:33.276   647   652 E Sensors : sns_reg_la.c(121):read error, err:5, attempted:52, read:0
2003:08-24 04:38:33.276   647   652 E Sensors : sns_reg.c(103):sns_reg_read failed; err 5

2465:08-24 04:38:36.732     0     0 I init    : processing action (post-fs-data) from (/vendor/etc/init/sensors.rc:1)
2571:08-24 05:38:36.903   550   550 E libsensor1: wait_for_service: Service 0 is not available (5000 ms)
2575:08-24 05:38:36.907   550   550 E qti_sensors_hal: sensor1 open failed!
2590:08-24 05:38:36.916   550   550 I ServiceManagement: Registered android.hardware.sensors@1.0::ISensors/default (start delay of 5133ms)
2592:08-24 05:38:36.925   550   550 I ServiceManagement: Removing namespace from process name android.hardware.sensors@1.0-service to sensors@1.0-service.
2597:08-24 05:38:36.927   550   550 I android.hardware.sensors@1.0-service: Registration complete for android.hardware.sensors@1.0::ISensors/default.
3562:03-30 20:11:30.426   534   706 E libsensor1: wait_for_service: Service 0 is not available (5000 ms)

And more, but missing SLPI firmware likely has all to do with it.

chris42 commented 4 years ago

The errors stopped after 1 or 2 minutes and the device is still running in this boot cycle without showing other issues. Hence they seem to go away. With the change I was targeting the

on property:sys.qcom.devup=1
    enable sensors

in the end. However if that is not needed, I am fine?! I have a hard time understanding how this works or needs to be, as there seems to be no publicly available information around it? (At least I could not find anything). Same with the SLPI firmware. Good to know that something seems to be missing, however the pages with howtos on SLPI firmware seem to be slim... ;-D

MarijnS95 commented 4 years ago

The errors stopped after 1 or 2 minutes and the device is still running

Are sensors actually working? It is known that transpower stops complaining about missing sensors after a while, but you should check with dumpsys sensorservice or a sensor test app. I doubt the SLPI firmware is magically found, but other log messages are similar to what I see on a working Nile/Ganges (which, as far as I'm aware, have a more similar sensor binary to Loire/Tone than the newer Tama/Kumano would).

However if that is not needed, I am fine?!

Yes, you are totally fine. o-mr1 boots all these subsystems synchronously and sets devup=1. On Q these all run in parallel (see links in previous post) and that variable doesn't exist anymore. It has been replaced with three separate ones, adspup/cdspup/slpiup. Slpiup is the one sensors needs to wait on, as is already done.

I have a hard time understanding how this works

There are various things tied into this system, which are complicated to explain (I'm by no means an expert). Furthermore these are device internals so I see little reason for manufacturers to (publicly) document how their system works. In the context of this issue we're looking at startup behaviour specifically. adspup.sh brings up the DSP, which in turn allows slpiup.sh to run and bring up the SLPI subsystem (which loads firmware etc). When SLPI is up the sensors are allowed to start. Nothing needs to be done here, all rules are already in place and properly functioning on all devices.

chris42 commented 4 years ago

O_o Nope sensors are not working on the test device after the "sensor is null" errors, "dumpsys sensorservice" gives a

No Sensors on the device
devInitCheck : 0

and using an app like Satstat gives no output of sensors.

After restarting the test device (and another crash reboot), I now get 34 sensors and readings in an app. So it seems the setup is working, however not very reliable.

chris42 commented 4 years ago

Checking the init setup you linked above, slpi waits for adsp to start, feedback is transmitted via vendor.qcom.adspup property But I would interpret

echo 1 > /sys/kernel/boot_adsp/boot
setprop vendor.qcom.adspup 1

as an asynchronous feedback to slpi. Hence if slpi needs adsp to have completed boot this won't do it, as the boot is triggered and then the property is changed. But it is not waited for the boot to finish. This could be an explanation on why this only shows up randomly. Possibly if little Kugo is busy with something else and adsp takes longer than usual, slpi fails. If so, maybe a little priority boost on adsp service would help? What do you think?

EDIT: Or of course I find a way to determine, if adsp really has completed

chris42 commented 4 years ago

Not sure about my timing theory, will test it either way.

However regarding the missing firmware, in every boot I get:

08-25 00:22:02.745     0     0 E         : can't get fw name.
08-25 00:22:02.745     0     0 E sensors-ssc soc: qcom,msm-ssc-sensors: slpi_load_fw: SLPI image loading failed

However the sensors are working. I could not find an obvious firmware or properties for slpi on the device. This seems to be handled in some other way or not play a role at all...

MarijnS95 commented 4 years ago

as an asynchronous feedback to slpi

You should should disable these services and try writing 1 manually from a shell. You'll notice this takes a while: the write to these boot attributes is blocking, hence adspup will not be set before the boot is complete.

Here's a random excerpt from one of my devices:

[    6.104716] init: starting service 'vendor.slpistart_sh'...
[    6.599902] init: Service 'vendor.slpistart_sh' (pid 686) exited with status 0
[    6.614484] init: processing action (vendor.qcom.slpiup=1) from (/vendor/etc/init/sensors.rc:27)

As you can see slpistart takes well over half a second to complete. Only after it completes is the property written, and the process exits. Now, yes this is a little different from your 4th.txt in that the property is obviously set before the script exits; which is why it processes the action barely before seeing the exited log.

However regarding the missing firmware, in every boot I get: However the sensors are working.

I am too unfamiliar with this part of the hardware; either the SLPI firmware is not necessary, or persisted in the hardware and thus only necessary if some EEPROM magically gets cleared/corrupted. I'll see if I can get some clarification.

chris42 commented 4 years ago

I started to check the whole manual thing following this start order:

  1. adsp echo 1 > /sys/kernel/boot_adsp/boot setprop vendor.qcom.adspup 1
  2. adsprpcd /odm/bin/adsprpcd
  3. slpistart echo 1 > /sys/kernel/boot_slpi/boot setprop vendor.qcom.slpiup 1
  4. sensors /odm/bin/sensors.qcom

While doing so, I always got an error on adsprpcd. So after a while I figured, that I always get an error (either way not accessible files or continuous restart of the daemon) when adb is connected. So I reverted everything manual and checked this during regular boot. And it is the same. If I connect adb early (so it is waiting for the device) adsprpcd produces errors. If I connect after the device came up, no problem. Not sure if this is interesting for anyone, however might have explained some of my troubles?

Regarding the SLPI firmware, that just stays the same and it is not found.

Also a minor: adsprpcd is started as media:media, the binary in odm/bin is root:shell (doen not matter as global x is set, however for other services this seems to be aligned)

MarijnS95 commented 4 years ago

While doing so, I always got an error on adsprpcd

That is to be expected, not all features supported by this binary are available or fully integrated. That does not mean it doesn't work at all, though. For example, this is what I see on a flawlessly working Bahamut (Xperia 5):

04-01 18:41:00.147  5535  5535 E adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/adsprpcd.c:29:adsp daemon starting
04-01 18:41:00.150  5535  5535 V adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1000: remote_handle_open: Successfully opened handle 0x7f for '":;./\attachguestos on domain 0
04-01 18:41:00.153  5535  5535 V adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1000: remote_handle_open: Successfully opened handle 0xf04b9f30 for adsp_default_listener on domain 0
04-01 18:41:00.155  5535  5535 V adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1000: remote_handle_open: Successfully opened handle 0x7 for '":;./\geteventfd on domain 0
04-01 18:41:00.159  5535  5536 E adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:729:Error 45: fopen failed for libadspmsgd_adsp_skel.so. (No such file or directory)
04-01 18:41:00.160  5535  5535 V adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1000: remote_handle_open: Successfully opened handle 0xf04ec910 for adspmsgd_adsp on domain 0

or continuous restart of the daemon

If however the logs keep looping, make sure the process isn't running twice (eg. stopped the vendor.adsprpcd service).

If I connect adb early (so it is waiting for the device) adsprpcd produces errors.

It is very unlikely for adb to have any sort of influence on adsprpcd. I won't be surprised if chatty has expired a bunch of lines, or if they are pushed out of the log buffer entirely due to other spam filling it up.

Regarding the SLPI firmware, that just stays the same and it is not found.

I'm not sure if this is necessary. If the firmware exists, it should be in /firmware or /odm/firmware, depending on the platform.

Also a minor: adsprpcd is started as media:media, the binary in odm/bin is root:shell (doen not matter as global x is set, however for other services this seems to be aligned)

File permissions determine who is allowed to access the files (in this case init starting adsprpcd), runtime permission determines as what user:group the executable is allowed to touch other files. adsprpcd likely has to open a bunch of devices/files that are available to media:media.

chris42 commented 4 years ago

The errors I get when I connect adb/usb are:

04-01 15:42:09.259  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for oemconfig.so. (No such file or directory)
04-01 15:42:09.261  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for remote_heap_config.so. (No such file or directory)
04-01 15:42:09.262  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for .//lowi//lowi_lp.conf. (No such file or directory)
04-01 15:42:09.265  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for testsig-0x558eac6.so. (No such file or directory)
04-01 15:42:09.266  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for testsig.so. (No such file or directory)
04-01 15:42:09.316  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_sm_ecns_1.so.1. (No such file or directory)
04-01 15:42:09.318  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_sm_ecns_2.so.1. (No such file or directory)
04-01 15:42:09.320  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_dm_ecns_1.so.1. (No such file or directory)
04-01 15:42:09.321  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_dm_ecns_2.so.1. (No such file or directory)
04-01 15:42:09.323  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_qm_ecns_1.so.1. (No such file or directory)
04-01 15:42:09.324  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for custom_qm_ecns_2.so.1. (No such file or directory)
04-01 15:42:09.326  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for voiceproc_tx.so. (No such file or directory)
04-01 15:42:09.328  7019  7021 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/apps_std_imp.c:728:Error 45: fopen failed for voiceproc_rx.so. (No such file or directory)

And even got this, but then the device at some point is not responding anymore and loops always back to this error:

04-01 18:15:14.864   743   743 D /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1638: Error 0xffffffff: apps_dev_init failed. domain 0, errno Operation not permitted
04-01 18:15:14.865   743   743 D /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:1722: Error ffffffff: open dev -1 for domain 0 failed
04-01 18:15:14.865   743   743 D /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/fastrpc_apps_user.c:858: Error 2c: remote handle invoke failed. domain 0, handle ffffffff, sc 0, pra 0x0
04-01 18:15:14.867   743   743 E /odm/bin/adsprpcd: vendor/qcom/proprietary/commonsys-intf/adsprpc/src/adsprpcd.c:43:adsp daemon will restart after 25ms...

In comparison when not having adb/usb connected:

08-26 03:40:07.543     0     0 I init    : starting service 'vendor.adsprpcd'...
08-26 03:40:07.561     0     0 I init    : starting service 'vendor.media.omx'...
08-26 03:40:07.588     0     0 I init    : starting service 'vendor.ipacm'...
08-26 03:40:07.607     0     0 I init    : starting service 'vendor.irsc_util'...
08-26 03:40:07.636     0     0 I init    : starting service 'macaddrsetup'...
08-26 03:40:07.666     0     0 I init    : starting service 'vendor.netmgrd'...
08-26 03:40:07.708     0     0 I init    : starting service 'vendor.qmuxd'...
08-26 03:40:07.719     0     0 I init    : starting service 'ril-daemon'...
08-26 03:40:07.729     0     0 I init    : starting service 'media.swcodec'...
08-26 03:40:07.741     0     0 I init    : Command 'class_start main' action=vold.decrypt=trigger_restart_framework (/init.rc:787) took 556ms and succeeded
08-26 03:40:07.751     0     0 I init    : starting service 'MfocAGoERwr'...
08-26 03:40:07.766     0     0 I init    : starting service 'gatekeeperd'...
08-26 03:40:07.785     0     0 I init    : starting service 'tombstoned'...
08-26 03:40:07.813     0     0 I init    : starting service 'usbd'...
08-26 03:40:07.829     0     0 I         : 'opened /dev/adsprpc-smd c 226 0'

I think I tried different boots ~10-15 times and it was consistent. Nevertheless, I think I used up your time enough already. Thanks a lot for all! Probably should just avoid the usb stuff during boot. Will hope that my kugo "with personality" will survive some time until I find a good replacement.