topjohnwu / Magisk

The Magic Mask for Android
GNU General Public License v3.0
47.09k stars 11.91k forks source link

Boot delay of 60s caused by Magisk #7511

Open danielm234 opened 10 months ago

danielm234 commented 10 months ago

[ 70.503041]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 timing out command, waited 60s [ 70.503083]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=60s [ 70.503098]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 Sense Key : 0x7 [current] [ 70.503110]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 ASC=0x27 ASCQ=0x1 [ 70.503124]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 CDB: opcode=0x2a 2a 00 00 00 00 28 00 00 01 00 [ 70.503138]I[6: swapper/6: 0] blk_update_request: I/O error, dev sdd, sector 320 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0 [ 70.503153]I[6: swapper/6: 0] Buffer I/O error on dev sdd1, logical block 34, lost async page write [ 70.503395] [6: init: 1] JBD2: recovery failed [ 70.503420] [6: init: 1] EXT4-fs (sdd1): error loading journal

last_kmsg_without_magisk.log
last_kmsg_magisk_patched.log

Device: a53x Android version: Android 13 Magisk version name: Magisk v26.4 Magisk version code: 26401 (568c5187)

Update, new logs with the latest magisk debug version: Device: a53x Android version: Android 13 Magisk version name: Magisk debug latest Magisk version code: 26401 (568c5187)

magisk_log_2023-11-15T20.15.34.log proc_mounts.log dmesg.log last_kmsg.log logcat_a53_android_13_latest_debug_magisk.log

mark7717 commented 10 months ago

Good catch! I was experiencing the same issue, but couldn't find what it was. Hope it gets fixed soon.

yashikada commented 10 months ago

Please try the latest build and see whether it'll solve your problem or not. https://github.com/topjohnwu/Magisk/actions/workflows/build.yml

it test already the latest action. Are you a bot? You write always the same message.

Unknown78 commented 10 months ago

Sorry, I misread the version code. No, just a normal volunteer in a free time. The latest build was able to solve my issue, so I hope others too.

mir4452 commented 10 months ago

I can confirm same issue. I've tested it on Android 13, Magisk v24.3, A53 device.

danielm234 commented 10 months ago

It seems related to Android 13, as I've not been able to reproduce the issue on Android 12.
Same device

yujincheng08 commented 10 months ago
[   70.505086] [6:           init:    1] init: Service 'exec 18 (/system/bin/recovery-persist)' (pid 738) exited with status 0 oneshot service took 60.033001 seconds in background
[   70.505104] [6:           init:    1] init: Sending signal 9 to service 'exec 18 (/system/bin/recovery-persist)' (pid 738) process group...
[   70.505284] [6:           init:    1] libprocessgroup: Successfully killed process cgroup uid 1000 pid 738 in 0ms
[   70.505500] [6:           init:    1] init: Service 'exec 19 (/system/bin/sh -c profman \
[   70.505508] [6:           init:    1] init:         --generate-test-profile=/data/misc/speg/test.prof --generate-test-profile-num-dex=0 && \
[   70.505513] [6:           init:    1] init:         test `head -c 8 /data/misc/speg/test.prof | tr -d '\0'` == `/apex/com.samsung.android.spqr/bin/spqr --profile-version || \
[   70.505519] [6:           init:    1] init:         echo bad` || setprop com.samsung.speg.disable true; rm /data/misc/speg/test.prof)' (pid 739) exited with status 0 oneshot service took 60.030998 seconds in background
[   70.505530] [6:           init:    1] init: Sending signal 9 to service 'exec 19 (/system/bin/sh -c profman \
[   70.505535] [6:           init:    1] init:         --generate-test-profile=/data/misc/speg/test.prof --generate-test-profile-num-dex=0 && \
[   70.505540] [6:           init:    1] init:         test `head -c 8 /data/misc/speg/test.prof | tr -d '\0'` == `/apex/com.samsung.android.spqr/bin/spqr --profile-version || \
[   70.505545] [6:           init:    1] init:         echo bad` || setprop com.samsung.speg.disable true; rm /data/misc/speg/test.prof)' (pid 739) process group...
yujincheng08 commented 10 months ago
  1. You are not using debug Magisk; I cannot see any debug log in your kmsg.
  2. Are you sure it won't block for 60s without Magisk? Looks like it's caused by a data corruption, not by Magisk.
TotallyAvailable commented 10 months ago

3 seperate instances of Samsung specific Android 13 related Data corruption on a specific device line, exposed/called out by chance due to not providing debug level logs ?

Given that I'm reading v24.3 being affected as well, you might not even need detailed logs (besides a case specific work around?), that might just be spot on depending on when it started ? Might be interesting IF that's the case.

Just me being curious and of no use solving this one of course but... @mir4452 v24.3 ? What timeline of discovery are we talking about here ?

(If that's too off topic I'll shut up)

Unknown78 commented 10 months ago

Given that it may affect across multiple versions and carriers/models, which may have different ROM or kernel or partition table, or others. Logs needs to be detailed as possible to get the proper solutions.

@TotallyAvailable @yashikada @mark7717 @mir4452 @danielm234 Please follow his guide and be patience.

Let me explain it if anyone need a guide to log. Please reinstall Magisk, both App and kernel/boot with app-debug.apk. Within Magisk Build workflows, there's the currently latest artifact, download it and extract the .zip to get the .apk. It has different signature, so uninstall any existing Magisk App beforehand.

On the Desktop, with the Phone USB Drivers installed and ADB from Android SDK Platform Tools. Also on the Phone, with USB Debugging enabled and authorized/allowed the desktop, we're able to obtain the logs:

  1. Turn off the phone.
  2. On the desktop. open terminal/shell/cmd
  3. Change directory to the extracted platform-tools folder.
  4. Enter adb logcat > logcat.log
  5. Connect the phone to desktop
  6. Turn on the phone, wait till home shows up + 60 seconds,
  7. If you're able to open Magisk App, then go to Logs, tap the diskette icon. Get magisk_log_dateTtime.log in your internal storage Download folder.
  8. Back to desktop, press CTRL+C.
  9. If earlier you're not able to open Magisk App, enter adb shell su -c cat /cache/magisk.log > magisk.log, otherwise skip this step.
  10. Enter adb shell su -c dmesg > dmesg.log
  11. Enter adb shell su -c cat /proc/mounts > proc_mounts.log
  12. Restart the phone, wait till power on, and then enter adb wait-for-device shell su -c cat /proc/last_kmsg > last_kmsg.log.

Attach all the log files here.

P.S. Please correct me if I miss something.

danielm234 commented 10 months ago

@Unknown78 , @topjohnwu here are all the logs: Device: a53x Latest debug magisk version for kernel and app

last_kmsg.log logcat_a53_android_13_latest_debug_magisk.log magisk_log_2023-11-15T20.15.34.log proc_mounts.log dmesg.log

danielm234 commented 10 months ago
  1. You are not using debug Magisk; I cannot see any debug log in your kmsg.
  2. Are you sure it won't block for 60s without Magisk? Looks like it's caused by a data corruption, not by Magisk.
  1. Sent new logs
  2. @yujincheng08 , yes, I'm sure 100%, I was flashing a boot image without magisk and it works fine, as soon as I flash magisk patched boot, it produces 60 sec delay, and again when I flash non-patched boot image, the delay dissappears.

If needed, we can arrange a TeamViewer session, so I can show you.

TotallyAvailable commented 10 months ago

Now just imagine if the bot, instead of just turning people away with a message that even IF (big if) people follow the advice of only trusting stuff within this repo wouldn't get them very far, would instead link to a nicely written and understandable guide starting with something as basic as LITERALLY just linking to the legit debug APK they are being told to use...even better, start with the Canary as simple upgrade and see if that's all it takes before telling them to do stuff they were told not to do, ever.

(If you've been out there searching for literally anything Magisk related you'll know what I'm after.)

Even assuming you, by chance end up in the right repo..they might even look into that .json or of course go right for the Canary branch like the guide told them to, to avoid instant bot closure...wait a minute. You know how it works, I do, the devs know. For many Magisk users the journey literally ends at the last dot of the official Guide and with a working Device back on the Home Screen. People can, if you give them a bit of guidance. Shit breaks ? Just look at the history of repeated comment-less closed Issues. And those that were never opened.

I absolutely get the bot solution, seen how it ends without often enough. If you don't want to repeat everything every time...the bot is already there, so are people with the knowledge for a nice, welcoming "first aid" guide that could be part of what the bot throws at the people desperately looking for guidance, at the place you WANT them to be.

Want fast & effective problem solving? Don't hope that someone at some point jumps in and takes the time to ask the right questions and provides the needed guidance for the 100th time to get the logs fast enough before the post ends up closed in the graveyard. Bot -> Guide -> Definitely getting those Logs -> actual fixes -> less silent suffering. 'no log/no fix' literally doesn't make the issue go away is what I'm trying to point out.

Bit of a rant that'll get labeled as off topic (rightfully so, honestly) (or even removed) anyway, so be it.

Given the power Magisk holds, a "not my issue if you drown" is not what I want to believe in or stand behind. Not that I'm of any significance anyway of course.

Referring to official statements (like the v25 related changes) and the "forever" open source promise you'd think...that the official place shouldn't be the last one you'd go looking for advice. Those totally official!!1! sites might even offer you a warm coffee, gladly helping you out with that juicy 'allsolving-signature-mismatched' "debug" build. ☠️

If anyone still ends up reading this by chance, have a nice day, no need to even agree with any of what I just said. Also NOT targeted at anyone specific, feel free to see it as suggestion or feedback or just nuke it of your repo, your decision.

(How this rant came to be ? @Unknown78 jumping in, saving this Issue from closure, like you know...that other one we just talked about.)

danielm234 commented 10 months ago

Unfortunately, my issue was closed two times before this; I was quite close to giving up. But Magisk is such an amazing tool, and I appreciate all your hard work.

mir4452 commented 10 months ago

Given that I'm reading v24.3 being affected as well, you might not even need detailed logs (besides a case specific work around?), that might just be spot on depending on when it started ? Might be interesting IF that's the case.

Just tested it on Magisk v24.0 / Android 13, delay of 60 seconds still reproduces. But on Android 12 there is no issue, it happens only on Android 13

If someone can provide the Magisk v24 debug apk, I can provide correct logs

TotallyAvailable commented 10 months ago

Wasn't at all meant as "don't bother nothing to see here" just the chance of overlap including a version that old, seemed interesting. Magisk tapping into something Samsung broke (A12->A13) and didn't bother fixing as it won't be used anyway ? Important bits being dropped during setup ? Still resulting in successful boot...

Part of my rant above, the wall you need to climb to report a "non catastrophic failure" instead of just dealing with it until someone else reports it.

Just hop in, read a short guide, drop a log (ideally without fking with signatures) and take the burden away from people that do not possess the knowledge to figure out what's wrong, don't know that it's not supposed to be like this, don't want to create a (site-specific) account or install something that might just break their daily driver for (in the end potentially) nothing.

As for legit v24 Debug https://github.com/topjohnwu/magisk-files/commit/74c558ef781408c81af9628d6735ffbba2294189 is a old as it gets. How things were handled before ? Beyond my knowledge (and time of Magisk usage)

Not even telling you to go for it and risk breaking whatever surprise this build might come with, haven't even checked if the file still exists.

To round things off why I won't, while linking the literal verifiably original commit ? And the reference to v25 ? I'll be that guy and https://github.com/topjohnwu/Magisk/issues/7413#issuecomment-1759557732 , https://github.com/topjohnwu/Magisk/releases/tag/v25.0. If you (anyone reading this) weren't around at the time (and please ignore the bot closing the issue) I highly recommend keeping those things of CRITICAL importance in mind, always. Think first, verify where you are, proceed if legit.

Unknown78 commented 10 months ago

In my past issue of https://github.com/topjohnwu/Magisk/issues/6257, the stable build causes stuck at App logo, while the debug enables it to not. Some irl happened and I wasn't able to provide the logs at the time and the issue was closed as not enough info. But after a few months later after I have some free time, the issue was reopened, by trying requested builds and providing logs, the issue was found out to be because of old kernel, https://github.com/topjohnwu/Magisk/issues/6257#issuecomment-1368967265, and fix was able to be found. Just be patience. If you are that paranoid, you could do full nandroid backup, wipe or factory reset, remove the sim card and external sd card, turn off every networks i.e. wifi/bluetooth, fake/mock your gps location, backup and wipe/corrupt your efs/imei, and then try the build. It is possible if you need that much security..

TotallyAvailable commented 10 months ago

Don't get used to someone telling you to "just install this one build from somewhere out of plain sight, follow these specific steps and ignore anything else" is what the reminder is about.

We are here at the right and (I hope) trustable place, the build mentioned above is legit.

Just like many of the commits to Magisk are far beyond my understanding - that's where Trust comes in. Not in a malicious way, (for me personally) in a "won't break my device when I least need it". I do like to ride the bleeding edge and have avoided complete disaster so far...what I intended to say is...now there's the groundwork done on how to do it.

The person you just explained it to (or someone that just followed along) trusts a given site (that is not GitHub) more than you do ? maybe even tried it here first again, but the bot closed the issue before someone helped out ? "Follow these steps and ignore anything else just like last time". If those Instructions look familiar and the device doesn't immediately combust after the last step, and actually even works again, how likely is a "I'll take that" with no further thoughts given. (Bonus points if you managed to sneak in CorePatch at some point along the way)

(That's really gonna be it from me for now. Don't want to test the limits of (off topic) tolerance. If you've made it this far... no 🏆 for you unfortunately 🥺)

danielm234 commented 10 months ago

Update @topjohnwu @yujincheng08 @TotallyAvailable @Unknown78

[ 70.257910] [6: init: 1] init: Command 'mount ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs nosuid nodev noatime noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic' action=post-fs-data (/vendor/etc/init/init.baseband.rc:13) took 60007ms and failed: mount() failed: I/O error

adb shell su mount -o rw,remount /vendor rm /vendor/etc/init/init.baseband.rc

I've removed the above file, and there are no more delays on the boot. Of course, this is not a fix, but at least it points to a possible problem.

init.baseband.rc is exactly same file on Android 12 and Android 13, but on android 13 with Magisk it causes 60 sec delay init.baseband.rc.log

danielm234 commented 10 months ago

Since in /vendor/etc/fstab.s5e8825 there is already a following line

/dev/block/by-name/cpefs /mnt/vendor/cpefs ext4 noatime,nosuid,nodev,noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic wait,check,nofail

Is it really necessary to have a similar line in init.baseband.rc (because the bottom line makes a 60s delay with magisk)

mkdir /mnt/vendor/cpefs 0771 radio system
mount ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs nosuid nodev noatime noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic
TotallyAvailable commented 10 months ago

One could argue that the line (not) being there doesn't matter, assuming you've used the device before, just dealing with the delay.

Who put it there ? Why is it there ? The list would go on, except I just hit Google with "Android cpefs" ... interesting.

danielm234 commented 10 months ago

Device wasn't used before, its stock android without any changes,and with/without magisk patch

Output when trying to execute this command: mount -t ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs -o nosuid,nodev,noatime,noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic

'/dev/block/by-name/cpefs' is read-only 60seconds delay happens mount: '/dev/block/by-name/cpefs'->'/mnt/vendor/cpefs': I/O error

If I flash the boot image without Magisk, this error won't happen.

Update: On android 12 /dev/block/by-name/cpefs is RW on android 13 /dev/block/by-name/cpefs is RO

TotallyAvailable commented 10 months ago

If it wasn't for the fact that it appears on v24.3 as well, the narrowing down would've been somewhat easier. The Device isn't brand new either. A long known (in Custom Recovery Circles) Samsung related issue popping into existence here. Android 13 has apparently been out for the A53 line for a while as well, so

Output when trying to execute this command: mount -t ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs -o nosuid,nodev,noatime,noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic

Just a thought - what if you add " wait,check,nofail" to the end of that command like in the fstab entry, assuming that one of the 2 has to actually work.

Maybe even add it to the baseband.rc entry instead of deleting that line completely and reboot...what could possibly go wrong here.

And the RO only seems given: took 60007ms and failed: mount() What changes, that shouldn't be there. And been in discussions since A8. (I'm far beyond my expertise anyway but sometimes you'll fall over the answer, walking through the dark.)

You know the feeling missing the most obvious? "Old" Magisk Version "Old" Device "Old" OS Actual old issue Why now It's not going to be the November Security Patch is it ¯\_(ツ)_/¯

(Totally not this Issue popping up in Google while searching for "jbd2 recovery failed", I'm sure I'll help myself with that one) Actual interesting read while waiting: https://android.googlesource.com/kernel/msm/+/android-msm-3.9-usb-and-mmc-hacks/fs/jbd2/journal.c Quite old but with those lovely comments as guidance you'll make your way

Anyway probably time to just wait for someone with the required knowledge.

danielm234 commented 10 months ago

Just a thought - what if you add " wait,check,nofail" to the end of that command like in the fstab entry, assuming that one of the 2 has to actually work.

Maybe even add it to the baseband.rc entry instead of deleting that line completely and reboot...what could possibly go wrong here.

This actually removes 60s delay, but the cpefs mount still doesn't work, and it is not accessible in /mnt/vendor/cpefs

It's not going to be the November Security Patch is it ¯_(ツ)_/¯

Just tried the older Android 13 version, same behaviour, so its not latest security patch

TotallyAvailable commented 10 months ago

While progress, there's obviously more going on. The wrong recent Security Patch guess doesn't help either. While still left with the "Why now" question, my lack of insight into the Samsung specific rooting stuff certainly won't carry this discussion any further I feel. While the errors in the logs are clearly visible, looking at the stuff my fully functional device spits out...pretty ugly as well.

I still want that one more step though, just piecing together what was already mentioned. Ignoring the "same file in A12&13", going 1 step back

mkdir /mnt/vendor/cpefs 0771 radio system

(This bit right here. As A13 returns -> read only, 60s stuck until timeout, returning mount() fail... Can't confirm a successful mount if the expected writing check to it fails ?)

Still doesn't answer the initial question. And while I highly doubt it looking at a certain init.rc file and first_stage_init/mount (never look at the most obvious stuff first), that whole /fs_mgr/ directory sure got me (personally) a bit deeper into it, driven by curiosity and the earlier mentioned "falling over the answer...", so maybe next time. (As actual final edit on this one - it's not even the lack of familiarity and actual understanding with the codebases involved, there are plenty of "trigger points", without a similar enough device on hand though, massive skill point decrease. Access at the most out of the blue moment when it hit me ? Instant solve, maybe...if that's your way of making it through life as well, you'll know the feeling. Also as off topic as it will get. Yes I know how discussions here usually progress towards a solved Issue.)

// If relative_timeout is std::chrono::milliseconds::max(), then the wait will block indefinitely.

Can apparently always get worse than "just" 60s And we haven't even touched the whole "/apex/com.samsung.android.spqr/bin/spqr --profile-version" & calling for rm on /data/misc/speg/test.pro, hints.

And with no one else jumping in on possible causes or solutions they found... that's not really something I'd like to do either, knowing how much it sucks to see stuff potentially end without a (quick) solution. Given the right Version to avoid bot closure + debug logs in the first post, there's obviously hope.

danielm234 commented 10 months ago

mkdir /mnt/vendor/cpefs 0771 radio system mount ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs nosuid nodev noatime noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic

Should this 'cpefs' mount be visible in /proc/mounts? I cannot find it in /proc/mounts with a patched boot image, and not even with a non-patched boot image.

This could mean that the following line can be easily removed from init.baseband.rc ?

mount ext4 /dev/block/by-name/cpefs /mnt/vendor/cpefs nosuid nodev noatime noauto_da_alloc,discard,journal_checksum,data=ordered,errors=panic

cpefs doesn't ever get mounted on a stock OS, with stock boot. With magisk patched boot it makes the boot delay of 60s, it seems that it can deleted from init.baseband.rc to avoid 60s delay.

TotallyAvailable commented 10 months ago

So It works on A12 Given that the file has apparently been the same since atleast A12 A13 and the device have been around for a while Known OEM and environment The involved Magisks versions not brand new either Adding those arguments to the end of the command "works"

Where are all the other users with similar problems ? Does the device actually work, including all that radio related stuff ? Why the 60s delay and only once ?

There's the whole thought about "stuff no longer available when rooted" playing into it of course, given that you come across vbmeta, dm-verity and selinux pretty early on. Which parts of the command might be removed to not end with a recovery or journal error, does it even matter... probably because someone left it there...

Skizzels commented 4 months ago

Flashing+Instructions+for+MAC.pdf build.gradle.txt managing-github-actions-settings-for-a-repository.md

Skizzels commented 4 months ago

[ 70.503041]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 timing out command, waited 60s [ 70.503083]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=60s [ 70.503098]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 Sense Key : 0x7 [current] [ 70.503110]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 ASC=0x27 ASCQ=0x1 [ 70.503124]I[6: swapper/6: 0] sd 0:0:0:3: [sdd] tag#2 CDB: opcode=0x2a 2a 00 00 00 00 28 00 00 01 00 [ 70.503138]I[6: swapper/6: 0] blk_update_request: I/O error, dev sdd, sector 320 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0 [ 70.503153]I[6: swapper/6: 0] Buffer I/O error on dev sdd1, logical block 34, lost async page write [ 70.503395] [6: init: 1] JBD2: recovery failed [ 70.503420] [6: init: 1] EXT4-fs (sdd1): error loading journal

last_kmsg_without_magisk.log last_kmsg_magisk_patched.log

Device: a53x Android version: Android 13 Magisk version name: Magisk v26.4 Magisk version code: 26401 (568c5187)

Update, new logs with the latest magisk debug version: Device: a53x Android version: Android 13 Magisk version name: Magisk debug latest Magisk version code: 26401 (568c5187)

magisk_log_2023-11-15T20.15.34.log proc_mounts.log dmesg.log last_kmsg.log logcat_a53_android_13_latest_debug_magisk.log

Yes