rpmpd(and maybe qcom-cpufreq-nvmem?) are broken on sdm632 - Githubissues

msm8953-mainline / linux

Linux mainline kernel with WIP patches for msm8953 devices

Other

111 stars 59 forks source link

rpmpd(and maybe qcom-cpufreq-nvmem?) are broken on sdm632 #44

Closed ultra-azu closed 5 months ago

ultra-azu commented 2 years ago

So @z3ntu and I noticed 5.18 didn't work on our sdm632 devices. We went and compared logs and we found that we had these lines in common in our stack traces:

4>[^@   0.482967] Call trace:
<4>[    0.482973]  of_get_next_available_child+0x98/0xc0
<4>[    0&482984]  _of_add_table_indexed+0x5f0/0x8e4
<4>[    0.482995]  dev_pm_opp_of_add_table_in$exed+0p14/0x20

However z3ntu at first got that from dsi and adreno drivers, while I did with rpmpd. However eventually we find out that disabling CPU OPPs seemed to get us to initramfs(although very slowly) at least. I change match_data in cpufreq_nvmem to kryo because of an oversight from my side that thought it would fix it. And it kinda did: cpufreq-nvmem didn't probe successfully but otherwise my ocean was usable in weston. However I noticed an error about "Not Snapdragon 820/821!" on my log so I reverted and decided to look deeper.

I noticed a line <3>[ 0.196112] cpu cpu0: Failed to add OPP name speed6-pvs0-v0. I later find out it's erroring out with -EPROBE_DEFER. A dependency with rpmpd perhaps? Would explain why it fails just before rpmpd panics. And yep, making rpmpd to sleep for 10 seconds made nvmem to probe deferral timeout, confirming my theory.

I went and look from the rpmpd side, and it seems the problem starts here. I did some printk debugging and it eventually led to here. But after that I can't get to figure it out. It doesn't help that that function is used quite a lot so I can't just pr_err away because my log gets filled out and slows the phone down.

Anyway, I also can't get the phone to boot fine again without nvmem(it hangs on initfs). I may have modified something else when I tried the kryo_match_table in nvmem and now I can't figure out what. I also think it's possible disabling CPU OPP works because the phone is too slow to get to rpmpd in the first place.

EDIT: @alikates Said on Matrix that this affects his sdm625 too. Maybe it isn't sdm632 exclusive? Full ramoops: ramoops.txt

alikates commented 2 years ago

While investigating the issue I found a code path that could lead to a null pointer dereference or similar in _add_opp_table_indexed. Maybe that's why we are having that kernel oops.

struct opp_table *_add_opp_table_indexed(struct device *dev, int index,
                     bool getclk)
{

...

    if (opp_table) {
        if (!_add_opp_dev(dev, opp_table)) {
            dev_pm_opp_put_opp_table(opp_table);
            opp_table = ERR_PTR(-ENOMEM);               <----
        }
            mutex_lock(&opp_table_lock);
    } else {
        opp_table = _allocate_opp_table(dev, index);

        mutex_lock(&opp_table_lock);
        if (!IS_ERR(opp_table))
            list_add(&opp_table->node, &opp_tables);
    }

    opp_tables_busy = false;

unlock:
    mutex_unlock(&opp_table_lock);

    return _update_opp_table_clk(dev, opp_table, getclk);     <---- (this functions tries to access a struct field of opp_table without checking if the pointer is invalid)
}

alikates commented 2 years ago

Nevermind, i didn't see the IS_ERR() in _update_opp_table_clk

alikates commented 2 years ago

Now i got something

/mnt/linux # cat trace.txt | CROSS_COMPILE=aarch64-alpine-linux-musl- ./scripts/decode_stacktrace.sh .output/vmlinux ./
[    1.618458] Call trace:
[    1.618461] _of_add_table_indexed (/mnt/linux/.output/../drivers/opp/of.c:373 /mnt/linux/.output/../drivers/opp/of.c:1053 /mnt/linux/.output/../drivers/opp/of.c:1148)
[    1.618472] dev_pm_opp_of_add_table_indexed (/mnt/linux/.output/../drivers/opp/of.c:1235) 
[    1.618481] of_genpd_add_provider_onecell (/mnt/linux/.output/../drivers/base/power/domain.c:2380)

Here's the issue: https://github.com/msm8953-mainline/linux/blob/b265c0627c38629719e2f472755b5a9b32603fb2/drivers/opp/of.c#L369-L375 The size of the array that's being indexed and required_opp_count don't match because when initializing the opp_table if there's no required_tables it doesn't set the value of required_opp_count.

Here's the code (put_np is at the end of the function): https://github.com/msm8953-mainline/linux/blob/b265c0627c38629719e2f472755b5a9b32603fb2/drivers/opp/of.c#L172-L183

alikates commented 2 years ago

Lol I found what the error was...

https://github.com/msm8953-mainline/linux/blob/b265c0627c38629719e2f472755b5a9b32603fb2/arch/arm64/boot/dts/qcom/msm8953.dtsi#L360-L362

Compatible has to be operating-points-v2. No wonder it was reading a null opp table...

I just tested it and its working...

alikates commented 2 years ago

Can you test changing the compatible and see if that's it?

z3ntu commented 2 years ago

According to docs, standalone operating-points-v2-kryo-cpu compatible is a legit use case (and also used on msm8996 & qcs404, but operating-points-v2-qcom-cpu isn't documented anywhere. Removing the latter doesn't change anything though.

Are you sure that the cpufreq driver probes correctly when you set compatible to operating-points-v2?

alikates commented 2 years ago

Okay right, it's failing with -ENOENT. But at least now it doesn't crash... [ 1.296503] qcom-cpufreq-nvmem: probe of qcom-cpufreq-nvmem failed with error -2

Maybe something's missing in the DT?

z3ntu commented 2 years ago

Yes, also commenting the nvmem-cells = <&cpu_speed_bin>; in dt makes probe fail and boot succeed.

I guess the bug lies in either b052d04f8c0f0860c292386be5d13bef664eefa6 (which looks quite hacky tbh and looks like it might cause stuff like this) or 664d2efa68fe1de0772d4acc253503de83543c87 and presumably doing things that shouldn't be done.

z3ntu commented 2 years ago

Temporarily worked around with https://github.com/msm8953-mainline/linux/commit/b265c0627c38629719e2f472755b5a9b32603fb2, I believe

alikates commented 1 year ago

Fixed in 6.0.10 branch, the compatible for the spmi regulator that cpr uses was wrong so it didn't probe

alikates commented 1 year ago

Can you test it in both sdm625 and 632?

alikates commented 1 year ago

Is this still relevant?