Closed ultra-azu closed 5 months ago
While investigating the issue, I found a code path in _add_opp_table_indexed that could lead to a null pointer dereference or similar. Maybe that's why we are getting that kernel oops.
struct opp_table *_add_opp_table_indexed(struct device *dev, int index,
                                         bool getclk)
{
        ...
        if (opp_table) {
                if (!_add_opp_dev(dev, opp_table)) {
                        dev_pm_opp_put_opp_table(opp_table);
                        opp_table = ERR_PTR(-ENOMEM); <----
                }
                mutex_lock(&opp_table_lock);
        } else {
                opp_table = _allocate_opp_table(dev, index);
                mutex_lock(&opp_table_lock);
                if (!IS_ERR(opp_table))
                        list_add(&opp_table->node, &opp_tables);
        }
        opp_tables_busy = false;
unlock:
        mutex_unlock(&opp_table_lock);
        return _update_opp_table_clk(dev, opp_table, getclk); <---- (this function tries to access a struct field of opp_table without checking whether the pointer is invalid)
}
Never mind, I didn't see the IS_ERR() check in _update_opp_table_clk.
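For anyone following along, here is a minimal userspace sketch of why an early IS_ERR() check makes that path safe. The helpers `err_ptr`, `is_err` and `ptr_err` are stand-ins written for this example that imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(); `struct table` and `update_table` are hypothetical and only mirror the shape of the call, not the real _update_opp_table_clk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top, never-mapped range of pointer values. */
static inline void *err_ptr(long err)
{
    assert(err < 0 && err >= -MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* True only for pointers that carry an encoded errno; NULL is not an error. */
static inline bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno that err_ptr() encoded. */
static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical, heavily reduced table: only the field this sketch touches. */
struct table {
    int clk_count;
};

/* Bail out before any field access, so an error pointer is never dereferenced. */
static struct table *update_table(struct table *t)
{
    if (is_err(t))
        return t;       /* propagate the encoded errno untouched */

    t->clk_count = 1;   /* only reached for a real object */
    return t;
}
```

Calling `update_table(err_ptr(-ENOMEM))` simply hands the error pointer back, and `ptr_err()` on the result yields `-ENOMEM`, which matches the case marked with the first arrow above.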
Now I've got something:
/mnt/linux # cat trace.txt | CROSS_COMPILE=aarch64-alpine-linux-musl- ./scripts/decode_stacktrace.sh .output/vmlinux ./
[ 1.618458] Call trace:
[ 1.618461] _of_add_table_indexed (/mnt/linux/.output/../drivers/opp/of.c:373 /mnt/linux/.output/../drivers/opp/of.c:1053 /mnt/linux/.output/../drivers/opp/of.c:1148)
[ 1.618472] dev_pm_opp_of_add_table_indexed (/mnt/linux/.output/../drivers/opp/of.c:1235)
[ 1.618481] of_genpd_add_provider_onecell (/mnt/linux/.output/../drivers/base/power/domain.c:2380)
Here's the issue: https://github.com/msm8953-mainline/linux/blob/b265c0627c38629719e2f472755b5a9b32603fb2/drivers/opp/of.c#L369-L375 The size of the array being indexed and required_opp_count don't match: when the opp_table is initialized and there are no required_tables, the value of required_opp_count is never set.
Here's the code (put_np is at the end of the function): https://github.com/msm8953-mainline/linux/blob/b265c0627c38629719e2f472755b5a9b32603fb2/drivers/opp/of.c#L172-L183
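To illustrate the kind of mismatch described above, here is a small sketch that keeps an array and its count in sync. It is not the kernel code: `struct table`, `table_init_required` and `table_find_required` are hypothetical names invented for this example, and the real fields live in struct opp_table.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reduction of an OPP table: an array plus its element count. */
struct table {
    void **required_tables;   /* NULL when nothing was allocated */
    int required_count;       /* must describe required_tables exactly */
};

/*
 * Set up `count` slots. Both fields are reset first and the count is only
 * published after the allocation succeeded, so they can never disagree.
 */
int table_init_required(struct table *t, int count)
{
    assert(t != NULL);

    t->required_tables = NULL;
    t->required_count = 0;

    if (count <= 0)
        return 0;   /* nothing required: leave the table empty */

    t->required_tables = calloc((size_t)count, sizeof(*t->required_tables));
    if (!t->required_tables)
        return -1;

    t->required_count = count;
    return 0;
}

/* Return the index of `needle`, or -1. Safe even on an empty table. */
int table_find_required(const struct table *t, const void *needle)
{
    assert(t != NULL);

    if (!t->required_tables)
        return -1;

    for (int i = 0; i < t->required_count; i++) {
        if (t->required_tables[i] == needle)
            return i;
    }
    return -1;
}
```

The point of the sketch is the ordering in the init function: if the count were set before an early exit that leaves the array unallocated, a later loop bounded by the count would index a NULL array, which is the failure mode suspected here.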
Lol I found what the error was... Compatible has to be operating-points-v2. No wonder it was reading a null opp table...
I just tested it and it's working...
Can you test changing the compatible and see if that's it?
According to the docs, a standalone operating-points-v2-kryo-cpu compatible is a legit use case (and is also used on msm8996 & qcs404), but operating-points-v2-qcom-cpu isn't documented anywhere. Removing the latter doesn't change anything though.
Are you sure that the cpufreq driver probes correctly when you set compatible to operating-points-v2?
Okay right, it's failing with -ENOENT. But at least now it doesn't crash...
[ 1.296503] qcom-cpufreq-nvmem: probe of qcom-cpufreq-nvmem failed with error -2
Maybe something's missing in the DT?
Yes, also commenting out the nvmem-cells = <&cpu_speed_bin>; line in the DT makes probe fail and boot succeed.
I guess the bug lies in either b052d04f8c0f0860c292386be5d13bef664eefa6 (which looks quite hacky tbh, and like it might cause stuff like this) or 664d2efa68fe1de0772d4acc253503de83543c87, which are presumably doing things that shouldn't be done.
Temporarily worked around with https://github.com/msm8953-mainline/linux/commit/b265c0627c38629719e2f472755b5a9b32603fb2, I believe
Fixed in the 6.0.10 branch: the compatible for the SPMI regulator that CPR uses was wrong, so it didn't probe.
Can you test it on both sdm625 and sdm632?
Is this still relevant?
So @z3ntu and I noticed 5.18 didn't work on our sdm632 devices. We went and compared logs and we found that we had these lines in common in our stack traces:
However, z3ntu at first got that from the dsi and adreno drivers, while I got it from rpmpd. Eventually we found out that disabling CPU OPPs seemed to get us to the initramfs at least (although very slowly). I changed match_data in cpufreq_nvmem to kryo, mistakenly thinking it would fix it. And it kind of did: cpufreq-nvmem didn't probe successfully, but otherwise my ocean was usable in weston. However, I noticed a "Not Snapdragon 820/821!" error in my log, so I reverted the change and decided to look deeper.
I noticed the line
<3>[ 0.196112] cpu cpu0: Failed to add OPP name speed6-pvs0-v0
and later found out that it was erroring out with -EPROBE_DEFER. A dependency on rpmpd, perhaps? That would explain why it fails just before rpmpd panics. And yes, making rpmpd sleep for 10 seconds made the nvmem probe deferral time out, confirming my theory.
I then looked at the rpmpd side, and it seems the problem starts here. Some printk debugging eventually led me to here, but after that I can't figure it out. It doesn't help that that function is called quite a lot, so I can't just pr_err away: my log fills up and slows the phone down.
Anyway, I also can't get the phone to boot properly again without nvmem (it hangs on initfs). I may have modified something else when I tried the kryo_match_table in nvmem, and now I can't figure out what. I also think it's possible that disabling CPU OPPs works because the phone is too slow to get to rpmpd in the first place.
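On the log flooding: the kernel has rate-limited variants such as pr_err_ratelimited() for hot paths like this. The underlying idea can be sketched in plain userspace C as below. `struct ratelimit` and `ratelimit_allow` are hypothetical names made up for this example, and the tick values are whatever monotonic counter the caller supplies; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter state: at most `burst` messages per `interval` ticks. */
struct ratelimit {
    long interval;       /* window length in ticks, must be > 0 */
    int burst;           /* messages allowed per window, must be > 0 */
    long window_start;   /* tick at which the current window began */
    int printed;         /* messages let through in this window */
    int missed;          /* messages suppressed so far */
};

/* Returns true if the caller may log at time `now` (a monotonic tick count). */
bool ratelimit_allow(struct ratelimit *rl, long now)
{
    assert(rl->interval > 0 && rl->burst > 0);

    if (now - rl->window_start >= rl->interval) {
        /* A new window has started: reset the budget. */
        rl->window_start = now;
        rl->printed = 0;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }

    rl->missed++;   /* remember how many messages were dropped */
    return false;
}
```

A debug print in a frequently called function would then be wrapped as `if (ratelimit_allow(&rl, now)) { ... }`, so only the first few hits per window reach the log.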
EDIT: @alikates said on Matrix that this affects his sdm625 too. Maybe it isn't sdm632-exclusive? Full ramoops: ramoops.txt