the-modem-distro / pinephone_modem_sdk

Pinephone Modem SDK: Tools to build your own bootloader, kernel and rootfs
GNU General Public License v3.0

Sometimes modem disappears partially [mobian] #9

Closed kkeijzer closed 3 years ago

kkeijzer commented 3 years ago

Sometimes the modem disappears partially. gnome-control-center will say "No wireless / QMI device found", ip a will still show a wwan0 device, but with no IP addresses or routing tables, and gnome-calls will state that there is no voice-capable modem detected.

This can happen after resuming from deep sleep, but sometimes also while the phone is awake. (I sometimes wake up finding the modem "broke" overnight, while the phone was charging and not going to deep sleep.)

The modem is still detected with lsusb in this state; no different from normal operation:

Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 008: ID 2c7c:0125 Quectel Wireless Solutions Co., Ltd. EC25 LTE modem
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

The only way to get the modem to work again, is by running adb shell reboot from the PinePhone. (adb reboot does not work.)

Attached are openqti.log and dmesg of both the modem and the PinePhone.

I can't see anything out of the ordinary. Do you have any suggestions of where I should look the next time this happens?
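The "partial" state described above (modem still on USB, wwan0 present but without addresses) can be told apart from a fully missing modem with a small check. This is only a sketch: the function names are made up for illustration, and it treats any inet or inet6 line, including a link-local one, as "has an address".

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of any tool in this thread.

# Succeeds if lsusb output on stdin lists the Quectel modem (ID 2c7c:0125).
modem_on_usb() {
    grep -q 'ID 2c7c:0125'
}

# Succeeds if `ip -o addr show` output on stdin contains any inet/inet6 address.
iface_has_address() {
    grep -Eq '[[:space:]]inet6?[[:space:]]'
}

# $1: captured lsusb output, $2: captured `ip -o addr show dev wwan0` output.
# Prints one of: usb-gone, partial, ok.
classify_modem_state() {
    if ! printf '%s\n' "$1" | modem_on_usb; then
        echo "usb-gone"
    elif ! printf '%s\n' "$2" | iface_has_address; then
        echo "partial"
    else
        echo "ok"
    fi
}
```

On the phone it could be called as `classify_modem_state "$(lsusb)" "$(ip -o addr show dev wwan0 2>/dev/null)"`, for example from a cron job that logs the state change together with a timestamp.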

Biktorgj commented 3 years ago

There is a race condition between eg25manager and ModemManager in Mobian and, more generally, in all distros using a ModemManager that hasn't been updated with Dylan's patches for suspend/resume. ModemManager disconnects the modem on suspend and reprobes it on resume, which takes about 10-15 seconds. If that takes a bit longer than expected, eg25manager ends up unbinding and rebinding the modem's USB port, making ModemManager segfault and restart in the middle of the probe.

All that would just be annoying; the real problem comes from QMI. When ModemManager connects, it registers itself as a QMI client. The modem supports, if I recall, about 20-25 concurrent clients. But when ModemManager dies or is unbound, that client is not released from the modem, which ends up rejecting new clients once it's full, forcing you to restart...

I am waiting for Mobian to update ModemManager with those patches (Dylan's patches were ready to be merged into ModemManager last time I checked).
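To make the client exhaustion concrete, here is a toy model of it. It does not talk to a modem; the function name is invented for illustration, and the capacity of 20 used in the example is an assumption taken from the "about 20-25" estimate above.

```shell
#!/bin/bash
# Toy model only. $1: client slots the modem offers, $2: number of
# connect cycles, $3: 1 if the client is released cleanly each cycle, 0 if it leaks.
simulate_clients() {
    capacity="$1"; cycles="$2"; releases="$3"
    used=0
    i=1
    while [ "$i" -le "$cycles" ]; do
        if [ "$used" -ge "$capacity" ]; then
            echo "rejected at cycle $i"
            return 0
        fi
        used=$((used + 1))              # a new client registers
        if [ "$releases" -eq 1 ]; then
            used=$((used - 1))          # client is released before going away
        fi
        i=$((i + 1))
    done
    echo "ok after $cycles cycles ($used slots in use)"
}
```

With leaking clients, `simulate_clients 20 30 0` prints `rejected at cycle 21`; with clean releases, `simulate_clients 20 30 1` prints `ok after 30 cycles (0 slots in use)`.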

DylanVanAssche commented 3 years ago

@Biktorgj My patches are shipped as the modemmanager-git package.

kkeijzer commented 3 years ago

@DylanVanAssche Where can I find this package? Is a deb being built? Because I can't find it in the repositories.

Biktorgj commented 3 years ago

You'll probably need to switch to the unstable branch (https://blog.mobian-project.org/posts/2021/03/15/unstable-distro/) and then simply apt install modemmanager-git

I'm currently testing the package too

kkeijzer commented 3 years ago

Is it working for you? Voice calls and sms work for me, but data does not.

I keep getting

NetworkManager[601]: <warn>  [1622413024.3334] modem-broadband[cdc-wdm0]: failed to connect modem: Couldn't bind mux data port: QMI protocol error

when I try to enable the mobile connection.

I have installed these packages:

libmbim-glib4-git libmbim-proxy-git libmm-glib0-git libqmi-glib5-git libqmi-proxy-git libqrtr-glib0 modemmanager-git

Am I missing anything else?

Biktorgj commented 3 years ago

Not as far as I can see, and it doesn't work for me either in factory firmware or custom, so I guess it's missing some polishing... (works perfectly fine in pmOS)

PsychoGame commented 3 years ago

Looks like Manjaro ARM Plasma Mobile is also affected by this, although I don't know if it's really a modem issue. My data worked fine until ModemManager got updated. But I'll try the stock modem again to see if data works there, so I can tell whether it's modem related or just upstream package trouble.

Biktorgj commented 3 years ago

Looks like an issue on my side. Still investigating, though it looks like ModemManager now tries to bind to some port and is unable to... Either there has been some change in libqmi or ModemManager now does things differently.

Biktorgj commented 3 years ago

I just posted a pre-release that might mitigate the issue described here (if you are using ModemManager in Mobian, not modemmanager-git).

If you're still running Mobian you can give it a try: https://github.com/Biktorgj/pinephone_modem_sdk/releases/tag/0.2.4 It won't make the modem recover any faster on resume, but should hopefully make it not need a reboot after a bunch of suspend-resume cycles

kkeijzer commented 3 years ago

To be honest, this makes things worse. Resuming from suspend, I now never have internet. When using IPV4V6 as the PDP type, I get two IPv6 addresses on the interface, none of them working. When I use IP as the PDP type, I get no IP address at all.

After the modem is woken up, NetworkManager seems to hang on every interaction. Running something like nmcli connection up Default will just time out.

Biktorgj commented 3 years ago

Interesting, works perfectly fine in postmarketOS. Could you share a full dmesg and the contents of /var/log/openqti.log?

Edit: also, are you running modemmanager-git or modemmanager?

kkeijzer commented 3 years ago

I'm running modemmanager on Mobian unstable. modemmanager-git does not work at all. Or does it with this release?

Biktorgj commented 3 years ago

This won't fix anything with modemmanager-git since it's badly broken. I just wanted to know your setup so I can try to replicate it when I get some wifi to download the latest Mobian again.

If you have a spare sd, it'd be interesting if you could run a live postmarketOS edge/phosh distro to check if your IPV4/6 issues are fixed, since pmOS is bundling a much newer modemmanager version.

In any case, I'd need to see what is going on with the firmware to know where to touch next :)

kkeijzer commented 3 years ago

dmesg.txt openqti.log modemmanager.log

Biktorgj commented 3 years ago

Okay, that makes sense I guess. I wasn't tracking the release requests from the host and, as in Mobian ModemManager releases itself prior to suspend, when reattaching to the modem it was kicking it out only to reconnect, which made matters worse.

Let's see if this fixes it... This is only a rootfs image, please flash it on top of your system partition, adb shell reboot; fastboot oem stay; fastboot flash system rootfs-mdm9607.ubi; fastboot reboot should be enough

Let me know if this works for you mobian_test.tar.gz

Added some more logging to this image, so if it still doesn't work, please get me a copy of openqti.log so I can see what's going on with the registered clients ;)
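The flashing sequence from this comment can be wrapped so that it stops at the first failing step. This is a sketch only: `run`, `flash_modem_system` and the `DRY_RUN` switch are illustrative helpers, not part of adb or fastboot, and it defaults to printing the commands instead of executing them.

```shell
#!/bin/bash
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: path to the rootfs image, e.g. rootfs-mdm9607.ubi
flash_modem_system() {
    image="$1"
    if [ "$DRY_RUN" != "1" ] && [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi
    # Same commands, in the same order, as in the comment above.
    run adb shell reboot &&
    run fastboot oem stay &&
    run fastboot flash system "$image" &&
    run fastboot reboot
}
```

Calling `flash_modem_system rootfs-mdm9607.ubi` with the default settings just lists the four commands, which is a cheap way to double-check the image name before running it for real with `DRY_RUN=0`.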

kkeijzer commented 3 years ago

My first impression is that this seems to be working. I resumed from suspend and I have a working connection now.

kkeijzer commented 3 years ago

adb, however, doesn't seem to work any more:

kevin@pinephone:~$  adb devices
List of devices attached

kkeijzer commented 3 years ago

Hmm, never mind. Rebooting to fastboot and then rebooting back seems to fix it. Maybe it was just a glitch.

I'll keep on testing. So far it seems to be working well. Let's hope it stays connected without having to be rebooted every now and then.

Biktorgj commented 3 years ago

Found one more possible bug from ModemManager that I hadn't accounted for:

On resume with that version of ModemManager, if you suspend too quickly after wake up, it can start saturating the modem again. Here's a newer version with a fix for that: if OpenQTI finds the ADSP is handling more than 32 clients, it will kill all of them and start from zero. (When doing a lot of quick suspend-resume cycles I saw the client count go up to more than 200 concurrent connections. It's not normal to behave like that, but maybe some users have suspend set to 1 second or something like that.) mobian_test2.tar.gz
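The "more than 32 clients, drop everything" rule can be pictured with a toy model. This is not OpenQTI's code; the function names are invented, and the only number taken from the comment above is the threshold of 32.

```shell
#!/bin/bash
# Toy model of the policy described above; it does not touch a modem.
MAX_CLIENTS=32   # threshold quoted in the comment above

# $1: current client count. Prints the count to carry forward:
# 0 if the limit was exceeded (all clients dropped), otherwise unchanged.
apply_client_limit() {
    if [ "$1" -gt "$MAX_CLIENTS" ]; then
        echo 0
    else
        echo "$1"
    fi
}

# $1: number of suspend/resume cycles, each leaking one client.
# Prints the highest client count left standing after any cycle.
peak_clients_after() {
    count=0; peak=0; n=0
    while [ "$n" -lt "$1" ]; do
        count=$(apply_client_limit $((count + 1)))
        if [ "$count" -gt "$peak" ]; then
            peak="$count"
        fi
        n=$((n + 1))
    done
    echo "$peak"
}
```

In this model `peak_clients_after 200` prints `32`: however many clients leak, the count never stays above the threshold, at the cost of dropping every client whenever it is crossed.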

About ADB, you can enable it with AT+ADBON

At this point, I'm handling (correctly I hope)

If you find anything else please let me know!

kkeijzer commented 3 years ago

I suspend after two minutes, so it probably won't be an issue for me, but I flashed it and will test it.

I tried AT+ADBON, but that didn't help. Rebooting the phone didn't help either. But going to fastboot and back apparently did help, and got ADB working again. Weird.

I'll see if I can find the time to try another distro to test a newer ModemManager. In my previous ModemManager log, you could also see that IPv4 wasn't working on a dual stack connection. I guess that would be my only remaining issue.

kkeijzer commented 3 years ago

I still get some issues where I have to reboot the modem or restart ModemManager after resume. The main error I get is that the PPP dialer won't start.

dmesg.txt openqti.log modemmanager.log networkmanger.log

Biktorgj commented 3 years ago

That's on me, I messed up in the rmnet proxy thread function I think... Third time is the charm?

mobian_test3.tar.gz

kkeijzer commented 3 years ago

It seems to be working better. I've been running it since last night, and resuming from suspend works all right now.

However, this afternoon, I saw my network was down while the PinePhone was plugged in (so not sleeping). I checked ip addr and saw I had two IPv6 addresses again; none of them working. I had to run adb shell reboot before I could get it to work again.

dmesg.txt openqti.log modemmanager.log networkmanager.log

P.S. I also tested PostmarketOS today, and with the newer ModemManager the IPv4/IPv6 dual stack bug is still present. I only get IPv6 addresses unless I re-register with IP as the PDP type and then change it to IPV4V6 so IPv6 is added.

Biktorgj commented 3 years ago

Okay, another test build, 0.2.7. I'm not going to say this fixes the disappearing issues completely until you tell me so, because every time I think I fixed it another way of crashing appears.

https://github.com/Biktorgj/pinephone_modem_sdk/releases/tag/0.2.7

One thing I was thinking about regarding your disconnect-reconnect: if you're custom handling the modem profile, and it suspends and resumes, and at some point something breaks and ModemManager thinks it is a new modem, you might want to do a test run with the suspend hooks in ModemManager disabled. It's not great, but it will stop ModemManager from trying to detach itself from the modem every time you suspend, and it will recover more quickly. If a USB disconnect happens inside the modem and the connection dies and needs to renegotiate, this last firmware "might" (|| should || will?) handle it better than previous ones.

To change it, add the command line arg " --test-no-suspend-resume" to the modemmanager systemd service: /usr/lib/systemd/system/ModemManager.service

ExecStart=/usr/sbin/ModemManager --test-no-suspend-resume

kkeijzer commented 3 years ago

I haven't yet tested disabling the suspend hooks, but the current firmware does not seem to change much otherwise. In fact, it looks as if it randomly disconnects now. And sometimes when I run adb shell reboot, the phone seems to lock up, as if something is generating a lot of I/O.

dmesg.txt openqti.log modemmanager.log networkmanager.log

Biktorgj commented 3 years ago

There's something funky with your pinephone at the 520 second mark: [ 520.363640] msm_thermal:do_freq_control Limiting CPU0 max frequency to 1190400. Temp:60

The modem was hitting the temperature throttling threshold, don't know if it was charging or what...
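To see how often this happens, the throttling messages can be pulled out of a saved modem dmesg. A minimal sketch, assuming the messages always have the same shape as the line quoted above; `thermal_events` is a made-up name.

```shell
#!/bin/bash
# Reads a modem dmesg on stdin and prints "timestamp max_frequency temp"
# for every msm_thermal throttling message; all other lines are ignored.
thermal_events() {
    sed -n 's/^\[ *\([0-9.]*\)] msm_thermal:do_freq_control Limiting CPU[0-9]* max frequency to \([0-9]*\)\. Temp:\([0-9]*\).*$/\1 \2 \3/p'
}
```

For example, `thermal_events < dmesg.txt` run on the log from this comment would include a line for the event at the 520 second mark.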

When I close all the connections to the Modem when modemmanager fails to wakeup or suspend, I found it doesn't seem to like to get a ton of packets it didn't request. I'm currently testing draining them before returning control to it, and at least in pmOS it behaves as expected. Flashing Mobian again to a SD to test out, but in case you want to give it a shot:

mobian_test5.tar.gz

kkeijzer commented 3 years ago

It was indeed charging, but it was on my desk, so it's weird that it got so hot.

Anyway, tested the last firmware, but I still have to reboot the modem after resume.

dmesg.txt openqti.log modemmanager.log networkmanager.log

Biktorgj commented 3 years ago

Ok, with latest firmware, can you update only the adsp?

Download this: https://github.com/Biktorgj/quectel_eg25_recovery/blob/EG25GGBR07A08M2G_01.003.01.003/update/NON-HLOS.ubi

And flash it from fastboot: 'fastboot flash modem NON-HLOS.ubi'

In this latest firmware WDS seems to work better than the previous one. Maybe the ADSP firmware is less buggy and the userspace side was worse?

Edit to add some more info: One of the errors I've seen in ModemManager is that it couldn't register a new data connection because there was another one established after release-reconnect. This left the modem without any option except a reset because it left a dead data service active. I've updated the ADSP firmware and so far I haven't been able to make it die, either by suspend-resume or by killing ModemManager in the middle of a transaction.

This firmware has known stability issues as a whole, but now I'm wondering if Quectel updated the adsp but not the userspace applications or introduced some bugs in the userspace that made it unstable. With this we can at least clear that up.

Also, once this is finally working correctly, all mobian users should add the no-suspend-resume option to modemmanager (mentioned earlier). This will remove the 15 second latency of modemmanager working after you suspend the phone. It might introduce some buggy scenarios, depending on how usb stack lands on resume, but most of those should be handled already by this firmware (YMMV!)

kkeijzer commented 3 years ago

Ok, testing with version 0.2.7 + mobian_test5.tar.gz and NON-HLOS.ubi now. I also added the --test-no-suspend-resume parameter to ModemManager.

By the way, it is cleaner to use an override instead of editing stuff in /lib/systemd/

So:

sudo systemctl edit ModemManager.service

Paste this:

[Service]
ExecStart=
ExecStart=/usr/sbin/ModemManager --test-no-suspend-resume

And then:

sudo systemctl daemon-reload
sudo systemctl restart ModemManager.service
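A quick way to confirm the override is picked up is to look at the last ExecStart= line that `systemctl cat` prints, since the drop-in is listed after the packaged unit. The helper names below are hypothetical; note that this checks the unit files, not the command line of an already running process.

```shell
#!/bin/bash
# Reads `systemctl cat ModemManager.service` output on stdin and prints
# the value of the last ExecStart= line (empty if the last one is a reset).
last_execstart() {
    awk '/^ExecStart=/ { value = substr($0, 11) } END { print value }'
}

# Succeeds if that last ExecStart= carries the no-suspend-resume flag.
has_no_suspend_flag() {
    last_execstart | grep -q -- '--test-no-suspend-resume'
}
```

Usage: `systemctl cat ModemManager.service | has_no_suspend_flag && echo "override in place"`.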

kkeijzer commented 3 years ago

So far it's been working pretty well. I was able to resume from suspend immediately this way. But after a couple of times, when the phone woke up, the modem was gone. Invisible with lsusb, so I wasn't able to use ADB. I tried running systemctl restart eg25-manager.service, but to no avail. Only rebooting the phone seemed to help.

dmesg-pinephone.txt modemmanager.log networkmanager.log

kkeijzer commented 3 years ago

Hmm, it just broke again with the 'two IP addresses, none of them working' problem. It wasn't even suspending when this happened.

dmesg.txt openqti.log modemmanager.log networkmanager.log

Biktorgj commented 3 years ago

On the disappearing side, I see the kernel failing on suspend with the SDIO driver which I've never seen before... is your SD card okay? :)

I also wonder if your vpn client is forcing everything to go through the tunnel, and if the tunnel is dying on suspend blocking all connections outside, did you check that?

Let's try another quirk, but this time from the kernel. Your logs do not show anything out of the ordinary unless I missed something, and I've spent 16 hours with no resets-no crashes-always recovering with latest firmware+modem ADSP firmware in Mobian unstable (without no-suspend-resume), and some time with no suspend-resume and a script that made the system suspend without giving the phone time to settle in between. I patched the modem kernel so it doesn't send the missed disconnect signals, and I even got ModemManager to crash but it recovered itself and got network again, so I'm now at a point where I can't seem to make it "die".

Please try this modified kernel, no need to flash it, just a one time boot: adb shell reboot; fastboot oem stay; fastboot boot mdm9607-perf-boot.img

modem_kernel_test_no_missed_disconect.tar.gz

And let's see how it goes

EDIT: I don't understand how your NetworkManager is doing its stuff. First it gets an IPv6 address, then tries to connect to the VPN, then attempts to get a DHCP lease for IPv4 in the modem, but by that time the VPN is already unavailable because it's getting an IPv4 address. Have you tried to disable IPv6? That seems to be the difference between your setup and mine.

Biktorgj commented 3 years ago

So far it's been working pretty well. I was able to resume from suspend immediately this way. But after a couple of times, when the phone woke up, the modem was gone. Invisible with lsusb, so I wasn't able to use ADB. I tried running systemctl restart eg25-manager.service, but to no avail. Only rebooting the phone seemed to help.

dmesg-pinephone.txt modemmanager.log networkmanager.log

If this happens again, please connect through minicom to the serial port (/dev/ttyS2) and issue an AT+RESETUSB, maybe it can come back this way. There's a bug I'm looking into in the A64 USB stack, where it will start failing probing the modem USB and will just drop messages to the kernel log stating "Maybe USB cable is bad". If that's the case the modem won't come back, but the AT interface will still be working fine since it doesn't depend on USB
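If minicom is not at hand, the same command can be written to the serial port from a shell. This is a rough sketch under assumptions that are not stated in this thread: 115200 baud on /dev/ttyS2, GNU `stty`, and no other process (such as eg25-manager) interfering with the port. It sends the command but does not read the modem's reply; `at_frame` and `send_at` are made-up names.

```shell
#!/bin/bash
# Formats an AT command for the wire: the command followed by a carriage return.
at_frame() {
    printf '%s\r' "$1"
}

# $1: AT command, $2: serial port (assumed default: /dev/ttyS2).
# The baud rate of 115200 is an assumption; adjust it if your setup differs.
send_at() {
    port="${2:-/dev/ttyS2}"
    stty -F "$port" 115200 raw -echo &&
    at_frame "$1" > "$port"
}
```

Usage, as root: `send_at 'AT+RESETUSB'`. Whether the USB side comes back afterwards can then be checked with lsusb.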

kkeijzer commented 3 years ago

I'm pretty sure my SD card is okay. But my / is on the eMMC. The SD card only contains /home.

The reason I'm using the VPN is because the dual stack setup is so broken. When I use IPV4V6 as PDP type, I only get an IPv6 address, because ModemManager for some reason then thinks IPv4 is not allowed by the APN. When I use IP as PDP type, I only get an IPv4 address (obviously), but then I don't have IPv6 at all. The VPN can connect over either IPv4 or IPv6 and then do 6in4 or 4in6. NetworkManager starts the VPN as a secondary for the LTE connection, so the VPN is not started at all until the LTE connection is up.

If we could find a way that dual stack would just work, I probably wouldn't be using the VPN at all.

The only way for me to get that to work now, is running this script after every (re)connection:

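#!/bin/bash

# Set the PDP type to IP and cycle the radio so the modem registers again.
mmcli -m any --command='AT+CGDCONT=1,"IP"'
mmcli -m any --command='AT+CFUN=0'
mmcli -m any --command='AT+CFUN=1'
sleep 1
# Switch the PDP type to dual stack after registering, then reconnect.
mmcli -m any --command='AT+CGDCONT=1,"IPV4V6"'
nmcli connection up 'Standaard'

exit 0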

But I'll try with the modified kernel, IP as the PDP type, and no VPN. Not having IPv6 will be really limiting though.

kkeijzer commented 3 years ago

So far it seems to be going all right with this kernel. I'm using dual stack IPv4+IPv6 using the script above, but the VPN is disabled. I would have to run that script on every (re)connect, but I guess I can live with that for now. I'm also still using --test-no-suspend-resume.

Should I eventually flash this kernel?

Biktorgj commented 3 years ago

If it keeps working correctly you can flash it :)

I think part of what is messing with the script could be timing: issuing a CFUN=0 right before a CFUN=1 without some sleep might be too fast for the modem. Maybe you can add some sleeps in between and see if it helps.

All my latest patches should help keeping ModemManager in check wrt suspend so you can keep hooks disabled too
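The timing suggestion above could look like the following variant of the script, with a pause after each CFUN change. The delay of 5 seconds is a guess, not a measured value, and `run`, `at`, `reset_dual_stack`, `CFUN_DELAY`, `CONNECTION` and `DRY_RUN` are illustrative names; by default it only prints the commands.

```shell
#!/bin/bash
CFUN_DELAY="${CFUN_DELAY:-5}"          # assumed pause in seconds; tune as needed
CONNECTION="${CONNECTION:-Standaard}"  # NetworkManager connection name from the script above
DRY_RUN="${DRY_RUN:-1}"                # 1 = print commands only, 0 = execute

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Sends one AT command through ModemManager, as the original script does.
at() { run mmcli -m any --command="$1"; }

reset_dual_stack() {
    at 'AT+CGDCONT=1,"IP"' &&
    at 'AT+CFUN=0' &&
    run sleep "$CFUN_DELAY" &&         # give the radio time to shut down
    at 'AT+CFUN=1' &&
    run sleep "$CFUN_DELAY" &&         # give the modem time to register again
    at 'AT+CGDCONT=1,"IPV4V6"' &&
    run nmcli connection up "$CONNECTION"
}
```

Running `reset_dual_stack` as is lists the seven steps; `DRY_RUN=0 CFUN_DELAY=10 ./script` style overrides make it easy to try different delays without editing the file.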

kkeijzer commented 3 years ago

The script itself is working fine. The problem is that without the script, ModemManager only gets an IPv6 address when the PDP type is IPV4V6. For some reason, when I register to the network with IPV4V6, ModemManager claims the IPv4 call is not allowed, and only gives me an IPv6 address.

When I register with IP as the PDP type, I do get an IPv4 address, but (obviously) no IPv6. But when I change from IP to IPV4V6 after registering (so registering with IP and then changing to IPV4V6 and restarting the connection), I do get both an IPv4 address and an IPv6 address. It's really weird..

kkeijzer commented 3 years ago

I'll add some logs.

Probably not interesting:

dmesg.txt openqti.log

When the modem is booted with IPV4V6 as PDP type, and registers like that:

modemmanager-before.log networkmanager-before.log

As you can see, IPv4 fails, and I only get an IPv6 address.

When I run the script, which changes the PDP type to IP, registers again to the network, and then changes it to IPV4V6 immediately afterwards:

modemmanager-after.log networkmanager-after.log

So for some reason, registering with IPV4V6 breaks IPv4, but changing it to IPV4V6 after registering allows both to work.

The kernel is flashed by the way. I'll continue testing.

Biktorgj commented 3 years ago

Hey @kkeijzer , is your modem still working or have you needed to reboot?

PsychoGame commented 3 years ago

Today's package update on Manjaro Phosh development made everything go bad. After each sleep session I need to reboot to get mobile back online. I will try the no-suspend-resume hook to see if this improves the situation.

Biktorgj commented 3 years ago

Get me a dmesg of the PinePhone kernel when it happens next time, please!

PsychoGame commented 3 years ago

Edit: Here is the dmesg of the PinePhone kernel. I think I already may have found part of the culprit. This morning when going to the amusement park I disabled adb from booting up. Later on to try and debug some things I re-enabled persistent ADB boot again. This makes the modem perfectly stable again. My gut feeling tells me that (at least my issue) is caused by disabling ADB. I'll try to investigate some more why the modem dislikes the disabled ADB. Maybe I'll set up some persistent logging on the modem itself to figure out what happens after the usb connection dies dmesg.log

I have the logs for you on my phone already. Will upload them later in the day. Currently I'm in an amusement park with my kids.

kkeijzer commented 3 years ago

Hey @kkeijzer , is your modem still working or have you needed to reboot?

So far it's still working. I rebooted the entire phone once for a different reason though. But I travelled to work and back today, and the connection didn't drop along the way. So I'm quite positive for the time being.

kkeijzer commented 3 years ago

Unfortunately I celebrated too early..

dmesg.txt openqti.log modemmanager.log networkmanager.log

Biktorgj commented 3 years ago

Are you still using the no suspend resume param?

kkeijzer commented 3 years ago

Yes. And the phone was plugged in (so not going to sleep) when this happened.

Biktorgj commented 3 years ago

So this is a message I hadn't seen in a long time:

[16023.571441] msm_hsusb msm_hsusb: CI13XXX_CONTROLLER_RESUME_EVENT received
[16023.571767] msm_otg 78d9000.usb: Avail curr from USB = 500
[16023.578149] msm_otg 78d9000.usb: Avail curr from USB = 2
[16023.578305] msm_hsusb msm_hsusb: CI13XXX_CONTROLLER_SUSPEND_EVENT received
[16023.595749] udc_irq: USB reset interrupt is delayed
[16023.595862] msm_hsusb msm_hsusb: CI13XXX_CONTROLLER_RESUME_EVENT received
[16023.596080] msm_otg 78d9000.usb: Avail curr from USB = 500
[16023.596309] msm_otg 78d9000.usb: Avail curr from USB = 100
[16023.596453] diag: USB channel diag disconnected
[16023.935561] android_usb gadget: high-speed config #1: 86000c8.android_usb
[16023.941484] diag: USB channel diag connected
[16023.946852] msm_otg 78d9000.usb: Avail curr from USB = 500
[16036.505955] msm_otg 78d9000.usb: Avail curr from USB = 2

So at some point the PinePhone issued a USB reset. Could you grab the log from eg25manager just in case? My test is so far at 23 hours 47 minutes without a crash, reset, reboot or anything needed in pmOS Edge.

@kkeijzer @PsychoGame Did you both have to reboot the modem or the phone to make it work again, or did it recover?

kkeijzer commented 3 years ago

It didn't recover. I had to run adb shell reboot. Attached is yesterday's eg25-manager log.

eg25-manager.log

DylanVanAssche commented 3 years ago

It didn't recover. I had to run adb shell reboot. Attached is yesterday's eg25-manager log.

eg25-manager.log

Ah interesting! the reset from above comes from the eg25-manager... So it is not from the A64 or modem kernel, but the A64 userspace.

PsychoGame commented 3 years ago

Hello @Biktorgj,

No, my modem didn't recover automatically. I tried restarting eg25-manager and also NetworkManager, but to no avail. I had to reboot the phone to get it working again for a short while. In my case, enabling persistent ADB makes everything more stable. At the moment I can't really figure out why.