mmiller7 / Arris_modem_scrape


Question on what data to monitor #1

Open PeteRager opened 1 year ago

PeteRager commented 1 year ago

In your experience which data elements indicate that the modem has lost connectivity and needs to be rebooted? Do I need to look at all the up and down channels?

mmiller7 commented 1 year ago

That's a very complicated question.

Generally, you are probably better off just doing a simple ping-test to see if it's worth rebooting your modem because the ISP support will demand that anyway. I don't auto-reboot mine, but I do have ping-sensors which monitor my router (192.168.1.1), my modem (192.168.100.1) and something on the public internet (like 1.1.1.1).

If you can reach your router but not your modem, something probably needs rebooting. If you can reach the router and modem but not the internet then maybe it needs rebooting.
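The triage described above could be sketched as a small shell script. This is only an illustration: the function names (`can_ping`, `status`, `diagnose`) and the message wording are made up here, and the "router unreachable" branch is an added assumption that the comment does not cover.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a single ping to $1 gets a reply.
# Timeout flags differ between Linux and FreeBSD, so none is set here.
can_ping() {
    ping -c 1 "$1" > /dev/null 2>&1
}

# Turn a ping result into the word "up" or "down".
status() {
    if can_ping "$1"; then echo up; else echo down; fi
}

# diagnose ROUTER MODEM INTERNET  (each argument is "up" or "down")
# Prints a rough diagnosis following the rule of thumb in the comment above.
diagnose() {
    router=$1 modem=$2 internet=$3
    if [ "$router" != up ]; then
        # Assumption: not covered in the original comment.
        echo "router unreachable: check the LAN or router first"
    elif [ "$modem" != up ]; then
        echo "modem unreachable: something probably needs rebooting"
    elif [ "$internet" != up ]; then
        echo "internet unreachable: modem may need rebooting"
    else
        echo "all reachable"
    fi
}

# Example call with the addresses mentioned above (adjust to your network):
# diagnose "$(status 192.168.1.1)" "$(status 192.168.100.1)" "$(status 1.1.1.1)"
```

Keeping `diagnose` separate from the actual pings makes it easy to try the decision logic by hand before wiring it to real sensors.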

I use the data to get alerts when my modem signals go out of spec so I know to keep a closer eye on it. You can look up the acceptable power levels on the Arris website for your particular configuration (it depends on modem model, number of channels, etc.).

Generally, if your modem's "connectivity state" isn't "operational" or the "acquire downstream channel" isn't "locked", it's not talking to your ISP properly. If you watch these as the modem boots up (you may need to use a web browser to go to the modem's page and refresh every few seconds) you will see it step thru different states as it finds the ISP servers, downloads its configuration, and finally applies its configuration.
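As a rough sketch, that check could be written as a shell function. The function name `modem_link_ok` is hypothetical, and how the two values are pulled out of the modem's status page is not shown; it assumes they have already been scraped into strings. The comparison ignores case because the exact capitalization on a given firmware is not known here.

```shell
#!/bin/sh
# modem_link_ok CONNECTIVITY_STATE ACQUIRE_DOWNSTREAM
# Succeeds (exit 0) only when the state is "operational" and the
# downstream channel is "locked", compared case-insensitively.
modem_link_ok() {
    state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    lock=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "$state" = "operational" ] && [ "$lock" = "locked" ]
}

# Example:
# if modem_link_ok "Operational" "Locked"; then echo "modem is talking to the ISP"; fi
```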

Generally, if I see the signal levels go wildly out of range and then my connection stops working, I reboot my modem once and then start phoning my ISP to ask them to investigate. It's mostly to help expedite my troubleshooting so I don't waste time fiddling with my computer or router when the modem signals have gone wonky.

Hopefully that answers your question!

PeteRager commented 1 year ago

Great information. Thank you.

My use case is to have HASS automatically fix modem issues, as I’m not always at the residence and able to manually reboot, and I use HASS to monitor what’s going on (power, temperature, leaks, etc.). Recently we had a power outage; all the IT equipment is on UPS and there’s a backup generator. Typically Spectrum lasts about 2 hours after an outage (batteries on their pole-mounted devices?); after that there’s no internet until power comes back on. Most of the time the modem recovers, however during the last one it did not, and hence I was unable to assess the state of the house. When I got to the house I power cycled the modem and that solved the issue. So now I’m adding Z-Wave plugs and working on an automation to reset them when the internet ping sensors are down for more than a couple of hours.

Current plan is to reboot the modem, wait a while, check the internet ping sensors and then reboot the router if pings are still bad and utility power is restored. And repeat that every couple of hours until the problem corrects.
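The decision part of that plan might look roughly like the sketch below. The function name `next_action`, its yes/no arguments, and the action words it prints are all hypothetical; in practice this logic would live in the HASS automation rather than a shell script.

```shell
#!/bin/sh
# next_action INTERNET_OK MODEM_REBOOTED UTILITY_POWER  (each "yes" or "no")
# Prints the next step in one recovery cycle:
#   none          - internet pings are fine, nothing to do
#   reboot_modem  - first step of the cycle
#   reboot_router - modem was already rebooted, pings still bad, power is back
#   wait          - pings still bad but utility power is not restored yet
next_action() {
    internet_ok=$1 modem_rebooted=$2 utility_power=$3
    if [ "$internet_ok" = yes ]; then
        echo none
    elif [ "$modem_rebooted" != yes ]; then
        echo reboot_modem
    elif [ "$utility_power" = yes ]; then
        echo reboot_router
    else
        echo wait
    fi
}

# Example: pings are down, modem already rebooted, power restored
# next_action no yes yes
```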

Based on your notes, I’ll add ping sensors for the modem and the router. Interesting that you see the modem not respond to pings when it’s failed. I’ll add that info into the automation logging.

BTW, that’s an amazing sed / awk statement in your script! Wow!

mmiller7 commented 1 year ago

I'd call it a very hacky awk/sed script. I'm not very good at parsing HTML but it seemed to do the job.

In an ISP outage like the one you describe, the modem usually stays available, but once in a while if it gets in a really bad state it can go unresponsive entirely (or if the IP gets out of sync with the router's WAN it can show as a PING failure to the modem).

I find the PING sensors far more useful in identifying "X is broken", but the signal level and a sudden spike in uncorrectable count are sometimes useful for spotting "stuff is degrading" before it totally fails.

I made a dashboard so I can at a glance see the status of everything on my network to help me debug (usually from within the house). Basically I tried to go thru "what are my manual troubleshooting steps" and make a sensor that performed each step of my manual troubleshooting process. [image: dashboard]

Do make sure your reboot automation can't run too often, since it can take "a while" for stuff to fully recover from a power cycle. I'd give it like 10 minutes or so between attempts to power-cycle so it can't get stuck in a loop (or if it does, you have a short gap to VPN in and remotely disable the rule to break the loop).
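A minimal sketch of that kind of cooldown guard, assuming timestamps in epoch seconds. The names `COOLDOWN_SECONDS` and `may_power_cycle` and the stamp-file path in the usage comment are hypothetical; 600 seconds simply mirrors the "10 minutes or so" above.

```shell
#!/bin/sh
# Minimum gap between power-cycle attempts, in seconds.
COOLDOWN_SECONDS=600

# may_power_cycle LAST_EPOCH NOW_EPOCH
# Succeeds (exit 0) only if at least COOLDOWN_SECONDS have passed
# since the last power cycle.
may_power_cycle() {
    last=$1 now=$2
    [ $((now - last)) -ge "$COOLDOWN_SECONDS" ]
}

# Usage sketch (stamp file path is made up):
# stamp=/tmp/last_power_cycle
# last=$(cat "$stamp" 2>/dev/null || echo 0)
# now=$(date +%s)
# if may_power_cycle "$last" "$now"; then
#     echo "$now" > "$stamp"
#     # ...trigger the power cycle here...
# fi
```

Recording the timestamp before triggering the power cycle means a failed or hung attempt still counts toward the cooldown, which is the safer direction for avoiding a reboot loop.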

Have you considered a "back-door" to debug stuff? I set up my house with pfSense and multi-WAN failover with a cellular modem and HA has a Telegram bot so I have a way to interact with it even if the cable-modem is down.

mmiller7 commented 1 year ago

The other REALLY good thing with tracking signals (and why I started) is that if you have ongoing issues you can sometimes help pinpoint the time of day it's happening. For example, I eventually made random dashboards and noticed my signals went out of whack and then dropped offline every time it got above about 85-90F outside, usually 3-7PM, so I was able to ask for a tech to come during that time slot when it was more likely to act up. It was a hard issue to track because it felt random until I saw the graph next to temperature and it was a perfect match.

PeteRager commented 1 year ago

Nice dashboard and a great setup.

I run pfSense also and have a VPN between both houses' routers, so it works like a single WAN with 2 independent HASSs.

At most I’d have it run every couple of hours, and I really want to avoid rebooting the router. Hence if I can ping the modem and grab the disconnected state from it, then I can skip the router reboot and just keep trying the modem until it succeeds. Unfortunately my modem runs an older version of surfboard 9.1.93V and the URLs and page content are different than what this scraper wants. Spectrum allows us to use our own modem, so that’s what we did, but I’m also on my own if it fails. Given it’s worked fine for 3 years and only required a couple of resets, I’m not in a hurry to upgrade it (and have the upgrade fail….). Guess I’ll learn some web scraping also….

I’d like a cellular backup and my routers will work with it. Who do you use? Are you able to dyndns and VPN into it?

mmiller7 commented 1 year ago

I run a VPN (OpenVPN hosted on pfSense) that is accessible over cable, but cellular blocks inbound connections, so that's why I went with a Telegram bot as a side-channel for control (since Telegram just reaches out to the world, it doesn't require listening or care if it has a public IP). You can create automations in HA that fire based on Telegram bot commands, so as long as the bot is "alive" you can have some basic interaction. For example, I can do simple things like query for a status or request that it perform a speedtest and reply with the results (there's an Ookla CLI speedtest app that works on Linux and FreeBSD).

For pfSense you can also install the cron package and have it do stuff like bounce the WAN interface if it hasn't got a good IP or if it's not the primary, and run that every X minutes instead of rebooting the entire box. It seems to help it self-recover faster. Another workaround is that some people block DHCP leases from 192.168.100.1, but that means if the modem is up but the ISP is down you won't be able to get to the modem stats/logs because it will have no IP at all.

igb0 is my primary WAN interface (cable modem). Here are some snips from my pfSense box if you are interested (pic and text format): [image]

Script that bounces WAN igb0 if it has a local modem-only IP, in hopes that the ISP is back up but the pfSense DHCP lease just hasn't expired (THIS MAY BE WHAT HAPPENED TO YOU). The modem will give a 192.168.100.x IP when it can't talk to the ISP head-end to get a public IP, and pfSense might not renew it until the lease expires, so that would cause an extended outage. Bouncing it with ifconfig down/up should force a renew immediately. Hypothetical situation:

  1. ISP goes down
  2. Router gets local 192.168.100.0/24 IP from modem
  3. ISP comes back up
  4. Lots of time passes when network is nonfunctional because it can only reach the modem on the private address
  5. Eventually private IP expires and router asks to renew
  6. Valid public IP obtained, internet should start working

    /sbin/ifconfig igb0 | /usr/bin/egrep "inet 192\.168\.100\.[0-9]+" > /dev/null && (/usr/bin/logger "CRON-Watchdog: Bouncing igb0 due to IP 192.168.100.x, hopefully force new DHCP from ISP"; /sbin/ifconfig igb0 down; /sbin/ifconfig igb0 up)
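
To see which addresses an address test like this catches, it can help to pull it out into a function and feed it sample text. This is a sketch only: `has_modem_only_ip` is a made-up name, the sample lines in the comment are illustrative rather than captured output, and the pattern (escaped dots, any last octet) is an assumption about the intent of the check.

```shell
#!/bin/sh
# has_modem_only_ip
# Reads ifconfig-style text on stdin and succeeds if it contains an
# IPv4 address in 192.168.100.0/24 (the modem's private range).
has_modem_only_ip() {
    grep -Eq 'inet 192\.168\.100\.[0-9]{1,3}( |$)'
}

# Try it on hand-written sample lines before trusting it in cron:
# printf 'inet 192.168.100.35 netmask 0xffffff00\n' | has_modem_only_ip && echo match
# printf 'inet 203.0.113.7 netmask 0xffffff00\n'    | has_modem_only_ip || echo no match
```

On the real box the input would come from `/sbin/ifconfig igb0`, as in the cron line above.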

Script that bounces WAN igb0 if it appears to be up but isn't the primary default gateway (which likely means it came back but hasn't failed over yet). 208.67.222.222 is what I used for my 'monitor IP' on that WAN; it's just one of many public DNS servers. I tried to make it take 2 "tries" before it does anything, and then it does an ifconfig down/up to hopefully force it to reset.

/sbin/ping -c 1 208.67.222.222 > /dev/null && /bin/test `/usr/bin/netstat -rn -4 | /usr/bin/grep default | /usr/bin/awk '{ print $4 }'` != igb0 && ( /bin/test -f /tmp/igb0_up_but_not_primary && ( /usr/bin/logger "CRON-Watchdog: Bouncing igb0 due to /tmp/igb0_up_but_not_primary"; /sbin/ifconfig igb0 down ; /sbin/ifconfig igb0 up ) || ( /usr/bin/logger "CRON-Watchdog: Setting flag at /tmp/igb0_up_but_not_primary"; /usr/bin/touch /tmp/igb0_up_but_not_primary ) ) || ( /bin/test -f /tmp/igb0_up_but_not_primary && (/usr/bin/logger "CRON-Watchdog: Removing flag at /tmp/igb0_up_but_not_primary"; /bin/rm /tmp/igb0_up_but_not_primary) )
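The one-liner above is dense, so here is a rough multi-line sketch of the same two-try flag logic as a dry run. It prints what it would do instead of calling ifconfig, and takes the ping result and the default-route interface as arguments rather than probing them. The function name `watchdog_step`, the action words, and the `em0` interface in the example are hypothetical.

```shell
#!/bin/sh
# watchdog_step PING_OK DEFAULT_IF FLAG_FILE
#   PING_OK    - "yes" if the monitor IP answered a ping
#   DEFAULT_IF - interface currently holding the default route
#   FLAG_FILE  - path of the marker file that records the first "try"
# Prints the action taken: flagged, bounce, cleared, or idle.
watchdog_step() {
    ping_ok=$1 default_if=$2 flag=$3
    if [ "$ping_ok" = yes ] && [ "$default_if" != igb0 ]; then
        # igb0 looks alive but is not the primary gateway.
        if [ -f "$flag" ]; then
            echo bounce      # second try: this is where ifconfig down/up would run
        else
            touch "$flag"    # first try: only remember that we saw it
            echo flagged
        fi
    elif [ -f "$flag" ]; then
        rm -f "$flag"        # condition cleared, forget the earlier try
        echo cleared
    else
        echo idle
    fi
}

# Example run from cron every X minutes (flag path as in the one-liner):
# watchdog_step yes em0 /tmp/igb0_up_but_not_primary
```

As in the original, the flag is not removed after a bounce, so the interface keeps getting bounced on each run until it becomes the default route again or the ping fails.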

There are a few options that might help you get a system that self-recovers more quickly. You can also test the setup by unhooking the coax line from your modem and waiting a while to observe what happens, then hooking it back up and observing what happens. I had to do that several times to try and reproduce issues.

I also found a bug in pfSense WAN failover that can cause a false-positive on gateway being operational if you ping something that is set to a static-route on the down interface. My workaround is I made firewall deny-rules that only permit LAN -> WAN for the monitor IPs over the interface I want them to work thru. Details on that are here: https://redmine.pfsense.org/issues/11296 Related thread on the same bug (you may recognize my posts with HA history-card graphs of my PING sensors): https://forum.netgate.com/topic/160103/static-routes-not-as-expected

I know we got kinda off topic but I think those tips may help your situation since it sounds similar to mine.

PeteRager commented 1 year ago

Thanks for the detailed notes, I’ve been working through them.

I do see that when I disconnect the coax from the modem, pings to 192.168.100.1 start failing.