sparky8512 / starlink-grpc-tools

Random scripts and other bits for interacting with the SpaceX Starlink user terminal hardware

Hangs after starting influx script #36

Closed: bile0026 closed this issue 2 years ago

bile0026 commented 3 years ago

Environment: Server OS: CentOS 8 (kernel 4.18.0-305.19.1.el8_4.x86_64); Python (venv): 3.6.8

command: python3 dish_grpc_influx.py -v status alert_detail -t 10 -U <username> -P <password>

Results: Running the command. In this case it's running as a service, but it does the same thing when not running as a service.

Nov 02 22:18:17 BIL-STARLINK systemd[1]: Started Starlink Influx Script.

It hangs at this point and never gets any further.

bile0026 commented 3 years ago

Moved the same script to an Ubuntu box and it runs fine. Must be something CentOS-specific? Is there additional logging that can be enabled?

sparky8512 commented 3 years ago

Thanks, I was about to ask if you could try it on another system, as I have been unable to reproduce such behavior. This at least confirms that it's not the result of some firmware update that your dish got that mine has not (which I did not think likely, but good to know for sure).

There's not much logging in the scripts themselves, and -v already enables it. There may be more that can be enabled in the protobuf module, but I'd have to go digging into how to enable that. It should be possible to determine where it's hanging, though, by running the script in an interactive terminal, and once it hangs, interrupt it with Ctrl-C. That should print the stack trace of where it was when interrupted.

bile0026 commented 3 years ago

Looks like it's timing out if I'm reading this correctly? I can telnet to dishy on port 9200, so that's weird.

Traceback (most recent call last):
  File "dish_grpc_influx.py", line 330, in <module>
    main()
  File "dish_grpc_influx.py", line 311, in main
    rc = loop_body(opts, gstate)
  File "dish_grpc_influx.py", line 254, in loop_body
    rc = dish_common.get_data(opts, gstate, cb_add_item, cb_add_sequence, add_bulk=cb_add_bulk)
  File "/home/zbiles/starlink/dish_common.py", line 196, in get_data
    rc = get_status_data(opts, gstate, add_item, add_sequence)
  File "/home/zbiles/starlink/dish_common.py", line 231, in get_status_data
    groups = starlink_grpc.status_data(context=gstate.context)
  File "/home/zbiles/starlink/starlink_grpc.py", line 592, in status_data
    status = get_status(context)
  File "/home/zbiles/starlink/starlink_grpc.py", line 552, in get_status
    return call_with_channel(grpc_call, context=context)
  File "/home/zbiles/starlink/starlink_grpc.py", line 422, in call_with_channel
    return function(channel, *args, **kwargs)
  File "/home/zbiles/starlink/starlink_grpc.py", line 547, in grpc_call
    resolve_imports(channel)
  File "/home/zbiles/starlink/starlink_grpc.py", line 359, in resolve_imports
    importer.resolve_lazy_imports(channel)
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/yagrc/importer.py", line 335, in resolve_lazy_imports
    _lazy_importer.configure(channel, filenames=filenames)
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/yagrc/importer.py", line 201, in configure
    loaded = self.reflector.load_protocols(channel, filenames=filenames)
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/yagrc/reflector.py", line 106, in load_protocols
    for response in responses:
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/grpc/_channel.py", line 817, in _next
    _common.wait(self._state.condition.wait, _response_ready)
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/grpc/_common.py", line 141, in wait
    _wait_once(wait_fn, MAXIMUM_WAIT_TIMEOUT, spin_cb)
  File "/home/zbiles/starlink/lib64/python3.6/site-packages/grpc/_common.py", line 106, in _wait_once
    wait_fn(timeout=timeout)
  File "/usr/lib64/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

sparky8512 commented 3 years ago

Hmm... I expect if it were actually timing out, it would fail with an error, not hang. I'll have to check what level is actually setting the timeout in this case. But even if the problem is that it's not timing out when it should, that still raises the question of why it's not getting a response when it should. When you say you can telnet to the dish, you are testing from that same system, yes?

Anyway, this appears to be somewhere in the guts of the grpc module where it's opening the channel, probably the very first time it has tried connecting to the dish, since it's still trying to get the service protocol definitions via reflection.
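
If you want a quick check at the gRPC layer itself (one level above telnet), something along these lines should tell you whether a channel can even be established to the dish. This is just a standalone sketch, separate from anything the scripts do:

import grpc

# Standalone connectivity check against the dish's gRPC endpoint.
# grpc.channel_ready_future() waits for the channel to reach the READY state
# and raises grpc.FutureTimeoutError if that doesn't happen in time.
channel = grpc.insecure_channel("192.168.100.1:9200")
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("gRPC channel is ready")
except grpc.FutureTimeoutError:
    print("gRPC channel did not become ready within 5 seconds")
finally:
    channel.close()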

bile0026 commented 3 years ago

That is correct, testing from the same box I'm running the scripts on.

Since I have another box I can run things on for now I'm not completely stuck, but if you happen to find anything out I'd be interested in knowing what you find. I'm also willing to provide additional testing/outputs if it's helpful to fix it for someone else as well.

It's just odd because it's been working for weeks, and all of a sudden the other day it stopped.

sparky8512 commented 3 years ago

> It's just odd because it's been working for weeks, and all of a sudden the other day it stopped.

That may make it difficult for me to reproduce. I can install that exact OS version in a VM, but if it was working before, then it may not be related to the OS at all. Since you're logging status, do you happen to know if there was a dish firmware update around the time it stopped working? I still doubt firmware changes would have anything to do with this, but any clue would be useful here.

It would be interesting to see a tcpdump for 192.168.100.1 port 9200 from this system if you can run one.
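
Something like the following should capture just the relevant traffic (swap eth0 for whatever interface faces the dish on your setup):

sudo tcpdump -i eth0 -w capture.pcap 'host 192.168.100.1 and tcp port 9200'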

BTW: I have confirmed that this traceback does correlate with what I see when I interrupt prior to timeout if I block traffic to the port via firewall rule. However, if I don't interrupt, it does eventually fail with an error. Actually, it then goes on to crash the script with an AttributeError rather than failing gracefully, so that's something I should fix, but that's not the root cause of your problem here.

bile0026 commented 3 years ago

Here's a tcpdump if that helps. I see syn/ack both ways 🤷

capture.pcap.zip

sparky8512 commented 3 years ago

Huh. Well, I guess that explains why it didn't time out, and why you are able to telnet to the port. The TCP connection opens successfully and there is 2-way data communication, but it appears the response from the dish just stops shortly after opening, and without closing the connection. None of the lower level data layers know that anything is amiss.

I'm going to compare the TCP traffic from a normal case and see if I can dig up anything.

sparky8512 commented 3 years ago

So... it looks like the point where it stopped getting packets from the dish was exactly the point where the dish would have started sending larger Ethernet packets. In fact, it probably would have been a jumbo packet, but if those fail to be ACKed, it looks like the dish does properly fall back to normal sized (1500 byte) Ethernet packets.

I was able to reproduce a hang at this same point by using a firewall rule to forcibly drop inbound port 9200 traffic with packet length 1500 bytes or larger. The sender won't ever close the TCP connection in this case because it thinks the link has physically broken, and the receiver never errors out because it thinks the connection is still open and just hasn't gotten the data it was waiting for yet.
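
For anyone who wants to reproduce this, a Linux rule along these lines should do it. I'm not claiming this is the exact rule I used, and the syntax may need adjusting for your firewall frontend:

sudo iptables -I INPUT -p tcp --sport 9200 -m length --length 1500:65535 -j DROP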

While this is a corner case error, hanging in the face of it is a particularly poor way to handle it. I did run into a hang with the script running on my Ubuntu system at one point, and I never did figure out what the problem was, but I suspect it was something similar to this. I was testing dish reboots around that time and if the dish happened to reboot or power cycle in the middle of an open request, I expect the same hang behavior would occur.

I believe there is a way for the application to specify a completion timeout on the grpc callable interface, and I will look into doing so, but all that will do is turn what was a script hang into a more graceful error.
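
Just as an illustration of what I mean (this uses the stock grpcio reflection stubs, not the exact code path the scripts and yagrc go through), a per-call deadline turns a stalled response stream into a DEADLINE_EXCEEDED error instead of an indefinite hang:

import grpc
from grpc_reflection.v1alpha import reflection_pb2, reflection_pb2_grpc

# Requires the grpcio-reflection package. This makes the same kind of server
# reflection request that was hanging in the traceback above, but with an
# explicit 10 second deadline on the call.
channel = grpc.insecure_channel("192.168.100.1:9200")
stub = reflection_pb2_grpc.ServerReflectionStub(channel)
request = reflection_pb2.ServerReflectionRequest(list_services="")
try:
    for response in stub.ServerReflectionInfo(iter([request]), timeout=10):
        print(response.list_services_response)
except grpc.RpcError as err:
    # A hung or unreachable server now surfaces here, typically as
    # grpc.StatusCode.DEADLINE_EXCEEDED.
    print("reflection call failed:", err.code())
finally:
    channel.close()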

So that leaves the question of why this is happening on your CentOS system. Maybe a network infrastructure issue? I would not expect that to be specific to this application, though. Rather, I would think it would break any connection that was receiving a significant amount of data.

bile0026 commented 3 years ago

Hmm, yeah, that could be it. My server is actually running in the cloud, and I have a VPN back to my home which used to be an L2TP VPN; I switched to WireGuard around the same time. I may dig into that a bit more and see if there's something there.

The Ubuntu system is at home and is using the same WireGuard tunnel to write the data to the InfluxDB instance on my VPS, but maybe Influx isn't as picky about packet sizing?

sparky8512 commented 2 years ago

I believe those changes should resolve the hang behavior, although to be clear, they will still result in a (graceful) error being reported. I think for the InfluxDB script that will result in DISH_UNREACHABLE being reported for state in status. Although, I think it will only log that to Influx if it knows the dish hardware ID, which will only be the case if the script had succeeded in getting status data from the dish in the past.

Note that the hang reported on this issue was actually in the yagrc module, although there was a similar issue in the starlink-grpc-tools scripts, too, so you'll need both the new starlink-grpc-tools changes and an updated yagrc package version to get the complete fix. You can get a sufficient yagrc package version by doing the following after updating starlink-grpc-tools:

python3 -m pip install -r requirements.txt

I'm going to close this issue out on the assumption that the root cause here was actually a network infrastructure issue that exposed a badly handled error case. Feel free to re-open if you find differently.

I do want to note that I appreciate you reporting this issue, as it led me to investigate the likely cause of what had been a puzzling intermittent hang issue that I probably would not have otherwise uncovered.