Closed jjpavlik closed 4 years ago
So, this attempt didn't really improve the situation, there's clearly something else going on. I caught a few higher than 65k values on Rx bytes, which sort of suggests it is indeed going beyond the 16 bits, however is not consistently growing like I would expect. There's something in the code that is either resetting the value or something similar.
Next step is to add the chance of pushing these metrics Rx/Tx, Errors, Rp/Tp to CW to see if I can spot any particular pattern.
Having a look at the CW metrics after a few hours, the following is clear:
Note: I need to increase the number of messages retrieved from SQS in pusher.py since the number of queued messages grew considerably since started pushing the other metrics.
Think I got it!!! 98K is in the boundary of the 6 characters, which is exactly METRIC_BUFF_SIZE right now:
``
``
Looks like within __uint_to_char_array() I'm doing a bad job and keeping things control within the while loop. There's a chance the bad boundary control is causing the Arduino code to crash.
Looking better now!
Now the problem doesn't seem to be related to integer wrapping but something else
This happened at ~00:50 UTC on February 17th and then again at ~14:47 UTC on March 6th, numbers are not even close one from the other so is something different this time.
Event on February 17th:
Event on March 6th:
These events SHOULD NOT be causing Arduino crashes... I'm clearly sending some bad data to the Arduino when these APIs failed... I'm inclined to believe the problem actually is the connection issue with Openweather (connection is being refused), maybe I'm passing bad data to the arduino when this happens. It's fun though how both endpoints start failing at the same time...
BTW... the previous logs were on the collector. The pusher shows exactly the same, struggling to reach SQS endpoint in the same timeframe:
2020-02-17 01:08:52,551 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:09:19,459 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:11:31,311 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:11:55,980 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:12:20,124 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:12:50,945 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:13:21,439 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:13:47,641 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:07,054 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:23,381 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:50,831 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:51:17,108 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/"
So:
Feb 17 01:08:33 raspberrypi kernel: [1392279.420920] smsc95xx 1-1.1:1.0 eth0: link down Feb 17 01:09:12 raspberrypi kernel: [1392318.725167] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xC5E1
That explains the name resolution issues, however considering the interface was up by 01:09 this can't really explain the Connection Refused errors between 01:11 and 01:13.
Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting address 2a02:8084:2562:9e00:438b:8b4d:ce4:a621/64 Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Withdrawing address record for 2a02:8084:2562:9e00:438b:8b4d:ce4:a621 on eth0. Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting default route via fe80::5667:51ff:fe4d:38e6 Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting route to 2a02:8084:2562:9e00::/64 Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Leaving mDNS multicast group on interface eth0.IPv6 with address 2a02:8084:2562:9e00:438b:8b4d:ce4:a621. Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::dbef:bf46:6247:d5c0. Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Registering new address record for fe80::dbef:bf46:6247:d5c0 on eth0.*.
In both cases the mechanism to protect the collected metrics and store them in memory to push them later on once the connectivity issue is sorted failed :).
Mar 6 14:49:55 raspberrypi collector.sh[8404]: Traceback (most recent call last):
Mar 6 14:49:55 raspberrypi collector.sh[8404]: File "collector.py", line 233, in
causing the collector to crash. The problem is precisely in https://github.com/jjpavlik/homemetrics/blob/master/collector.py#L77 I should be appending message instead of measure.
Pull request done https://github.com/jjpavlik/homemetrics/pull/16. Closing this. Anything similar problems should be dealt on a new Issue
As far as I an see, the Rx/Tx bytes counters seem to be wrapping around quite often, I have the impression they are being handled as unsigned int instead of unsigned long.
The problem right now seems to be the call to __uint_to_char_array() is expecting an unsigned int number.