jjpavlik / homemetrics

1 stars 0 forks source link

Rx/Tx counters are wrapping around two often #12

Closed jjpavlik closed 4 years ago

jjpavlik commented 4 years ago

As far as I an see, the Rx/Tx bytes counters seem to be wrapping around quite often, I have the impression they are being handled as unsigned int instead of unsigned long.

The problem right now seems to be the call to __uint_to_char_array() is expecting an unsigned int number.

jjpavlik commented 4 years ago

So, this attempt didn't really improve the situation, there's clearly something else going on. I caught a few higher than 65k values on Rx bytes, which sort of suggests it is indeed going beyond the 16 bits, however is not consistently growing like I would expect. There's something in the code that is either resetting the value or something similar.

Next step is to add the chance of pushing these metrics Rx/Tx, Errors, Rp/Tp to CW to see if I can spot any particular pattern.

-> https://github.com/jjpavlik/homemetrics/tree/Issue-%2312

jjpavlik commented 4 years ago

Having a look at the CW metrics after a few hours, the following is clear:

Note: I need to increase the number of messages retrieved from SQS in pusher.py since the number of queued messages grew considerably since started pushing the other metrics.

jjpavlik commented 4 years ago

Think I got it!!! 98K is in the boundary of the 6 characters, which is exactly METRIC_BUFF_SIZE right now:

``

define METRIC_BUFF_SIZE 6

``

Looks like within __uint_to_char_array() I'm doing a bad job and keeping things control within the while loop. There's a chance the bad boundary control is causing the Arduino code to crash.

jjpavlik commented 4 years ago

Looking better now!

Screenshot 2020-02-02 at 08 23 08
jjpavlik commented 4 years ago

Now the problem doesn't seem to be related to integer wrapping but something else

Screenshot 2020-03-07 at 19 41 31

This happened at ~00:50 UTC on February 17th and then again at ~14:47 UTC on March 6th, numbers are not even close one from the other so is something different this time.

jjpavlik commented 4 years ago

Event on February 17th:

Event on March 6th:

These events SHOULD NOT be causing Arduino crashes... I'm clearly sending some bad data to the Arduino when these APIs failed... I'm inclined to believe the problem actually is the connection issue with Openweather (connection is being refused), maybe I'm passing bad data to the arduino when this happens. It's fun though how both endpoints start failing at the same time...

jjpavlik commented 4 years ago

BTW... the previous logs were on the collector. The pusher shows exactly the same, struggling to reach SQS endpoint in the same timeframe:

2020-02-17 01:08:52,551 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:09:19,459 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:11:31,311 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:11:55,980 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:12:20,124 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:12:50,945 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:13:21,439 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-02-17 01:13:47,641 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:07,054 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:23,381 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:50:50,831 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/" 2020-03-06 14:51:17,108 - root - ERROR - Could not connect to the endpoint URL: "https://eu-west-1.queue.amazonaws.com/"

jjpavlik commented 4 years ago

So:

Feb 17 01:08:33 raspberrypi kernel: [1392279.420920] smsc95xx 1-1.1:1.0 eth0: link down Feb 17 01:09:12 raspberrypi kernel: [1392318.725167] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xC5E1

That explains the name resolution issues, however considering the interface was up by 01:09 this can't really explain the Connection Refused errors between 01:11 and 01:13.

Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting address 2a02:8084:2562:9e00:438b:8b4d:ce4:a621/64 Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Withdrawing address record for 2a02:8084:2562:9e00:438b:8b4d:ce4:a621 on eth0. Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting default route via fe80::5667:51ff:fe4d:38e6 Mar 6 14:49:31 raspberrypi dhcpcd[322]: eth0: deleting route to 2a02:8084:2562:9e00::/64 Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Leaving mDNS multicast group on interface eth0.IPv6 with address 2a02:8084:2562:9e00:438b:8b4d:ce4:a621. Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::dbef:bf46:6247:d5c0. Mar 6 14:49:31 raspberrypi avahi-daemon[294]: Registering new address record for fe80::dbef:bf46:6247:d5c0 on eth0.*.

In both cases the mechanism to protect the collected metrics and store them in memory to push them later on once the connectivity issue is sorted failed :).

Mar 6 14:49:55 raspberrypi collector.sh[8404]: Traceback (most recent call last): Mar 6 14:49:55 raspberrypi collector.sh[8404]: File "collector.py", line 233, in Mar 6 14:49:55 raspberrypi collector.sh[8404]: main() Mar 6 14:49:55 raspberrypi collector.sh[8404]: File "collector.py", line 193, in main Mar 6 14:49:55 raspberrypi collector.sh[8404]: store_collected_metric(configuration, timestamp, dev.get_name(), dev.get_sensor_name(sensor), measure) Mar 6 14:49:55 raspberrypi collector.sh[8404]: File "collector.py", line 77, in store_collected_metric Mar 6 14:49:55 raspberrypi collector.sh[8404]: MEASUREMENTS.append(measure) Mar 6 14:49:55 raspberrypi collector.sh[8404]: UnboundLocalError: local variable 'measure' referenced before assignment

causing the collector to crash. The problem is precisely in https://github.com/jjpavlik/homemetrics/blob/master/collector.py#L77 I should be appending message instead of measure.

jjpavlik commented 4 years ago

Pull request done https://github.com/jjpavlik/homemetrics/pull/16. Closing this. Anything similar problems should be dealt on a new Issue