heatseeknyc / relay

web app for setting up sensors, receiving and storing their data, and viewing it

relay server stopped relaying readings to main app #9

Closed williamcodes closed 8 years ago

williamcodes commented 8 years ago

At around 4am this morning (Tue 12/15/2015) the relay server stopped relaying readings to the main app. The last reading came in at 4:14am from sensor FCD0. I restarted the server and they started transmitting readings once every 10 seconds.

williamcodes commented 8 years ago

Readings have stopped being transmitted again. The server appears to think they have all been transmitted; however, based on this relay hub page, most readings have not yet been transmitted.

hrldcpr commented 8 years ago

Are you sure all readings stopped being transmitted, or just that cell's?

That cell is getting a 400 from the main server, due to a duplicate reading:

> journalctl -ru batch
…
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: ERROR:root:request {"reading": {"verification": "c0ffee", "temp": 69.2, "sensor_name": "0013a20040d847e2", "time": 1450139524.0}} got 400 response {"code":400,"error":"Already a reading for that sensor at that time"}
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): heatseeknyc.com
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: INFO:root:POSTing {'verification': 'c0ffee', 'temp': 69.2, 'sensor_name': '0013a20040d847e2', 'time': 1450139524.0}...

If we look for that sensor and time in the db, there is indeed a duplicate:

> bash relay/db/connect.sh
>> select * from temperatures where cell_id='0013a20040d847e2' and hub_time=to_timestamp(1450139524);
  id   |      hub_id      |     cell_id      | temperature | sleep_period | relay |        hub_time        |             time              |         relayed_time
-------+------------------+------------------+-------------+--------------+-------+------------------------+-------------------------------+-------------------------------
 30342 | 0013a20040d847e6 | 0013a20040d847e2 |        69.2 |       359000 | t     | 2015-12-15 00:32:04+00 | 2015-12-15 00:32:09.087055+00 | 2015-12-15 14:30:58.721908+00
 30344 | 0013a20040d8480d | 0013a20040d847e2 |        69.2 |       359000 | t     | 2015-12-15 00:32:04+00 | 2015-12-15 00:32:12.241074+00 |
(2 rows)

So anyway, I don't know why there's a duplicate, but apparently they can happen. I think for now I'll just delete it, but if we see duplicates often then we'll need to think of an actual solution.
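For illustration, here's a minimal sketch of that cleanup, using an in-memory sqlite3 table as a stand-in for the real Postgres temperatures table (column set trimmed to what's shown above; the ids and values come from the query output):

```python
import sqlite3

# Toy stand-in for the Postgres temperatures table.
conn = sqlite3.connect(":memory:")
conn.execute("""create table temperatures (
    id integer primary key, hub_id text, cell_id text,
    temperature real, hub_time text, relayed_time text)""")
conn.executemany(
    "insert into temperatures values (?,?,?,?,?,?)",
    [(30342, "0013a20040d847e6", "0013a20040d847e2", 69.2,
      "2015-12-15 00:32:04", "2015-12-15 14:30:58"),
     (30344, "0013a20040d8480d", "0013a20040d847e2", 69.2,
      "2015-12-15 00:32:04", None)])

# Delete every row that shares (cell_id, hub_time) with a lower-id row,
# i.e. keep the earliest insert and drop the later duplicates.
conn.execute("""delete from temperatures where id not in (
    select min(id) from temperatures group by cell_id, hub_time)""")

rows = conn.execute("select id from temperatures").fetchall()
print(rows)  # only id 30342 survives
```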

hrldcpr commented 8 years ago

oh i only just noticed that the hub_id is different on the otherwise duplicate readings, interesting…

so it would seem that swapping hub chips can somehow lead to a duplicate. strange.

anyway, reading is deleted, so everything should be transmitting soon.

hrldcpr commented 8 years ago

Actually, there seem to be other duplicates from chip swapping, so strange…

hrldcpr commented 8 years ago

select cell_id, hub_time, count(*) from temperatures group by cell_id, hub_time having count(*) > 1 order by hub_time desc;

williamcodes commented 8 years ago

Interesting. Was the 10 second pause between readings coming from throttling in the relay server code? If so, we might want to reduce that to 1 second. If not, we should check whether we need a bigger machine for the relay server.

williamcodes commented 8 years ago

According to free -m

             total       used       free     shared    buffers     cached
Mem:           493        479         14         19         52        247
-/+ buffers/cache:        178        314
Swap:            0          0          0

Looks like we haven't hit swap usage, but we're getting pretty close. Would a bigger machine help?

hrldcpr commented 8 years ago

I actually tried switching to a machine with 2x the RAM and it filled up almost immediately, which makes me think something flexible is going on (e.g. something opportunistically allocating memory based on what's available?), but we should look into it. And perhaps we should just go for it either way and see if it helps.

I haven't had a chance to look at the logs for the 504s yet, but once we do we can hopefully figure out if it's related to RAM or some other wacky thing.

But anyway that's unrelated to this issue as far as I can tell, doesn't look like the batch process ever died, it just gave up on cells with duplicate readings.

williamcodes commented 8 years ago

I checked several users before we deleted the duplicates, and the relay server hadn't transmitted readings for any of them since 4am. So I guess that means they all had dupes? I wish I'd made a note of which users I checked. I think it was 08C5 and 3840, which have since recovered, and 6703, which is still down, likely due to duplicates. So once it hit the 400 Bad Request enough times, did it stop trying to transmit that reading? Or did it keep trying but could never get to the next reading?

hrldcpr commented 8 years ago

Fixed by 405b09b - we just ignore 400s now, unlike 404s which cause us to skip all subsequent readings for that cell. (They used to be 500s, otherwise I probably would have initially done it this way.)
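To sketch the new behavior (names like relay_cell and post_reading are illustrative, not the actual relay code): a 400 means the main app already has that reading, so we mark it relayed and move on; a 404 means something is wrong with the cell upstream, so we stop and retry the remaining readings next run, preserving order.

```python
def relay_cell(readings, post_reading):
    """Relay one cell's queued readings in order.

    post_reading(reading) -> HTTP status code from the main app.
    Returns the readings that should be marked as relayed.
    """
    relayed = []
    for reading in readings:
        status = post_reading(reading)
        if status in (200, 400):      # success, or duplicate already stored
            relayed.append(reading)   # treat both as done
        elif status == 404:           # cell unknown upstream: stop here and
            break                     # retry this and later readings next run
    return relayed
```

With this, a single duplicate (400) no longer blocks every subsequent reading for that cell the way the old behavior did.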

williamcodes commented 8 years ago

Woohoo!!