Closed williamcodes closed 8 years ago
Readings have stopped being transmitted again. It appears that the server thinks they have all been transmitted, however based on this relay hub page most readings have not yet been transmitted.
Are you sure all readings stopped being transmitted, or just that cell's?
That cell is getting a 400 from the main server, due to a duplicate reading:
> journalctl -ru batch
…
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: ERROR:root:request {"reading": {"verification": "c0ffee", "temp": 69.2, "sensor_name": "0013a20040d847e2", "time": 1450139524.0}} got 400 response {"code":400,"error":"Already a reading for that sensor at that time"}
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): heatseeknyc.com
Dec 15 17:16:46 setup.heatseeknyc.com docker[982]: INFO:root:POSTing {'verification': 'c0ffee', 'temp': 69.2, 'sensor_name': '0013a20040d847e2', 'time': 1450139524.0}...
If we look for that sensor and time in the db, there is indeed a duplicate:
> bash relay/db/connect.sh
>> select * from temperatures where cell_id='0013a20040d847e2' and hub_time=to_timestamp(1450139524);
id | hub_id | cell_id | temperature | sleep_period | relay | hub_time | time | relayed_time
-------+------------------+------------------+-------------+--------------+-------+------------------------+-------------------------------+-------------------------------
30342 | 0013a20040d847e6 | 0013a20040d847e2 | 69.2 | 359000 | t | 2015-12-15 00:32:04+00 | 2015-12-15 00:32:09.087055+00 | 2015-12-15 14:30:58.721908+00
30344 | 0013a20040d8480d | 0013a20040d847e2 | 69.2 | 359000 | t | 2015-12-15 00:32:04+00 | 2015-12-15 00:32:12.241074+00 |
(2 rows)
So anyway, I don't know why there's a duplicate, but apparently they can happen. I think for now I'll just delete it, but if we see duplicates often then we'll need to think of an actual solution.
oh i only just noticed that the hub_id
is different on the otherwise duplicate readings, interesting…
so it would seem that swapping hub chips can somehow lead to a duplicate. strange.
anyway, reading is deleted, so everything should be transmitting soon.
Actually, there seem to be other duplicates from chip swapping, so strange…
select cell_id, hub_time, count(*) from temperatures group by cell_id, hub_time having count(*) > 1 order by hub_time desc;
Interesante. Was the 10 second pause between readings coming from throttling in the relay server code? If so, we might want to reduce that to 1 second. If not, we should check and see if we should get a bigger machine for the relay server.
According to free -m
total used free shared buffers cached
Mem: 493 479 14 19 52 247
-/+ buffers/cache: 178 314
Swap: 0 0 0
Looks like haven't hit swap file usage but we're getting pretty close. Would a bigger machine help?
I actually tried switching to a machine with 2x the RAM and it filled up almost immediately, which makes me think there's some flexible stuff going on (e.g. something is opportunistically allocating memory based on what's available?) but we should look into it. And perhaps either way we should just go for it, see if it helps.
I haven't had a chance to look at the logs for the 504s yet, but once we do we can hopefully figure out if it's related to RAM or some other wacky thing.
But anyway that's unrelated to this issue as far as I can tell, doesn't look like the batch
process ever died, it just gave up on cells with duplicate readings.
I checked several users before we deleted the duplicates and the relay server hadn't transmitted readings since 4am. So I guess that means they all had dupes? I wish I'd made a note of which users I checked. I think it was 08C5 and 3840, which have since recovered, and 6703, which is still down, likely due to duplicates. So once it hit the 400 bad request enough times it stopped trying to transmit the reading? Or it continued trying but could never get to the next reading?
Fixed by 405b09b - we just ignore 400s now, unlike 404s which cause us to skip all subsequent readings for that cell. (They used to be 500s, otherwise I probably would have initially done it this way.)
Woohoo!!
At around 4am this morning (Tue 12/15/2015) the relay server stopped relaying readings to the main app. The last reading came in at 4:14am from sensor FCD0. I restarted the server and they started transmitting readings once every 10 seconds.