vapor-ware / synse-server

An HTTP API for monitoring and controlling physical and virtual devices
https://synse.readthedocs.io/en/latest/server/intro/
GNU General Public License v3.0
39 stars 11 forks

Synse v3 - Devices drop out of scan, then reappear. #362

Open MatthewHink opened 4 years ago

MatthewHink commented 4 years ago

My guess is that when devices are being refreshed, the device collection is cleared and rebuilt. This causes the scan to miss devices.

If that is what is happening, I propose rebuilding the device collection in a new area, then flipping the pointer to the rebuilt area once the rebuild is complete.

Sample scan with missing vem plc and ups devices: synse-scan-missing-vem-plc-and-ups.txt

edaniszewski commented 4 years ago

It looks like the deployment that the scan came from is using the internal modbus plugin, which is running into a connection-reset bug that causes some failures in that plugin. I'll try to get that resolved tomorrow, and I suspect that when it is, the scans will be fixed as a result.

If that doesn't fix the scan issue, I'll look deeper into it here. The caching is just in-memory (basically a dict) with TTLs on the values, but a device refresh task (which rebuilds the cache) runs more frequently than the TTL by default. When refreshing, the cache isn't cleared and then rebuilt; cached keys are replaced as we get data for them, so it shouldn't really be empty during a rebuild.

MatthewHink commented 4 years ago

Sounds like my idea was wrong. The UPS is SNMP, not modbus, so the linked issue seems different (?)

edaniszewski commented 4 years ago

Ah, whoops. I thought the UPS was the eGauge, which is also running over modbus. If it's SNMP, I believe you are right and it is a different issue. I'll dig through some logs and see what I can come up with.

MatthewHink commented 4 years ago

I think this is a problem on device reads as well. Attaching a log snippet / client-side error; curl is working now.

This looks like it's happening when rebuilding / updating the device cache and looking up device ID in cache.

We're serving up a 500 that is being logged at info rather than error, with not much more information than that (no stack trace).

It looks like it failed somewhere after here: https://github.com/vapor-ware/synse-server/blob/54b032966e2361bb2031cb45e732c2dc20f8e30e/synse_server/cache.py#L210

synse-500-2020-02-19.txt

MatthewHink commented 4 years ago

This one is not hard to hit. It is very problematic in my lock testing.

If a repro on an instrumented build with more tracing would be useful, we should get that set up and I can hit the issue.

MatthewHink commented 4 years ago

Between this and https://github.com/vapor-ware/synse-server/issues/363 I usually cannot make it through my lock write test. Considering these both priority 1.

test-stacks.txt

MatthewHink commented 4 years ago

These are happening at roughly 3-minute intervals.

MatthewHink commented 4 years ago

To state this clearly: if an HTTP connection fails due to a networking issue (cable fell out, router blip), that's fine. The core issue, I believe, is that Synse itself intermittently fails to serve the URL, which is a problem.

edaniszewski commented 4 years ago

Thanks for the logs/notes on this. I've been sifting through them this morning and am working on adding additional logging. It may take a bit since there is a bunch of stuff going on today, but hopefully by EOD/tomorrow, I'll have something with better log captures.

On first glance through the logs, it does seem like your initial diagnosis is correct, in that the cache rebuild temporarily causes device lookups to miss.

edaniszewski commented 4 years ago

Was able to reproduce locally. When I got the 500, I got the following error message:

{
  "http_code": 500,
  "description": "an unexpected error occurred",
  "timestamp": "2020-02-20T14:56:59Z",
  "context": "Task <Task pending coro=<Sanic.handle_request() running at /usr/local/lib/python3.6/site-packages/sanic/app.py:976>> got Future <Future pending> attached to a different loop"
}

so it seems like there is something attached to a different async loop somewhere... not quite what I was expecting to see, but good info to move forward with.

edaniszewski commented 4 years ago

continuing the conversation from: https://github.com/vapor-ware/synse-server/pull/364#issuecomment-589219914

I didn't yet have enough time to go through the logs closely and get a reproducible setup, but in skimming the logs around the device cache rebuild, it looks like a different number of devices is being reported at different times, and in one instance the rebuild appears to have only polled devices from a single plugin. I'll look more into this later/tomorrow. I've also started adding additional logging into Synse around these areas.

edaniszewski commented 4 years ago

I think I see what is causing the remaining issues here now. I'll spend some time today working through a solution that should close this out.

edaniszewski commented 4 years ago

Never mind the above comment. I added some additional work to improve various things, but I don't think it will actually affect anything related to this issue. Looking through the logs again, it's not clear why devices dropped out temporarily. More logging was added in the various commits, so if it happens again, it will hopefully be clearer.