Open adnbr opened 8 years ago
tagging @pbrook and attaching report. error.out.txt
Yup. EEPROM is full. Not easily fixed, I'll have a think about how best to do it. May require a bit of a rebuild.
@snowdenator is building physical hardware and looking into expanding with external EEPROM.
It would be good if the door/doord interaction didn't cause the door to think the server had gone away when the cache is full.
Whilst it did alert us to the problem, it meant neither '#' to open the door nor '#' to ring the bell was working on a Tuesday evening, which is far from ideal.
I'm inclined to say this falls into the "Don't do that" class of problems, at least from the perspective of the doorlock firmware. Failure is bad, but something has to give and I'm not sure any of the alternate failure modes are any better. TBH I'm impressed it failed as gracefully as it did.
Partially this is a symptom of changing requirements. At some point I will have calculated 1k/14 and decided that was a sufficiently large number :-)
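For reference, the arithmetic behind that figure can be sketched as follows. This assumes "1k" means 1024 bytes of EEPROM and "14" is the number of bytes used per stored card; neither is spelled out in this thread, and any space reserved for other settings is ignored.

```python
# Assumptions (not confirmed in this thread):
#   1k  = 1024 bytes of on-chip EEPROM
#   14  = bytes used by each stored card record
EEPROM_BYTES = 1024
BYTES_PER_CARD = 14

# Only whole records fit, so use integer division.
max_cards = EEPROM_BYTES // BYTES_PER_CARD
leftover_bytes = EEPROM_BYTES % BYTES_PER_CARD

print(f"{max_cards} cards fit, {leftover_bytes} bytes left over")
```

Under those assumptions the limit is a few dozen cards, which is easy to outgrow as membership increases.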
As a side note, I'm not sure "cache" is an appropriate term here. It is a fundamental feature of the system, not an optimization - which IMO helps justify why asking for the impossible results in catastrophic failure.
If the eeprom is empty (e.g. on first power-up), does the code talk to the server to fetch the card details and then store them locally? I'm thinking that if a card isn't found in the eeprom, then it must be downloaded and stored. That opens up an option for a more graceful failure mode: the lock talks to the door server for card checking should something happen to the eeprom and it loses cards.
Part of the handshaking done when doord connects to the lock is uploading the card list.
So getting around an EEPROM failure would require some kind of command or ability to query the server about a card. That still leaves the issue of what happens if doord is down and the EEPROM is corrupt/bad/on fire.
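To make the proposal concrete, here is a minimal sketch of a "local first, then ask the server" lookup. It is only an illustration of the idea being discussed, not how the lock firmware or doord work today; `check_card`, `local_cards` and `query_server` are hypothetical names.

```python
def check_card(card_id, local_cards, query_server=None):
    """Return True if the card should be accepted.

    local_cards:  set of card IDs held in the lock's own storage.
    query_server: hypothetical callable that asks doord about one card,
                  or None when there is no connection to doord.
    """
    # Normal path: the card is already stored locally.
    if card_id in local_cards:
        return True

    # Fallback path: ask the server, if there is one to ask.
    if query_server is None:
        return False
    try:
        return bool(query_server(card_id))
    except ConnectionError:
        # Server went away mid-query: nothing left to fall back on.
        return False
```

Note that the fallback only helps while doord is reachable. A card that is missing locally is still refused when the server is down.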
A key design requirement is that core doorlock functionality continue working when the server goes down.
I'm not sure there's any point trying to work around a corrupt eeprom, and we need to be able to trust the hardware. Partly that's because I have no confidence in being able to predict what sort of failures are likely, or to usefully recover from most of the ones I can imagine. In practice software errors are a much more likely cause of catastrophic failure. Keeping the firmware as simple as possible is IMO the best way of avoiding that.
The failure we experienced on Tuesday is that when doord uploads the list of cards, the lock (correctly) rejects the ones it can't store. This causes doord to abort the connection, as something is clearly wrong.
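The sequence described above can be modelled roughly like this. It is a toy simulation rather than the real doord or firmware code; the `Lock` class, `LockFull` exception and `upload_cards` function are made up for illustration.

```python
class LockFull(Exception):
    """Raised by the toy lock model when it has no room for another card."""


class Lock:
    """Hypothetical stand-in for the doorlock's fixed-size card store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cards = []

    def store(self, card_id):
        # The lock correctly refuses a card it cannot store.
        if len(self.cards) >= self.capacity:
            raise LockFull(card_id)
        self.cards.append(card_id)


def upload_cards(lock, card_ids):
    """Model of the handshake upload: True if every card was accepted."""
    for card_id in card_ids:
        try:
            lock.store(card_id)
        except LockFull:
            # Any rejection is treated as "something is clearly wrong",
            # so the whole connection is abandoned.
            return False
    return True
```

With a capacity of 2 and three cards to upload, the first two are stored, the third is rejected, and the upload reports failure, which is the point where the door ends up behaving as if the server had gone away.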
I'm reluctant to change this behavior by adding complex recovery paths. These almost never get tested, so in practice they rarely do what you want when an error occurs. I've seen too many systems in only-sorta-working limbo after a "smart" recovery. Having the response to errors be "hit reset button and start from scratch" makes it much easier to convince yourself that you fully understand the implications of failure.
If we want to handle this case gracefully then we need to do it at a higher level, i.e. doord and/or the membership management system (which doesn't exist yet) knows what the limit of the doorlock hardware is, and applies some policy to decide who does not get reliable 24h access.
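One possible shape for such a policy, as a rough sketch: the server side is told the lock's capacity and only uploads the highest-priority cards. The function name, the `(card_id, priority)` tuples and the idea of a numeric priority are all assumptions for illustration; nothing like this exists in doord as far as this thread says.

```python
def select_for_lock(members, capacity):
    """Split members into cards to upload and cards left off the lock.

    members:  list of (card_id, priority) tuples, where a lower priority
              number means the card matters more (hypothetical scheme).
    capacity: how many cards the doorlock hardware can store.
    """
    # sorted() is stable, so members with equal priority keep their
    # original order.
    ranked = sorted(members, key=lambda member: member[1])
    upload = [card_id for card_id, _ in ranked[:capacity]]
    left_off = [card_id for card_id, _ in ranked[capacity:]]
    return upload, left_off


upload, left_off = select_for_lock(
    [("alice", 2), ("bob", 1), ("carol", 3), ("dave", 2)],
    capacity=3,
)
print(upload, left_off)
```

The cards in `left_off` are the members who would not get reliable 24h access, so the choice of priority scheme is a membership policy question rather than a firmware one.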
A suitably large external EEPROM should simply solve the problem, should it not?
An external eeprom should provide sufficient capacity for all realistic scenarios, yes. Adding it isn't trivial, but should be doable. I have got some on order.
I'm also working on some software tweaks that will give us a bit more headroom on the current hardware. Hopefully I'll get chance to try those out tomorrow.
Lock firmware has been upgraded, should be good for 100+ tags.
Downstairs lock will not talk to upstairs when it has a full cache.
@iMartyn has the error report from the door.