Persistence of rolling codes

rgr101 commented 4 months ago

Describe the new feature you'd like

How are the sensitive rolling codes protected from a sudden power interruption?

I have dug a little deeper into the code, but so far I have only found the backup/restore function, which, as far as I understand, is called explicitly on demand by the user in the GUI. If there is already another automated mechanism behind the scenes that secures the persistence of the rolling codes, I am happy. If not, I see an opportunity, and a need, for improvement here. I have some implementation ideas for this (including protection against wearing out the flash), which I would be very happy to contribute.

rstrouse commented 4 months ago

Every rolling code is also retained in NVS. This is how you can restore an old backup without destroying the sequential codes.

rgr101 commented 4 months ago

Ah I see, sorry, I must have misunderstood the corresponding passage in the backup & restore section. Honestly, I had already guessed that, as you really seem to have thought of everything in your masterpiece 👍

Still, may I ask how and at what intervals the rolling codes are written to the NVM? (I'm a bit worried about whether the controller will really last 10 or 20 years without the flash wearing out. I've had some bad experiences with dead SD cards in Raspberry Pis and my Loxone server...)

rstrouse commented 4 months ago

They are transactional. The NVS uses wear leveling, so you have nothing to worry about here. If a failure occurs, it will not be because the NVS wore out. You will wear out your motors before the controller sends enough commands to even get close to the block marking. The LittleFS persistence is only written on its cycle and only when it is dirty, so it only persists when a movement has completed or when any of the configuration is saved. This is part of the reason I do not write logs to LittleFS.

rgr101 commented 4 months ago

Sure. I think you should be able to count on the wear leveling of LittleFS (I assume it uses Espressif's wear levelling API?). And you're right: I once extrapolated (under the pessimistic assumption of only 10,000 write cycles until the flash wears out) that a life span of 20 years allows more than 600 movements per day! That should be far more than enough for my 22 shutters, which now lets me sleep peacefully 😅

Anyway, just in case it becomes a topic (e.g., if Espressif's wear levelling turns out not to be quite as perfect...): my idea would have been a relatively simple mechanism that exploits the rolling code tolerance (of ~+100?), minus an adequate safety margin, and triggers the write to the NVS only after "enough" movements have accumulated. The threshold is still to be determined, e.g., when the first of all blinds has performed more than a tolerable number of movements since its last backup. At that point the rolling codes of all shades are written to the NVS in one transaction. I believe this could reduce the number of write operations by more than a factor of 100. Just something to consider, in case it becomes necessary someday...

Something else - regarding the impending memory shortage issue: You noted that increasing SOMFY_MAX_SHADES can cause you to run out of memory. Conversely, does this mean that decreasing this value (I only have 22 shades and will never have more) would save memory and thus increase the stability of the system (probably not much, but still)?

rgr101 commented 4 months ago

(Of course, after power-on and restoring the last backup, the rolling codes would need to be increased by an offset of that tolerance value + x, since it must be assumed that additional unsaved movements may have occurred since the last backup. This ensures that the next transmitted rolling code is sufficiently high but not too high; the "+ x" adds extra safety. The tolerance value, still to be defined, must be small enough to keep sufficient distance from the assumed RTS tolerance of "approx. 100". On the other hand, it should be large enough to yield a worthwhile effect. I believe a good value will likely be between 10 and 20...)
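The deferred-write idea described above can be sketched roughly as follows. This is only an illustration of the proposal, not how ESPSomfy-RTS works: the class `DeferredCodeStore`, the constants, and the in-memory `persisted` dictionary standing in for the NVS are all hypothetical, the tolerance of 100 is the unverified figure from this thread, and rollover of the code is ignored for brevity.

```python
from dataclasses import dataclass, field

ASSUMED_RTS_TOLERANCE = 100  # rumored receiver window; unverified assumption
FLUSH_THRESHOLD = 15         # hypothetical value from the suggested 10-20 range
SAFETY_MARGIN = 5            # the "+ x" from the comment; hypothetical

# The skip applied after a power loss must stay inside the assumed window.
assert FLUSH_THRESHOLD + SAFETY_MARGIN < ASSUMED_RTS_TOLERANCE


@dataclass
class DeferredCodeStore:
    """Keeps rolling codes in RAM and persists them in batches."""
    live: dict = field(default_factory=dict)       # shade id -> current code
    persisted: dict = field(default_factory=dict)  # stand-in for the NVS
    writes: int = 0                                # number of batch writes

    def next_code(self, shade_id):
        code = self.live.get(shade_id, 0) + 1
        self.live[shade_id] = code
        # Persist all shades once any single shade has drifted far enough.
        # The flush happens before the code is returned for transmission.
        if code - self.persisted.get(shade_id, 0) >= FLUSH_THRESHOLD:
            self.flush()
        return code

    def flush(self):
        self.persisted = dict(self.live)
        self.writes += 1

    def recover_after_power_loss(self):
        # Unsaved movements may have happened since the last flush, so
        # every shade skips ahead by the threshold plus the margin.
        skip = FLUSH_THRESHOLD + SAFETY_MARGIN
        self.live = {s: c + skip for s, c in self.persisted.items()}
```

With these example values, a shade that moved 18 times before a power loss (last flush at 15) would resume at 35 and transmit 36 next, which is ahead of anything the motor has seen but well inside the assumed window.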

Oops, a potential overflow of the 2-byte rolling code just occurred to me! Admittedly, 64k movements may sound like a lot for a single shade, but projected over 20 years, it suddenly becomes a realistic consideration... but I'm sure you've already taken care of that :)

rstrouse commented 4 months ago

While the ESP32 wear leveling is not perfect in that it does not trim(), the other components on the board are more likely to fail over time. Eventually the motor will reject the rolling code if it is not correct and blacklist the address, so incrementing the rolling code will cause the commands to start failing.

If you did 600 movements per day, you will be replacing the motors long before the ESP gives up the ghost.

The rollover of the rolling code is handled. These are actually 15 bits, not 16, although they are unsigned and roll over to 0. Decreasing the number of shades will not change anything as far as stability goes, since there is ample memory at 32 shades, 16 groups, and 16 rooms to allow for the call stack.
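A wrap-around increment of the kind described here could look like the minimal sketch below. The bit width is taken from the comment above as an assumption, and `next_rolling_code` is a hypothetical helper for illustration, not a function from the firmware.

```python
CODE_BITS = 15                    # width as stated in this thread; assumption
CODE_MASK = (1 << CODE_BITS) - 1  # 0x7FFF for 15 bits


def next_rolling_code(code: int) -> int:
    """Increment a rolling code, wrapping to 0 past the maximum value."""
    return (code + 1) & CODE_MASK
```

Masking after the increment keeps the value unsigned and makes the step from the largest code back to 0 explicit.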

rgr101 commented 4 months ago

No no, I meant 600 NVS writes per day = 600 total movements across all blinds if written after each movement. Anyway, that is still very many per blind 😎

(One 4K block, wear leveled over an assumed 2M of memory: 2M / 4K * 10,000 / 20 / 365; without counting additional config backups)
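Spelled out, the estimate above works like this. All inputs are the assumptions from this thread (a 2M wear-leveled area, 4K blocks, 10,000 erase cycles per block, 20 years), and the calculation assumes ideal wear leveling where each write costs exactly one block erase.

```python
FLASH_BYTES = 2 * 1024 * 1024  # assumed 2M wear-leveled area
BLOCK_BYTES = 4 * 1024         # one 4K block per write
ERASE_CYCLES = 10_000          # pessimistic endurance per block
YEARS = 20

blocks = FLASH_BYTES // BLOCK_BYTES    # 512 blocks to spread writes over
total_writes = blocks * ERASE_CYCLES   # 5,120,000 block writes in total
writes_per_day = total_writes / (YEARS * 365)

print(f"{blocks} blocks, {writes_per_day:.0f} block writes per day")
# prints "512 blocks, 701 block writes per day"
```

That is where the "more than 600 per day" figure comes from, with some headroom left for configuration saves.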

rstrouse commented 4 months ago

Yep, you should be fine. Having written a few bits of software for the Pi, I can tell you the killer of SD cards isn't so much the number of writes as the frequency of them. The heat generated from hammering the card repeatedly for an extended period of time is a card death sentence.

Pi OS does enough of that on its own, but when folks try to log outside of a batch write, it is inviting failure. The Pi 5, however, has an NVMe interface that will negate all of that. For my HA instance, I use a mini-PC with an M.2 NVMe.

rstrouse commented 4 months ago

Good discussion. There are a lot more discussions about how one comes up with the duty cycle but know that every 4k block is not written on each transaction.

nbarrientos commented 4 months ago

Hi @rstrouse, @rgr101,

Sorry for making noise in a closed issue, but while searching for information I ended up here, and my question relates closely to what has been discussed.

I saw in the documentation that the last rolling code for each shade is part of the backup. In case of total hardware failure (ESPSomfy-RTS would have to be installed on a totally new device), recovering from a backup might not leave the system in a working state, as the rolling codes might have a big offset from what's known to the motor. Is this correct?

If that's the case a way to have some kind of protection would be to do backups periodically and store them in a safe place. Is there an API to externally invoke the backup operation and download the file programmatically? HTTP API via cURL or similar?

Thanks.

rgr101 commented 4 months ago

Hi @nbarrientos,

This is what I understood so far:

The last rolling code of a drive (this of course applies equally to all configured drives) is written to LittleFS after each operation, so the controller is insensitive to power failures etc.
During backup, the last rolling code is also written to the backup file.
As long as you do not change the hardware, restoring (any) backup will work, because the application compares the rolling code of the backup with that of LittleFS and takes the higher one.
If you are planning a hardware change, this is not a problem, as you can make a backup with the current rolling code just before the change.
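The "takes the higher one" behaviour, as understood in this comment, can be illustrated with the short sketch below. It is not taken from the firmware: `merge_rolling_codes` and the dictionary layout are hypothetical, and rollover of the code is ignored.

```python
def merge_rolling_codes(backup: dict, stored: dict) -> dict:
    """For each shade, keep the higher of the backup and on-device code."""
    shades = backup.keys() | stored.keys()
    return {s: max(backup.get(s, 0), stored.get(s, 0)) for s in shades}
```

On a freshly built controller `stored` would be empty, so the result falls back entirely to the codes in the backup file.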

Yes, I agree with you: If you have to completely rebuild the controller after a total failure, the up-to-dateness of the backup is crucial, because the LittleFS of the new controller is initially empty and only the rolling code of the backup can be used. So if the backup is very old, the rolling code stored there will most likely be too small and therefore out of sync. Apart from the regular backup, I can't think of anything really intelligent to do here - except perhaps to carefully patch the rolling code in the backup file (which is an editable text file) to a value increased by a maximum of 99 (if the rumor is true that the rolling code tolerance is +100). But this is by no means a sure thing, and would only work approximately if the backup file is not too old...

Another alternative would be to reset the rolling code on the drive itself. However, this does not work on all drives and has not been sufficiently tested or verified (see #293).

rstrouse commented 4 months ago

Actually the last rolling code is also stored in the NVS partition. The problem is that if you do a total rebuild the NVS is erased prior to recreating the partitions. There is no way around that on the ESP32.

However, there is a way to get a periodic backup. This can be done with the backup entity in HA or by sending it GET http://<ip address>/backup

If you want a response that is just a text output of the values, use GET http://<ip address>/shades.cfg
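For a scheduled job, the /backup request could be wrapped in a small script like the sketch below. Only the endpoint path comes from this thread; the function names, the timestamped file name, and the injectable `fetch` parameter are assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen


def backup_url(host: str) -> str:
    """Build the backup endpoint URL for a controller host or IP."""
    return f"http://{host}/backup"


def save_backup(host: str, dest_dir: str, fetch=urlopen) -> Path:
    """Download the backup and store it under a timestamped name."""
    with fetch(backup_url(host), timeout=10) as response:
        payload = response.read()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = dest / f"espsomfy-{stamp}.backup"  # file name is hypothetical
    target.write_bytes(payload)
    return target
```

Calling `save_backup("<ip address>", "/some/backup/dir")` from cron or a similar scheduler once a day would keep dated copies, so a reasonably recent rolling code is available if the controller ever has to be rebuilt.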

nbarrientos commented 4 months ago

Thanks! I'll schedule daily backups, then. Maybe it's worth adding a simple recipe to the documentation, something like:

wget --content-disposition http://espsomfyrts.lan/backup -P /backups/config/espsomfy --quiet

For info, even though my instance of ESPSomfyRTS is protected by username/password, the /backup endpoint dumps a full backup without any authentication needed. Not expected, I'd say :)

rstrouse / ESPSomfy-RTS

Persistence of rolling codes #274
