lwa-project / analog_signal_processor

Analog Signal Processor (ASP) Monitor and Control Software
GNU General Public License v2.0

revH_compat - `asp_cmnd.py` fails when etcd fills up #9

Closed jaycedowell closed 2 months ago

jaycedowell commented 1 year ago

From the logs:

2023-09-27 06:04:21 [WARNING ] Could not get channel config. for board 8251: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "etcdserver: mvcc: database space exceeded"
        debug_error_string = "{"created":"@1695794661.429931894","description":"Error received from peer ipv4:127.0.0.1:2379","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"etcdserver: mvcc: database space exceeded","grpc_status":8}"
>

Blowing away the current etcd database fixes this but it will happen again.

jaycedowell commented 1 year ago

@ctaylor-physics you should probably be aware that this can happen.

jaycedowell commented 1 year ago

This seems to have happened again.

jaycedowell commented 1 year ago

I tried this:

import etcd3

c = etcd3.Etcd3Client('127.0.0.1')

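# Find the oldest modification revision among all stored keys
r = 1e12
for k in c.get_all():
    if k[1].mod_revision < r:
        r = k[1].mod_revision
# Round down to a multiple of 100
r = (r//100)*100

# Compact away every revision older than r
c.compact(r, physical=True)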

The database size, i.e., `c.status().db_size`, didn't change much. Adding a `c.defragment()` after the compaction led to a much smaller database size, but INI still doesn't work.

Update: SHT then INI also doesn't work.

Update: Neither does restarting the ASP MCS service.

Update: Neither does restarting the etcd service.

Update: Neither does restarting the machine.

Final Update: The secret seems to be that after freeing up space you need to clear the "NOSPACE" alarm with `ETCDCTL_API=3 etcdctl alarm disarm`. This could probably have been done as part of that Python sequence by adding a `c.disarm_alarm()` after the defragment call.

In any case, I think the path forward is to add some kind of daily/weekly maintenance to `asp_cmnd.py`: something that compacts all but the N (maybe N=1000? 10000?) most recent revisions, defragments, and then clears the alarm for good measure.
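
A rough sketch of what that maintenance step might look like, assuming the python-etcd3 client used above. The function name `compact_etcd` and the `keep` parameter are made up for illustration, and the newest `mod_revision` among the stored keys stands in for the current store revision, so treat this as an outline rather than the actual fix:

```python
def compact_etcd(client, keep=1000):
    """Compact all but roughly the `keep` newest revisions, defragment,
    and clear any alarm.

    `client` is assumed to behave like a python-etcd3 client (get_all,
    compact, defragment, disarm_alarm). Returns the revision that was
    compacted to, or None if there was too little history to compact.
    """
    # Newest modification revision among the keys currently stored;
    # used here as an approximation of the current store revision.
    latest = max((meta.mod_revision for _, meta in client.get_all()), default=0)

    target = latest - keep
    if target > 0:
        # Drop every revision older than `target`
        client.compact(target, physical=True)
    else:
        target = None

    # Compaction alone does not shrink the file on disk; defragment does
    client.defragment()

    # Clear a lingering NOSPACE alarm so writes are accepted again
    client.disarm_alarm()
    return target
```

On the ASP machine this could be called as `compact_etcd(etcd3.Etcd3Client('127.0.0.1'), keep=1000)`, either from a periodic task inside the service or from a standalone cron script.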

jaycedowell commented 1 year ago

Added compactEtcd.py to root's crontab to run every Sunday.

jaycedowell commented 8 months ago

This is still a problem.

jaycedowell commented 2 months ago

Fixed with #13.