juju / charm-helpers

unitdata: make keeping revision data optional (#736) #737

Closed xmkg closed 2 years ago

xmkg commented 2 years ago

The unitdata code keeps variable revisions in the kv_revisions table. As part of that, all of the environment variables are serialized into JSON and pushed into the table whenever the current set of environment variables differs from the previous one. But the code considers the environment variable list as a whole, so even if a single environment variable changes, the whole list gets pushed as a revision. Some of the environment variables (e.g. JUJU_CONTEXT_ID) change on every invocation, which means the whole environment variable list is serialized and pushed to the DB on every single hook invocation. This has become a problem in deployments with large (e.g. ~70 KiB) environment variable lists. Given that update-status runs periodically (every 5 minutes by default) and many charms run in the same environment, the disk space of the host running the charms runs out over time, requiring manual intervention to clear the kv_revisions table.
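
For reference, what effectively happens on every hook invocation is roughly the following (a sketch of the hook-recording path, not verbatim charm-helpers code):

    import os
    from charmhelpers.core import unitdata

    kv = unitdata.kv()
    with kv.hook_scope('update-status'):
        # The whole environment (~70 KiB in the deployments above) is
        # serialized as a single JSON blob; because e.g. JUJU_CONTEXT_ID
        # differs every time, the blob is considered changed and a new
        # kv_revisions row is written on every hook.
        kv.set('env', dict(os.environ))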

This fix makes keeping revisions optional and disabled by default.

Closes-Bug: LP#1930173

ajkavanagh commented 2 years ago

This is a reasonable change, but I wonder if we need to keep update-status changes for a year? I'd also be interested in the use cases for keeping this data at all. It's nice to have an audit table of all the changes, and I can think of lots of use cases for it, but what is it actually for? i.e. does anyone use the history?

xmkg commented 2 years ago

but what is it actually for? i.e. does anyone use the history?

My thoughts exactly. I've opted for a softer approach because I don't know the use cases and whether anybody uses it or not. Keeping the data around is good for diagnostic and debugging purposes, but right now, I don't see any point other than that. I've thought about many destructive and non-destructive solutions to the problem, but it is pointless without knowing the use cases.

lathiat commented 2 years ago

I had a look at this. While various charms make use of the configuration or relation diffing part of this functionality, I could not find any code actually checking the value of the environment variables. This code is super old (8+ years) and it always stores all three, so I suspect it was just added as a possibly-good idea at some point and never revisited.

In pretty much all cases I can find, this storage code is actually used via calls to "with unitdata.HookData()() as t", which gives you back a tuple where t[0] = instance of unitdata.Storage, t[1] = delta_config, t[2] = delta_relation.

The delta_config and delta_relation use the config and relation parts respectively, but not the environment. It seems the only way to explicitly check the environment state is to call t[0].get('env'). While there is plenty of code doing get/set for other names, I cannot find code doing that in any of my checked-out charms (basically every openstack charm, but not any others) grepping for combinations of "get" and "env", though it's a little harder to search all of GitHub for this pattern. Does anyone have a good idea for searching for this call pattern in all of the published charms, and/or any other ways people are likely to call into that function? I guess double quotes is one option.
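
For concreteness, a typical call site looks roughly like this (a sketch, using the names above):

    from charmhelpers.core import unitdata

    hook_data = unitdata.HookData()
    with hook_data() as t:
        kv, conf_delta, rels_delta = t   # t[0], t[1], t[2] as described
        # What call sites actually do: get/set their own keys via t[0].
        kv.set('unit-paused', True)
        # The only way to inspect the stored environment explicitly:
        previous_env = kv.get('env')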

While researching and understanding this code, I found that every call to "with unitdata.HookData()() as t" will create and then persist the hook data into the sqlite file. Most users of this API call unitdata.HookData()() every time it's used rather than using a global value - mostly because it's called from other charmhelpers functions. Since the environment doesn't change during the same hook invocation, that large set of data is only stored once, but the hook itself is recorded multiple times. For example, the cinder charm calls the flush function 3 times (and thus 3 entries of hook-status are recorded).

In the case of the cinder charm, during config_changed it calls charmhelpers.contrib.openstack.utils.is_unit_paused_set twice, which persists its own custom "kv" value using kv.set/get('unit-paused'), and then also calls charmhelpers.contrib.openstack.utils.config_value_changed. This helper function does its own db.get and db.set with the name of the config value - and ignores the fact that the Hook storage had already stored a separate copy of the config and returned the delta of that value having changed in t[1]. Or at least, it seems like it does, but as I'll discuss below, it actually doesn't.

Every call site in my checked-out set simply uses t[0] and never checks t[1] or t[2]. Everyone is get/setting their own key and not using -any- of the standard ones.

    def config_changed():
        # if we are paused, delay doing any config changed hooks.
        # It is forced on the resume.
        if is_unit_paused_set():
            log("Unit is pause or upgrading. Skipping config_changed", "WARN")
            return
        ...
        # overwrite config is not in conf file. so We can't use restart_on_change
        if config_value_changed('overwrite') and not is_unit_paused_set():
            service_restart('cinder-volume')

However, I then looked at the value of t[1]=conf_delta that you get from HookData()() and it always shows previous=None - i.e. it didn't have a copy of the old config to compare to - so it always claims to have changed. If you look at the sqlite db, the configuration is never written to the file. I guess this is a bug/oversight in the original code, or it was broken at some point. You can see from _record_hook that it stores 'env', 'unit' and 'relid' but never stores 'config' or 'rels', so these delta functions never actually work. The documentation for the delta function does note that you need to persist the value yourself.

    def _record_hook(self, hookenv):
        data = hookenv.execution_environment()
        # Deltas are computed against previously stored 'config'/'rels' data...
        self.conf = conf_delta = self.kv.delta(data['conf'], 'config')
        self.rels = rels_delta = self.kv.delta(data['rels'], 'rels')
        # ...but only 'env', 'unit' and 'relid' are ever persisted; 'config'
        # and 'rels' are never stored, so the deltas always see previous=None.
        self.kv.set('env', dict(data['env']))
        self.kv.set('unit', data['unit'])
        self.kv.set('relid', data.get('relid'))
        return conf_delta, rels_delta

Another detail here is that, depending on the implementation, multiple attempts to check if a config value changed in the same hook may return a different value. Indeed, the current implementation of charmhelpers.contrib.openstack.utils.config_value_changed only works the first time it's called. If multiple call sites try to check the same config value, the second call will compare to the value stored during the first call - and that is probably not expected. It is likely for this reason that the delta function does not persist/flush the value for you (as documented on line 133) - it needs to be done at the end of the hook, otherwise you'll compare to the wrong value.
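
For reference, the helper is roughly the following (paraphrased, not verbatim; the immediate db.set is the important part):

    from charmhelpers.core import unitdata
    from charmhelpers.core.hookenv import config

    def config_value_changed(option):
        # Has this config option changed since the value we last stored?
        hook_data = unitdata.HookData()
        with hook_data():
            db = unitdata.kv()
            current = config(option)
            saved = db.get(option)
            db.set(option, current)   # stored immediately, so a second call in
                                      # the same hook compares against 'current'
            if saved is None:
                return False
            return current != saved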

This was a bit of a brain dump and a lot of info, but I think the TL;DR here is that this API is a complete mess and it seems unlikely that anyone is really using any of the built-in storage, since both config and rels (which you're more likely to want) don't actually work - and I just don't see what info in the juju hook execution environment you would want to track a change of.

So while this change is brilliant in trying to preserve compatibility, I wonder if we can just delete/remove the env tracking entirely. Even if we don't, it seems like a good case for it being OK to accept the change. But if someone wants to try grepping/searching a larger set of charms, I am all for it.

ajkavanagh commented 2 years ago

Based on @lathiat's (extensive) analysis, I think we are in a strong position to just stop storing the env data at all. The downsides (running out of storage) greatly outweigh the upsides. If people think that's too strong, then an option to store it or not would be reasonable (defaulting to not storing it). In all the years I've been writing/debugging charms, I've never used this data, but that's just my anecdote.

xmkg commented 2 years ago

Based on @lathiat's (extensive) analysis, I think we are in a strong position to just stop storing the env data at all. The downsides (running out of storage) greatly outweigh the upsides. If people think that's too strong, then an option to store it or not would be reasonable (defaulting to not storing it). In all the years I've been writing/debugging charms, I've never used this data, but that's just my anecdote.

I second this. If there is any use for this, anyone can still opt in and continue using the old behavior, accepting the implications explicitly.

So I am going to change the proposal, basically, to:

    def set(self, key, value, keep_revisions=False):

instead of

    def set(self, key, value, delta_revisions=False):

... or alternatively, in the Storage constructor:

    def __init__(self, path=None, keep_revisions=False):

I am more inclined towards the second approach (taking the option in the constructor). @ajkavanagh, any opinions?
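
Concretely, the constructor approach would look something like this (a sketch only; exact guard placement to be confirmed in the diff):

    import json

    class Storage(object):
        def __init__(self, path=None, keep_revisions=False):
            self.keep_revisions = keep_revisions
            self.revision = None   # set while inside a hook scope
            # ... existing connection / table setup ...

        def set(self, key, value):
            serialized = json.dumps(value)
            # ... existing upsert into the 'kv' table ...
            if not self.keep_revisions or not self.revision:
                return value       # default: no kv_revisions row is written
            # ... existing kv_revisions insert/update ...
            return value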

xmkg commented 2 years ago

And one more thing: should we add cleanup code for existing deployments (e.g. in the charm upgrade hook)? Otherwise, we might have to deal with each existing deployment manually.

ajkavanagh commented 2 years ago

I am more inclined towards the second approach (taking the option in the constructor). @ajkavanagh, any opinions?

Yes, I would agree with that.

And one more thing: should we add cleanup code for existing deployments (e.g. in the charm upgrade hook)? Otherwise, we might have to deal with each existing deployment manually.

That might be tricky. I'm assuming this would be a delete/vacuum type thing to remove all of the env variable changes and compact the db file? I'm trying to think if there would be any downsides. I would lean towards automatic clean-up, though.
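
Manually, the delete/vacuum would be something like this (a hypothetical one-off clean-up; the path is just an example):

    import sqlite3

    # Example unit state db path; adjust per unit.
    db_path = '/var/lib/juju/agents/unit-cinder-0/charm/.unit-state.db'
    conn = sqlite3.connect(db_path)
    with conn:                  # commits the delete
        conn.execute('DELETE FROM kv_revisions')
    conn.execute('VACUUM')      # compacts the file; must run outside a transaction
    conn.close()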

xmkg commented 2 years ago

I'm trying to think if there would be any downsides. I would lean towards automatic clean-up, though.

@ajkavanagh alright, I'll propose clean-up in another MR so we can discuss it separately. Made the changes, ready for review.

ajkavanagh commented 2 years ago

So I'm a :+1: on merging this; anybody else feel it's a step too far?