We should keep AppCatalogEntries for all apps that are in use

marians commented 2 years ago

User story

As a user I want to be able to retrieve details about the app version in use, based on the AppCatalogEntry.
As a user I want to view the entire app release history in the web UI.

Details

In the web UI we offer details for every app installed in a workload cluster. Example:

Here, in the dropdown labelled as INSPECT VERSIONS, we display details from the AppCatalogEntry resource, for the currently installed version as well as for other available versions.

UX problem

In the case of an app version being installed that has no according AppCatalogEntry, this information is missing. Example:

Also by navigating into the Apps section, a user can no longer find any details about the installed version 1.16.1. The user also cannot find out which releases existed since the installed one and the latest one, and according an upgrade to the lastest v1 minor is not possible.

Solution

Provide AppCatalogEntry resources at least for all app versions installed.
Fetch app release history from a different source again?

marians commented 2 years ago

@gianfranco-l This appears to be quite severe to me. I'd be happy if honeybadger could take a look.

gianfranco-l commented 2 years ago

it sounds like some tech debt is showing up in the web UI: @piontec will describe the root cause of the problem, the long term solution and the workaround that he sees and we will discuss them async with @giantswarm/team-honeybadger . @uvegla will drive this and ensure that this is taken care of and implemented.

uvegla commented 2 years ago

A couple of things we talked about / I have in mind off the bat:

Knowing if something can be cleaned up or not needs to know of / have a look at all the App CRs.
- Reading all the App CRs can be very resource intensive
- It came up we could use caching, I think the problem with this could be that the App CRs are split to multiple app operators
- It came up we could maybe use finalizers.
- There are multiple app operators managing resources but it is the unique one that manages catalog entries (create / delete), so each of the app operators should have their own unique finalizer
- It came up to fall back to reading the metadata yaml directly from the repository (for example) but apparently it came up before and was discarded
- It came up that the ACEs could have an in-use field set in the status. It is easy to set, but how should the operator (multiple ones at the same time) know when to unset it / set it to false?
What about resources that are already deleted? Should be there some logic that adds them back?

uvegla commented 2 years ago

I feel that by the nature of App Platform we can only come up with an eventually consistent solution.

Some facts:

App CRs are reconciled by multiple app-operators
One of these app-operators, the one in the giantswarm namespace is the unique one. This is the one that currently reconciles catalog and ACEs. Therefor it does not naturally touch / know of all App CRs
There are catalog entries that are already deleted
The Apps themselves do not require the existence of the ACEs to get deployed, this is purely for the Web UI

Some thoughts:

We either keep it central in the unique app-operator
- PRO This is already the one, single point - we even label ACEs that they are managed by the unique app-operator
- PRO It is enough to update these app-operators
- CON Periodically inspecting all App CRs in the unique app-operator might take too long or send too much request to the Control PlaneAPI servers
Or we split it into all app-operator
- PRO We split the computational burden
- CON We break managing centrally
- CON It depends on all app-operators everywhere to be bumped to the given version or above

Some ideas

Each app-operator sets a unique finalizer on the ACEs when they encounter it being used on an App CR
- all pros and cons of the we split it into all app-operator above
- CON when the app-operator is removed it has to look at all the ACEs and remove it's finalizer (not sure we can do it with operatorkit)
- With labels we might be able to speed up the lookup, but this could end up bloating ACEs with a lot of labels as the label key has to be unique to the app-operator and the value does not even matter
We ideated about replacing how we manage Catalogs and ACEs in MCs, this might be a good opportunity to do it
- CON probably takes the most effort and since we do it decoupled from the App Platform review, we might end up with a suboptimal solution in the long run
Create a new reconciliation loop that does something only in the unique app-operator that:
- Processes all App CR changes
- Somehow cleverly utilizes go routines to decouple / not block the operator
- Idea here is that the whole thing could be done in a lazy manner, it is probably ok to slowly make the system consistent
- Probably still need some delays / breaks to not flood the Control Plane API
Use a watcher that runs in the unique app-operator only - similar to this: https://github.com/giantswarm/app-operator/blob/a655ea07f8b5dbf4520397f5b20b1afae0cf3a0b/service/watcher/appvalue/appvalue.go - that watches all App CR changes and acts on the ACEs
- Still need to prevent the catalog reconciliation loop to clean up what should not be

piontec commented 2 years ago

Some long-term perspective, mainly for @marians. When we first designed ACEs, we made an assumption that we represent each release as a separate ACE, so it's easy to manage (add/remove) them without changing too many objects at once (and with super easy logic). This has come back and hit our back, as in general we would produce too many ACEs and kill the API server performance. Because of that, we decided to limit the number of ACEs to the X most recent releases (currently, AFAIR, X = 5). This has some serious side effects as well, as you see now with happa.

As a result, we know we have to rework how ACEs are managed, but we really don't have time to work on it. Hence, we're looking for some solution here.

piontec commented 2 years ago

From my point of view, mostly based on the comments above, I see these 2 solutions:

Multi-finalizers

When a per-cluster/org app-op creates an instance of App CR (a Chart CR on a remote cluster), it adds its own finalizer to the ACE. It also adds info about how many times this specific app-op has deployed this app. If the counter ever reaches 0, this app-op removes its own finalizer. Now, the unique app-op, when it finds it fit to remove the specific ACE, just issues delete and that's it. When usage counters of all the app-ops will drop to 0, all app-ops will remove their finalizers and the object will be removed from k8s API server.

Pros:

No need to periodically scan all the App CRs anywhere. Individual app-ops just have to keep count of how many times each of them has deployed this version of ACE
doesn't change anything in App CR nor ACE CR

Cons:

implementation is not that simple anymore

Ideas:

instead of just counting how many times each ACE was deployed by each app-op, we can just make a full deployed-by: list in the status part of ACE. This list will include a reference to each App CR this ACE-version is deployed in. This has some extra benefits in clarity and building relations between objects, not controllers that handle them. It can also instantly answer questions like "where is this version deployed?". I like this much more than a cryptic counter of deployed-by-app-op-0.2.3-count: 3. In this case, an ACE can be removed when the deployed-by list is empty.

Introduce a new CR "done right", tentative name CatalogEntry

This would create a CatalogEntry per application name, but not version. For each application, versions and their properties will be held within a single CE as a list of entries based on the current status of ACE. Entries on the list always reflect what is in the Catalog.

One more thing we can do right if we go with this, is we can extract this to a separate operator and use kuebuilder for it already. This will limit the difference in how unique and non-unique app-ops work, it will start a migration from operatorkit that we need to drop anyway and will be a good intro for us for kubebuilder anyway. It will also make it easier to potentially replace or modify heavily App CRs and app-op in the future.

We risk here that we miss some even-bigger-scope of app-platform, that might show up after the app-platform review, but I'm OK to proceed with it because:

it's the best idea we have right now, and it would go into "best vision of app platform" with 95% certainty
It keeps the loose binding between App and (A)CE. Thinking in perspective: even if we totally remove App CRs from app platform and start using HelmRelease from flux directly everywhere, that kind of CE is still useful, as it's not bound to App nor HelmRelease. So this can withstand even big changes. Just improving ACEs with a list of App CRs they are default with doesn't offer that.

All in all, I think I'm for option 2, as it's more "right" and "long-term", but I'm also assuming that implementation of "option 2" is not much bigger than "option 1" anyway. Do you agree? I see them both as pretty big now :(

Also, @uvegla let me know if you agree that these 2 options make the most sense... maybe scanning all the App CRs when you want to "just" remove an ACE is not that bad. I'm not sure about operatorkit, but in kubebuilder client's request are definitely automatically cached.

uvegla commented 2 years ago

@piontec I like option 2 the most, but it also takes the most effort.

FTR this is what I came up with after digging around:

We add a new watcher that runs only in the unique app-operator - so we only need to update that one - and it runs the following algorithm:
- Start a watcher on all App CRs (but do not start reading from the channel)
- Get all App CRs and create a map of the current apps -> catalog, app, version
- Make sure all of them has a finalizer like giantswarm.app-operator/<APP_CR_NAMESPACE>.<APP_CR_NAME>
- Start processing events
- If an App CR is created, add it to the map
  - If the ACE exists, add a finalizer
  - If not, create it and add a finalizer (reading the index might take time here)
- If an App CR is modified
  - Check if any of the catalog, app or version changed
  - If yes, remove finalizer from old and add to new one (create new one if not exists)
  - If not changes, continue (or make sure finalizer is set, but not necessary I think)
- If an App CR is deleted
  - Based on the map, remove finalizer from ACE
  - Remove from map

Additionally the current clean up logic should be modified to keep ACEs with finalizers.

Some corner cases:

We are testing a new version of an App so we manually modify its catalog and version to that. The finalizer gets removed, the cleanup logic deletes it. This is why the create logic should be added to the above algorithm, so that when we set back the App it will surely recreate the ACE.

This is not perfect tho, for example:

Some events can be lost when app-operator has issues
Some ACEs can get stuck when app-operator has issues
- For example app-operator marks with finalizer, app-operator fails / gets killed. Before app-operator comes back, the App CR changes it catalog/app/version so the given ACE sticks around forever

This is all a little complicated tho. @marians what is the urgency on this? If not super urgent, I would rather go with the decouple into a new operator and we do it when we get there. WDYT?

uvegla commented 2 years ago

My other thought is that letting multiple app-operators updating labels, list, counters whatever could end up in a serious, non-deterministic mess cos there is no guarantee whatsoever when / what order your writes get processed and if the resource changed between the read and the write in a single operator instance's perspective.

marians commented 2 years ago

There is no particular urgency on this. It's just a general problem with our app user experience.

ljakimczuk commented 2 years ago

My 2 cents. I may be controversial, but maybe we should wait with implementing something bigger till we have more time to do the architecture review and re-design. I mean, I like the 2nd idea because it gives the most holistic view on what's inside the Helm repositories, but also I don't like it because it feels like a lot of fuss to get the app metadata from repository. In such case I like the 1st option better, as it feels even more Kubernetes-like, I mean the finalizers, than the operator. But it also comes with its potential problems and I get it.

My other controversial thought is, do we need the ACE for installed apps at all? I mean, we get 5 latest ACEs created by unique app-operator and that's fine, we need to know beforehand what is offered from the catalog. Now, because the view on the screenshot presents the WC-scoped view, I presume the drop-down list contains the version of the app installed for this workload cluster, plus the versions available coming from ACEs, is that right @marians ? Meaning this list does not have the versions installed for other workload clusters, right? If this assumption is right, do we need to create the ACE for such, already installed app? App Operators should already have access to the index.yaml, see for example this part, and hence to all the information needed on the list. So, let's cut to the chase - can't we put this info into the App status? Better yet, the status already has the appVersion corresponding to upstream version on Happa's list, and the version I presume is already got from there, so we would need to add the release date only. If other information is needed we could also drop it there, after all ACE contains mostly the same fields as App CR.

uvegla commented 2 years ago

I kinda feel we should wait with this too. I think adding the info the App status is interesting, tho this way you kinda replicate the info all over the place.

On the other hand the Catalogs and the ACEs are also used in Happa for browsing the available Apps - /apps endpoint - and checking their READMEs, for example /apps/giantswarm/flux-app/0.15.1 which are nice features. Maybe those could also be solved in another way but having Catalogs and ACEs in some form in the cluster is nice to some extent, cos I dont have to "go away" from the cluster / know where these are coming from to see what is available (at least for now the last 5, well...).

marians commented 2 years ago

I just added info on another broken use case "As a user I want to view the entire app release history in the web UI." to emphasize that the catalog detail page's function to allow viewing the entire release history is also broken.

piontec commented 2 years ago

I think that we agree that we need a major change here and we won't be able to execute it anytime soon. As such, I really like the simple solution @ljakimczuk proposed: let's add the missing info to the App CR itself, on its creation. That way happa has all it needs and we won't have to do any big changes. Yes, we're replicating some info and potentially risking some inconsistencies, but I think we can live with that.

As for the full story of 'release history in happa', this will have to wait for the bigger rework, I'm afraid.

marians commented 2 years ago

Sounds good to me

weatherhog commented 1 month ago

@marians as this seems to be happa related, is this something that is still valid for the future and backstage? Or can this be closed?

marians commented 1 month ago

After glancing over it, I think this is going to remain valid once we port the function to upgrade an installed app to Backstage.

giantswarm / roadmap