Neither of these approaches would speed up the 1st call, but either one would make the cost of the 2nd and subsequent calls to GetAvailablePackageDetail insignificant.
Yep. Another possibility is to do something similar to helm, where a background go-routine is started that caches the files for the latest few versions of each chart. That way clicking on an available app summary would show the details very quickly, whereas if the user then switches to an older version they get a 2s delay, which is OK.
Yeah, I thought about that one yesterday too. I thought our helm support caches all chart versions. If I remember correctly, the bitnami repo has around 14K versions and I just thought 'heck no': caching 14K things when in most cases you might need just one or two packages is terrible signal-to-noise wise. But also, if I remember correctly, there are something like 100 or so distinct packages in the bitnami repo, so caching the 2-3 latest versions would result in 200-300 things being cached, which is not great but of course much better than caching 14K. So I'll consider this approach as well.
Side note: this is where the exact implementation of 'caching' is actually pretty important. It's not so bad to have 14K rows in a SQL table (on disk). Who cares if you only ever need one or two rows - disk space is cheap. But the same set of data stored in memory, in-process, is kind of bad - process memory footprint is very important.
I've been thinking about this on and off for the past few days. Here is where I currently stand (hint: not quite where I'd like to be). One could argue there are two use cases here to speed up: 1) "first click", when a user clicks on the "Catalog" and then clicks on one of the packages to get the details for the very first time, and 2) "second or subsequent click", the second or subsequent time the user clicks on the same chart to get the details again.
Ideally, I would really love to solve (1), which obviously would also solve (2). It's kind of a "first impression" thing. I don't want there to be an (however misguided or superficial) impression by the end user that flux is somehow slower than say direct-helm and therefore "inferior". However, I can't think of an elegant solution to (1) yet :-( Here is why: I don't want to go down the same path the direct-helm plugin uses to address this issue, which is to download and cache all the charts locally, all done in the background once the repo has been added. Even if we are talking about having the 2-3 latest versions of the charts, it's still hundreds of charts cached for the bitnami repo. Why don't I like that? Because in my mind, the "proper" way to get charts with flux is to create a HelmChart CR, wait for it to reconcile and then get the chart tarball cached locally in the cluster. I don't like this idea of having to create, say, 300 CRs when a user adds the bitnami repo. As in, at the end of this one could do "kubectl get helmchart" and see 300 entries. Yikes. It "feels" heavyweight and bad. There could be all sorts of hooks or watchers configured to react to new CRs and I would just overwhelm them.
I've thought about skipping HelmChart CRs and downloading tarballs directly, as linked to by the index.yaml, and caching their contents, but that also feels wrong because I am bypassing a layer that, in theory, is there to make life easier.
If I had to solve just (2) by itself, that I could do fairly elegantly. My preference would be to implement a garbage collector that deletes all but the N most recently accessed charts. It would be a relatively simple solution to implement, and light on process footprint too.
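For concreteness, here is a minimal Go sketch of the kind of garbage collector I have in mind; all the names (chartEntry, gcCache, etc.) are hypothetical, and real code would also delete the cached chart data itself, not just the bookkeeping entry:

```go
package cache

import (
	"sort"
	"sync"
	"time"
)

// chartEntry is a hypothetical bookkeeping record for one cached chart version.
type chartEntry struct {
	key        string    // e.g. "bitnami/apache:8.9.6"
	lastAccess time.Time // updated on every cache hit
}

// gcCache keeps at most maxEntries charts; collect() evicts the rest.
type gcCache struct {
	mu         sync.Mutex
	entries    map[string]*chartEntry
	maxEntries int
}

// touch records an access, inserting a bookkeeping entry if needed.
func (c *gcCache) touch(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &chartEntry{key: key}
		c.entries[key] = e
	}
	e.lastAccess = time.Now()
}

// collect deletes all but the maxEntries most recently accessed charts.
func (c *gcCache) collect() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.entries) <= c.maxEntries {
		return
	}
	all := make([]*chartEntry, 0, len(c.entries))
	for _, e := range c.entries {
		all = append(all, e)
	}
	// Sort newest-first so the tail is the least recently accessed.
	sort.Slice(all, func(i, j int) bool {
		return all[i].lastAccess.After(all[j].lastAccess)
	})
	for _, e := range all[c.maxEntries:] {
		delete(c.entries, e.key) // real code would also drop the chart data
	}
}
```

collect() could then be invoked from a ticker goroutine, or opportunistically after every N insertions.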
But again, I really want to solve (1), just haven't found an elegant way to do it yet. Thoughts are welcome. Cc @dlaloue-vmware
I'd say this issue is totally applicable to the helm-direct plugin too. Currently, we are solving (2), but the delay is only noticeable when installing/upgrading a chart for the very first time.
I've just started looking into this issue today, so I still don't have a good foundation for the whole thing. However, I'm sharing some thoughts:
I haven't had a chance to think about it carefully, but it seems quite odd to me that we do have a CR for the appRepos spinning up several jobs... and yet we end up fetching and parsing the index.yaml by hand each time we deploy something.
I don't think we should cache everything indefinitely in memory as we are currently doing; at the very least we should implement a kind of LRU cache to prevent the memory from growing in O(#repos * avgRepoSize) (see the sketch after these notes).
So this issue is pretty similar to what I also wanted to solve (e.g. see https://github.com/kubeapps/kubeapps/pull/3677) (well, mine is less ambitious, just reducing the memory consumption in our current scenario), so I'm also really interested in the thoughts from you all here :P
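To illustrate the LRU idea (a minimal sketch, assuming an off-the-shelf implementation such as github.com/hashicorp/golang-lru; the key and payload here are made up):

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// Cap the in-memory cache at, say, 10 repo indexes; adding an 11th
	// silently evicts the least recently used one.
	cache, err := lru.New(10)
	if err != nil {
		panic(err)
	}
	cache.Add("bitnami", []byte("...parsed index.yaml...")) // hypothetical payload
	if v, ok := cache.Get("bitnami"); ok {
		fmt.Printf("cache hit: %d bytes\n", len(v.([]byte)))
	}
}
```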
Ideally, I would really love to solve (1), which obviously would also solve (2). It's kind of a "first impression" thing.
+1, if possible, but there may be other options that improve the perceived time rather than the actual time taken.
I don't want there to be an (however misguided or superficial) impression by the end user that flux is somehow slower than say direct-helm and therefore "inferior".
The premise of that thought wouldn't even enter my mind... let's just focus on helping each other improve the product; you shouldn't be worried about this, IMO.
I don't like this idea of having to create, say, 300 CRs when a user adds the bitnami repo. As in, at the end of this one could do "kubectl get helmchart" and see 300 entries. Yikes. It "feels" heavyweight and bad. There could be all sorts of hooks or watchers configured to react to new CRs and I would just overwhelm them.
Totally agree - this doesn't appear to be an option that can be considered. Even if we added them a few at a time, cached, then removed them, it'd still generate all the churn.
I've thought about skipping HelmChart CRs and downloading tarballs directly, as linked to by the index.yaml, and caching their contents, but that also feels wrong because I am bypassing a layer that, in theory, is there to make life easier.
Not 100% sure about this one. That layer (flux) is there for installing packages from repositories. It's not there for displaying a browsable catalog of packages. So I don't think that layer is necessarily there to help you solve this problem. I wouldn't see an issue with a go-routine that begins extracting and caching the readme etc. of the latest version for each package (only).
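Something like this minimal sketch, say (warmLatestVersions, chartFiles and the fetch/store helpers are all hypothetical names, not existing kubeapps code):

```go
package cache

import "log"

// chartFiles holds the detail files we'd want pre-cached (hypothetical type).
type chartFiles struct {
	Readme, Values, Schema string
}

// warmLatestVersions sketches the background warm-up: for each package in
// the repo index, fetch and cache the files of the latest version only.
// fetch and store are assumed helpers supplied by the caller.
func warmLatestVersions(packages []string,
	fetch func(pkg string) (chartFiles, error),
	store func(pkg string, f chartFiles)) {
	go func() {
		for _, pkg := range packages {
			files, err := fetch(pkg)
			if err != nil {
				log.Printf("warm-up for %s failed: %v", pkg, err)
				continue // a miss here just means a slower first click later
			}
			store(pkg, files)
		}
	}()
}
```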
But again, I really want to solve (1), just haven't found an elegant way to do it yet. Thoughts are welcome. Cc @dlaloue-vmware
Just some thoughts about perceived time: If we did have (just) the metadata of the latest chart version for each package in a repo already cached, the initial display would be without delay and if the user changes versions, we can update some data on the page immediately (version mainly) while replacing the other metadata in the background, so that it's much less of an issue (ie. not waiting 2s while a spinner spins).
I've thought about skipping HelmChart CRs and downloading tarballs directly, as linked to by the index.yaml, and caching their contents, but that also feels wrong because I am bypassing a layer that, in theory, is there to make life easier.
I agree with Michael on this. The HelmChart CR serves a function for the HelmRelease; I do not think you are really using any of its features. I would think it is OK to bypass the CR.
Just some thoughts about perceived time: If we did have (just) the metadata of the latest chart version for each package in a repo already cached, the initial display would be without delay and if the user changes versions, we can update some data on the page immediately (version mainly) while replacing the other metadata in the background, so that it's much less of an issue (ie. not waiting 2s while a spinner spins).
I was thinking along the same lines. In the UI, a user will select a package from the catalog, which means the UI already has all the "summary" information. We could make sure that the UI starts rendering the page with the summary first and then renders the details once they are available. If the chart itself is already cached, the details should arrive quickly. If it goes beyond a threshold, the UI can display a message that the chart is being loaded. I think this is acceptable, as users know the chart is located in a remote repository.
The helm CLI does the same: it caches the index.yaml, and then pulls the tar files only when needed.
Thank you everyone for your feedback. Some specific things:
I'd say this issue is totally applicable to the helm-direct plugin too. Currently, we are solving (2)
I was under the impression helm-direct was solving (1) as well.
I don't think we should cache everything indefinitely in memory as we are currently doing
Are you talking about direct-helm? I wasn't aware of any cache there that stores things in memory
So this issue is pretty similar to what I also wanted to solve
I don't know about that. This issue doesn't have to do with parsing the repo index. Are you referring to data stored in postgresql, or some other cache I am not aware of?
The premise of that thought wouldn't even enter my mind...
It wasn't your mind I was worried about. It was the end user, or customer, or anyone in general who doesn't know how "the sausage is made", so to speak.
Not 100% sure about this one. That layer (flux) is there for installing packages from repositories. It's not there for displaying a browsable catalog of packages. So I don't think that layer is necessarily there to help you solve this problem. I wouldn't see an issue with a go-routine that begins extracting and caching the readme etc. of the latest version for each package (only).
Yes, both you and Dimitri are essentially saying that. I am going to try and work something out along these lines.
Are you talking about direct-helm? I wasn't aware of any cache there that stores things in memory
Yep, I was talking about a sort of cache of the index.yaml [1] which, AFAIK, is used only by helm-direct each time it does fetchRepoIndex, which is invoked each time we create or upgrade a helm release [2].
[1] https://github.com/kubeapps/kubeapps/blob/master/pkg/chart/chart.go#L155 [2] https://github.com/kubeapps/kubeapps/blob/master/cmd/kubeapps-apis/plugins/helm/packages/v1alpha1/server.go#L1018
So this issue is pretty similar to what I also wanted to solve
I don't know about that. This issue doesn't have to do with parsing the repo index. Are you referring to data stored in postgresql, or some other cache I am not aware of?
I meant that we may need to also cache the charts themselves, even in the helm-direct plugin, to avoid re-fetching the index.yaml again and again.
Yep, I was talking about a sort of cache of the index.yaml [1] which, AFAIK, is used only by helm-direct each time it does fetchRepoIndex, which is invoked each time we create or upgrade a helm release [2].
Gasp! I didn't realize we had a global variable map that stores a potentially unbounded number of entries that never expire. And this is in addition to the postgresql that stores the indexes too. Yikes. I thought we only had postgresql to fetch the data from, but I was wrong. You're right. This is a very bad anti-pattern and should be fixed right away!
Ahem, full disclosure: turns out I do the same with the redis cache, which at some point I made a note to fix (I vaguely remember Michael and me discussing putting a TTL on cache entries, but I dropped the ball on that), so I put a TODO in there. I promise to fix this very soon in the redis cache. I know it's bad. It's a different issue than this one, because this one merely ponders whether or not we should cache chart details for the flux plugin. Nevertheless, it just became my top priority.
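For reference, go-redis supports an expiration argument on SET, so the fix could be as small as this sketch (the 24h TTL is an arbitrary example value, and the key just mirrors the shape that shows up later in this thread):

```go
package main

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Passing an expiration to SET means stale entries age out on their
	// own instead of accumulating forever.
	err := rdb.Set(ctx, "helmcharts:default:bitnami/apache:8.9.6",
		[]byte("...chart details..."), 24*time.Hour).Err()
	if err != nil {
		panic(err)
	}
}
```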
Guys, FYI, an update: I have implemented one tentative approach where, when you add a new repo source, it caches the parsed index (as it already does) but now also the details of only THE LATEST version of each chart (hopefully that's the one that would be displayed in the catalog by default, so it looks instantaneous - a cache hit). Just FYI, the chart details are the README, values and schema files. Not zipped or anything, just plain text. So, after bitnami is added to flux, I end up with about 100 keys in Redis (one for the helmrepository, entry 66 in the list below, and the rest are details for each chart):
$ rdcli -h localhost --port 6379 --auth $REDIS_PWD "keys" "*"
1) helmcharts:default:bitnami/grafana:7.2.4
2) helmcharts:default:bitnami/jupyterhub:0.3.1
3) helmcharts:default:bitnami/grafana-operator:1.5.1
4) helmcharts:default:bitnami/spark:5.7.10
5) helmcharts:default:bitnami/kong:4.1.9
6) helmcharts:default:bitnami/spring-cloud-dataflow:4.2.0
7) helmcharts:default:bitnami/mariadb:10.0.2
8) helmcharts:default:bitnami/nats:7.0.1
9) helmcharts:default:bitnami/jenkins:8.0.18
10) helmcharts:default:bitnami/rabbitmq:8.24.7
11) helmcharts:default:bitnami/mongodb-sharded:3.9.15
12) helmcharts:default:bitnami/kubernetes-event-exporter:1.2.3
13) helmcharts:default:bitnami/kubeapps:7.5.10
14) helmcharts:default:bitnami/prestashop:13.3.0
15) helmcharts:default:bitnami/redis:15.6.1
16) helmcharts:default:bitnami/wavefront-prometheus-storage-adapter:1.0.12
17) helmcharts:default:bitnami/concourse:0.1.13
18) helmcharts:default:bitnami/dataplatform-bp2:10.0.0
19) helmcharts:default:bitnami/tensorflow-resnet:3.2.20
20) helmcharts:default:bitnami/redmine:17.1.0
21) helmcharts:default:bitnami/oauth2-proxy:1.1.1
22) helmcharts:default:bitnami/wavefront:3.1.17
23) helmcharts:default:bitnami/node:16.1.0
24) helmcharts:default:bitnami/owncloud:10.3.0
25) helmcharts:default:bitnami/osclass:12.0.0
26) helmcharts:default:bitnami/kibana:9.1.2
27) helmcharts:default:bitnami/cert-manager:0.1.25
28) helmcharts:default:bitnami/phpmyadmin:8.3.0
29) helmcharts:default:bitnami/wavefront-adapter-for-istio:1.0.12
30) helmcharts:default:bitnami/rabbitmq-cluster-operator:2.0.0
31) helmcharts:default:bitnami/ejbca:3.1.0
32) helmcharts:default:bitnami/orangehrm:10.2.0
33) helmcharts:default:bitnami/dokuwiki:11.2.13
34) helmcharts:default:bitnami/pytorch:2.3.19
35) helmcharts:default:bitnami/minio:9.2.1
36) helmcharts:default:bitnami/dataplatform-bp1:9.0.0
37) helmcharts:default:bitnami/common:1.10.1
38) helmcharts:default:bitnami/postgresql:10.13.8
39) helmcharts:default:bitnami/kafka:14.4.1
40) helmcharts:default:bitnami/kubewatch:3.2.20
41) helmcharts:default:bitnami/apache:8.9.6
42) helmcharts:default:bitnami/airflow:11.1.8
43) helmcharts:default:bitnami/parse:15.0.14
44) helmcharts:default:bitnami/testlink:9.3.0
45) helmcharts:default:bitnami/argo-workflows:0.1.8
46) helmcharts:default:bitnami/joomla:10.2.1
47) helmcharts:default:bitnami/tomcat:9.6.0
48) helmcharts:default:bitnami/memcached:5.15.8
49) helmcharts:default:bitnami/aspnet-core:2.0.0
50) helmcharts:default:bitnami/consul:10.1.3
51) helmcharts:default:bitnami/sonarqube:0.1.0
52) helmcharts:default:bitnami/opencart:10.1.0
53) helmcharts:default:bitnami/nginx-ingress-controller:9.0.8
54) helmcharts:default:bitnami/postgresql-ha:8.0.1
55) helmcharts:default:bitnami/node-exporter:2.3.15
56) helmcharts:default:bitnami/phpbb:10.2.0
57) helmcharts:default:bitnami/contour:7.0.3
58) helmcharts:default:bitnami/ghost:15.1.2
59) helmcharts:default:bitnami/mongodb:10.29.4
60) helmcharts:default:bitnami/grafana-tempo:0.2.14
61) helmcharts:default:bitnami/suitecrm:9.4.1
62) helmcharts:default:bitnami/elasticsearch:17.3.3
63) helmcharts:default:bitnami/nginx:9.5.14
64) helmcharts:default:bitnami/harbor:11.1.0
65) helmcharts:default:bitnami/zookeeper:7.4.12
66) helmrepositories:default:bitnami
67) helmcharts:default:bitnami/mariadb-galera:6.0.5
68) helmcharts:default:bitnami/mxnet:2.3.22
69) helmcharts:default:bitnami/wildfly:12.0.0
70) helmcharts:default:bitnami/keycloak:5.2.0
71) helmcharts:default:bitnami/magento:19.0.9
72) helmcharts:default:bitnami/fluentd:4.4.1
73) helmcharts:default:bitnami/jasperreports:11.1.0
74) helmcharts:default:bitnami/kiam:0.3.20
75) helmcharts:default:bitnami/drupal:10.4.7
76) helmcharts:default:bitnami/argo-cd:2.0.12
77) helmcharts:default:bitnami/geode:0.3.4
78) helmcharts:default:bitnami/mysql:8.8.12
79) helmcharts:default:bitnami/cassandra:9.0.5
80) helmcharts:default:bitnami/kube-state-metrics:2.1.17
81) helmcharts:default:bitnami/wavefront-hpa-adapter:1.0.3
82) helmcharts:default:bitnami/external-dns:5.5.0
83) helmcharts:default:bitnami/wordpress:12.2.1
84) helmcharts:default:bitnami/haproxy:0.2.18
85) helmcharts:default:bitnami/influxdb:2.4.1
86) helmcharts:default:bitnami/mediawiki:12.4.2
87) helmcharts:default:bitnami/kube-prometheus:6.4.1
88) helmcharts:default:bitnami/etcd:6.10.3
89) helmcharts:default:bitnami/solr:2.1.3
90) helmcharts:default:bitnami/redis-cluster:7.0.11
91) helmcharts:default:bitnami/metrics-server:5.10.10
92) helmcharts:default:bitnami/logstash:3.6.17
93) helmcharts:default:bitnami/discourse:5.1.0
94) helmcharts:default:bitnami/contour-operator:0.2.0
95) helmcharts:default:bitnami/odoo:20.1.2
96) helmcharts:default:bitnami/metallb:2.5.11
97) helmcharts:default:bitnami/thanos:8.1.0
98) helmcharts:default:bitnami/moodle:11.2.1
The whole thing ends up taking about 25MB of Redis memory. Obviously, any chart version not in the cache that the user requests (a cache-miss scenario) would take longer to load the first time (about 2-3 seconds, based on what I've seen so far), and then be as fast as above.
Just checking whether this sounds like a reasonable approach and I should continue, or try something else.
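(For readers following along: the key shape in the listing appears to be helmcharts:&lt;namespace&gt;:&lt;repo&gt;/&lt;chart&gt;:&lt;version&gt;; this tiny helper is just my reading of it, not actual kubeapps code.)

```go
package cache

import "fmt"

// chartCacheKey reproduces the key shape visible in the listing above,
// e.g. chartCacheKey("default", "bitnami/apache", "8.9.6") yields
// "helmcharts:default:bitnami/apache:8.9.6".
func chartCacheKey(namespace, repoAndChart, version string) string {
	return fmt.Sprintf("helmcharts:%s:%s:%s", namespace, repoAndChart, version)
}
```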
Wow, just 25MB, that's great. This approach will speed up the user-perceived app performance, as users will see something as soon as they select something. From a UX point of view, I think it's OK to let a user wait a few seconds if they explicitly select a version, so as far as I'm concerned it's a nice cache approach.
That said, I just wonder (not meaning we should or shouldn't do it) whether we should make this "number of cached versions" configurable. Like "latest version + the N previous ones", or "latest + the N previous ones, and then LRU based on user selection". Also, perhaps we can add the currently installed version to the cache (at runtime), so that when the user fetches the package details, the response time improves for subsequent calls.
OK, thanks, Antonio, that's encouraging. Sounds like I'm on the right track. If I can cache 1, I can cache N; that's not too big a leap. I am considering another variant of this solution that would cache the chart .tgz files instead of plain text. The chart details are pretty much all text files (such as the README, the chart manifest yaml, etc.), which compress REALLY well. What is 140KB uncompressed might end up being less than 10KB compressed, so caching, say, 300 charts may not be a big deal if compressed. It's a typical time/space tradeoff: we would save a lot of space and lose a little bit of time, as it takes more cycles to uncompress data on the fly. Seems like a reasonable tradeoff to me. I will get back to you when I have the numbers. BTW, this is the same thing flux does when you create an instance of the HelmChart CRD: it caches the chart .tgz file in the cluster.
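A minimal stdlib sketch of the compress-on-write / decompress-on-read tradeoff being described (the payload is made-up text standing in for a README):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Stand-in for a README/values/schema payload: repetitive text
	// compresses very well, which is the whole point of the tradeoff.
	plain := []byte(strings.Repeat("# Apache\nSome README prose...\n", 2000))

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(plain)
	zw.Close()
	fmt.Printf("plain: %d bytes, gzipped: %d bytes\n", len(plain), buf.Len())

	// The cost side of the tradeoff: decompressing on every cache hit.
	zr, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		panic(err)
	}
	out, _ := io.ReadAll(zr)
	fmt.Println("round-trip ok:", bytes.Equal(out, plain))
}
```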
Nice work, Greg! I agree, the main thing we want to solve is the initial load time, so it's much less of an issue that a user needs to wait when selecting a different version (a cache miss, which I assume is then added to the cache).
Out of interest, how long does it take to cache just the latest versions? (I assume it does this after the plugin is registered?)
Thank you, Michael. I have the numbers now from my latest experiment: bitnami + the 100 latest chart versions in the original format (.tgz) is 18MB, vs 25MB uncompressed. I thought there'd be a bigger gain, but maybe the cache entry for the bitnami repo (the parsed model that includes metadata for all 14K versions) is the big elephant in the room, so everything else is small in comparison. Oh well, I still like that approach. To answer your last question, I have one background worker right now syncing charts one by one, sequentially, so the first chart starts at
I1129 01:58:26.622102 1 chart_cache.go:193] +syncHandler(helmcharts:default:bitnami/airflow:11.1.8)
and the last one ends at
I1129 01:59:27.937374 1 chart_cache.go:243] -syncHandler(helmcharts:default:bitnami/zookeeper:7.4.12)
So that's a minute. Obviously this work can and will benefit from parallelization: I will have several (5 or 10) workers doing this concurrently, so that number is going to change.
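A minimal sketch of that fan-out, using a plain channel-fed worker pool (syncCharts and the syncHandler parameter are hypothetical; the latter just mirrors the name in the log lines above):

```go
package cache

import (
	"log"
	"sync"
)

// syncCharts fans the per-chart sync work out to a fixed number of
// workers instead of one sequential loop; syncHandler stands in for the
// real per-chart work.
func syncCharts(keys []string, workers int, syncHandler func(key string) error) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range jobs {
				if err := syncHandler(key); err != nil {
					log.Printf("sync of %s failed: %v", key, err)
				}
			}
		}()
	}
	for _, k := range keys {
		jobs <- k
	}
	close(jobs)
	wg.Wait()
}
```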
Here are a few more numbers I just collected from my MacBook. Helm plugin GetAvailablePackageDetail:
$ time grpcurl -plaintext -d '{"available_package_ref": {"context": {"cluster": "default", "namespace": "kubeapps"}, "plugin": {"name": "helm.packages", "version": "v1alpha1"}, "identifier": "bitnami/apache"}}' -H "Authorization: Bearer $TOKEN" localhost:8080 kubeappsapis.core.packages.v1alpha1.PackagesService.GetAvailablePackageDetail
{
"availablePackageDetail": {
"availablePackageRef": {
"context": {
"namespace": "kubeapps"
},
"identifier": "bitnami/apache",
"plugin": {
"name": "helm.packages",
"version": "v1alpha1"
}
},
... skipped ...
real 0m0.168s
user 0m0.024s
sys 0m0.017s
Flux plugin GetAvailablePackageDetail (cache hit) on my local branch with the implementation described above:
$ time grpcurl -plaintext -d '{"available_package_ref": {"context": {"cluster": "default", "namespace": "default"}, "plugin": {"name": "fluxv2.packages", "version": "v1alpha1"}, "identifier": "bitnami/apache"}}' -H "Authorization: Bearer $TOKEN" localhost:8080 kubeappsapis.core.packages.v1alpha1.PackagesService.GetAvailablePackageDetail
{
"availablePackageDetail": {
"availablePackageRef": {
"context": {
"cluster": "default",
"namespace": "default"
},
"identifier": "bitnami/apache",
"plugin": {
"name": "fluxv2.packages",
"version": "v1alpha1"
}
},
... skipped ...
real 0m0.223s
user 0m0.026s
sys 0m0.014s
Flux plugin GetAvailablePackageDetail (cache miss):
$ time grpcurl -plaintext -d '{"available_package_ref": {"context": {"cluster": "default", "namespace": "default"}, "plugin": {"name": "fluxv2.packages", "version": "v1alpha1"}, "identifier": "bitnami/apache"}, "pkg_version" : "8.7.0"}' -H "Authorization: Bearer $TOKEN" localhost:8080 kubeappsapis.core.packages.v1alpha1.PackagesService.GetAvailablePackageDetail
{
"availablePackageDetail": {
"availablePackageRef": {
"context": {
"cluster": "default",
"namespace": "default"
},
"identifier": "bitnami/apache",
"plugin": {
"name": "fluxv2.packages",
"version": "v1alpha1"
}
},
... skipped ...
real 0m0.458s
user 0m0.024s
sys 0m0.017s
Looks OK to me
Just moving this from the PR discussion to an actual issue, as requested by @gfichtenholt, where Greg said:
This was in response to my comment that: