roboll / helmfile

Deploy Kubernetes Helm Charts
MIT License

feat: Eliminate redundant repo updates #824

Open Nuru opened 5 years ago

Nuru commented 5 years ago

If we use a single helmfile.yaml whose helmfiles section refers to a bunch of other helmfiles, then running helmfile deps updates all the repos once for each referenced helmfile. This is a big waste of time considering that we have 20 helmfiles and 15 of them reference either the stable or incubator repo. Granted, we have made the problem a little worse by using public charts that sometimes refer to https://kubernetes-charts-incubator.storage.googleapis.com as kubernetes-incubator and sometimes as incubator, which then get treated as 2 separate repositories, but it is still a big pain and perhaps the source of several of the other feature requests suggesting ways to limit deps and repo updates.

Currently, for every helmfile, every repo gets updated, so we get 20 * 2 = 40 updates of incubator alone. It would probably take 10 minutes off our helmfile deps execution time if, in a single helmfile deps execution, each repo were only updated once.

sstarcher commented 5 years ago

Previously helmfile sync and helmfile diff had this behavior, but upgrading from 0.60.x to the latest of 0.80.x this behavior seems to have reverted.

mumoshu commented 5 years ago

@Nuru @sstarcher Thanks for reporting! Wouldn't #828 fully fix this? Am I correct?

Nuru commented 5 years ago

@mumoshu I do not fully understand what #828 does.

mumoshu commented 5 years ago

different aliases being treated as 2 separate repos

Ah good point! You are correct.

The fix is rather hard, though. We must fix it so that (1) we use the URL of each repo, rather than its name, to skip redundant updates:

https://github.com/roboll/helmfile/blob/master/pkg/app/context.go#L22-L30

And (2) we must enhance helmfile to save/restore the repository index cache itself instead of delegating it to helm:

$ ls ~/.helm/repository/cache/
brigade-index.yaml    flagger-index.yaml    flux-index.yaml       fluxcd-index.yaml     incubator-index.yaml  istio.io-index.yaml   local-index.yaml      sp-index.yaml         stable-index.yaml

It is not clear to me whether that would mean that, when a helmfile adds a new repo and triggers an update, the existing repos would be skipped even though they were already updated by an earlier helmfile.

So helmfile as of today triggers at most one helm repo up per helmfile.yaml. If helmfile triggers helm repo up for two separate helmfile.yaml files, the latter repo up doesn't skip the already-updated repos, as helm repo up doesn't support limiting the repos to be updated...

So if we're going to support this in helmfile, helmfile must be enhanced to filter the repos to be updated. Maybe helmfile should run helm repo up with $HELM_HOME pointed to a temporary directory set up by helmfile, and copy it into ~/.helm before running any helm command other than repo up?

Nuru commented 5 years ago

@mumoshu I think we have the same general ideas about how to fix this, but I have some different specific ideas. I have not had the time to fully learn and understand how helmfile works, but I understand it and helm a little better now, and with that I would attempt to solve this problem this way:

Part 1: do not try to solve for 1 repo having multiple names; just solve for 1 set of updates per invocation of helmfile, no matter how many helmfiles are in helmfile.d

As far as I can tell, helm repo add performs an update on the repo if it already exists, so I would leverage that behavior this way:

In a first pass through the helmfile.d for any operation that is going to require repos to be updated:

  1. Parse the repository section of all helmfiles, collecting all the repo definitions
  2. De-duplicate the list of repo definitions so you have only 1 definition per repo name. (By the way, what happens now if 2 different helmfiles give different URLs the same name? For example, one defines local as http://127.0.0.1:8879/charts and another defines local as https://my-chart-repo.my-co.io?)
  3. Run helm repo add for each repository "name" defined. So only one add per name, but at this level we will still add the same repo URL twice if it is defined twice with different names.
  4. Consider all repos updated
  5. Start the next pass through the helmfiles as is done now, but suppress any helm updates to repos for the rest of the run

If I'm wrong about add also doing update, then after adding all the repos, call helm repo up once.
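The first pass described above could be sketched like this in Go. The Repo struct, dedupeRepos, and the sample data are hypothetical illustrations, not actual helmfile internals:

```go
package main

import "fmt"

// Repo is a chart repository entry as parsed from a helmfile's
// "repositories" section (a minimal struct for illustration).
type Repo struct {
	Name string
	URL  string
}

// dedupeRepos collects repo definitions from all helmfiles and keeps
// only the first definition seen per name, reporting conflicts where
// two helmfiles map the same name to different URLs (the "local"
// question raised in step 2 above).
func dedupeRepos(all [][]Repo) (unique []Repo, conflicts []string) {
	seen := map[string]string{} // name -> URL
	for _, repos := range all {
		for _, r := range repos {
			if url, ok := seen[r.Name]; ok {
				if url != r.URL {
					conflicts = append(conflicts,
						fmt.Sprintf("repo %q defined as both %s and %s", r.Name, url, r.URL))
				}
				continue // name already handled; skip the redundant add
			}
			seen[r.Name] = r.URL
			unique = append(unique, r)
		}
	}
	return unique, conflicts
}

func main() {
	helmfileA := []Repo{{"incubator", "https://kubernetes-charts-incubator.storage.googleapis.com"}}
	helmfileB := []Repo{
		{"incubator", "https://kubernetes-charts-incubator.storage.googleapis.com"},
		{"local", "http://127.0.0.1:8879/charts"},
	}
	unique, conflicts := dedupeRepos([][]Repo{helmfileA, helmfileB})
	for _, r := range unique {
		// In the real flow this would be: helm repo add <name> <url>
		fmt.Printf("helm repo add %s %s\n", r.Name, r.URL)
	}
	for _, c := range conflicts {
		fmt.Println("warning:", c)
	}
}
```

The deduplicated list drives one helm repo add per name, after which every later pass through the helmfiles treats all repos as already updated.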

Part 2: solve for 1 repo having multiple names

Probably better to do this as a PR for helm, because otherwise you are creating a fragile dependency of helmfile on the internals of helm; the fix would benefit helm users and would be easier to implement in helm anyway. I understand the internals of helm even less than I do the internals of helmfile, but I hope that one or more of the helmfile contributors understands helm well enough to make this PR:

  1. Change the repository cache key that helm uses from <name>.yaml to <URL>.yaml or <URL encoding of URL>.yaml depending on how much you care about the URL having special characters in it.
  2. Change helm repo up to update caches by unique cache key rather than by name.

I think this will work without much disruption to the rest of helm. Helm would continue to have a repository.yaml entry for each repo name, so any action on a repo name would work as before. Since this would speed up helm repo up, I think the helm maintainers should accept it.
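Under the URL-encoding variant of step 1, the helm-side change could look roughly like this. cacheFileName is a hypothetical helper for illustration, not an actual helm function:

```go
package main

import (
	"fmt"
	"net/url"
)

// cacheFileName keys the repository index cache by the repo URL
// (URL-encoded so it is filesystem-safe) instead of the user-chosen
// repo name, so two aliases of one repo share a single cache entry.
func cacheFileName(repoURL string) string {
	return url.QueryEscape(repoURL) + "-index.yaml"
}

func main() {
	// Both the "incubator" and "kubernetes-incubator" aliases map to
	// the same cache file, so the repo is only ever updated once.
	fmt.Println(cacheFileName("https://kubernetes-charts-incubator.storage.googleapis.com"))
}
```

With the cache keyed this way, step 2 follows naturally: helm repo up can deduplicate updates by cache file rather than by name.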

Even without this "Part 2" solution, the "Part 1" will dramatically speed up my use case of having 20+ helmfiles.

mumoshu commented 5 years ago

As far as I can tell, helm repo add performs an update on the repo if it already exists, so I would leverage that behavior this way:

Wow, neat trick! I had not realized until now that helm repo add also updates the repo, but yeah, that would definitely work.

De-duplicate the list of repo definitions so you have only 1 definition per repo name. (By the way, what happens now if 2 different helmfiles give different URLs the same name? For example, one defines local as http://127.0.0.1:8879/charts and another defines local as https://my-chart-repo.my-co.io?)

I think we need to solve part 2 in some way in order to address this de-duplication.

The easiest way that comes to my mind is to enhance helmfile to internally translate the repo aliases defined and referenced within helmfile.yaml into globally unique repo IDs.

An example implementation of URLStringToRepoKey to illustrate it follows:

package main

import (
	"os"
	"regexp"
)

var r = regexp.MustCompile(`[^[:alnum:]]+`)

func URLStringToRepoKey(s string) string {
	return r.ReplaceAllString(s, "-")
}

func main() {
	k := URLStringToRepoKey(os.Args[1])
	println(k)
}

$ go run main.go https://fluxcd.github.io/flux
https-fluxcd-github-io-flux

helmfile.yaml like the below:

repositories:
- name: flagger
  url: https://flagger.app

releases:
- name: podinfo
  chart: flagger/podinfo

Will be internally transformed to:

repositories:
- name: https-flagger-app
  url: https://flagger.app

releases:
- name: podinfo
  chart: https-flagger-app/podinfo

WDYT?

Nuru commented 5 years ago

Your idea of internally translating helm repo names solves the problem of needing to make changes to the helm code, and it might be good enough, but I have 2 concerns.

First is that the names have to really be globally unique and avoid name collisions. We have some complicated repo URLs in our charts, for example git+https://github.com/stakater/Forecastle@deployments/kubernetes/chart?ref=8f36b82beaf2a1a42b364a3857bc83638c51e30b, and we want to make sure we handle these weird cases properly. That is why I suggested URL-encoding them.

package main

import (
    "net/url"
    "os"
)

func URLStringToRepoKey(s string) string {
    return url.QueryEscape(s)
}

func main() {
    k := URLStringToRepoKey(os.Args[1])
    println(k)
}

One downside of either of these renaming conventions is that the output from helm will contain the renamed repos, which could be confusing to users.

"https-flagger-app" has been added to your repositories
"git%2Bhttps%3A%2F%2Fgithub.com%2Fstakater%2FForecastle%40deployments%2Fkubernetes%2Fchart%3Fref%3D8f36b82beaf2a1a42b364a3857bc83638c51e30b" has been added to your repositories

Another downside is that users using both helmfile and helm will have 2 sets of repos that are not necessarily in sync. I don't know the practical impact of that, maybe it is not significant.

As much as I'd like to avoid duplicating individual repos, that seems like a harder problem than updating repos only once per invocation of helmfile, and I don't want the difficulty of the harder problem to delay fixing the easier one, especially since the easier one is much more impactful.

mumoshu commented 5 years ago

Another downside is that users using both helmfile and helm will have 2 sets of repos that are not necessarily in sync.

I believe you are correct.

I don't know the practical impact of that, maybe it is not significant.

One of the potential impacts would be that helm used outside of helmfile suffers from longer helm repo up runs. However, I believe that can be solved simply by helmfile setting an alternative HELM_HOME:

$ helm home
/Users/mumoshu/.helm

$ HELM_HOME=~/.helmfile/ helm home
/Users/mumoshu/.helmfile/

I'd appreciate it if you could share other relevant cases, but if that's the only problem, I'm fine with this helmfile-specific HELM_HOME option.

I don't want the difficulty of fixing this harder problem to delay fixing the easier one, especially since the easier one is much more impactful.

That's completely okay! If you could submit a pull request for the easier one, I will add improvements on top.

Nuru commented 5 years ago

Normally I'm happy to submit PRs, but I'm still very new to Go, so I don't think I will be able to produce a PR I'm comfortable with for quite a while. Creating a new processing pass for the helmfiles is simple in theory, but in practice, properly integrating it into helmfile requires a much deeper understanding of how it works than I have. And then I don't know how best to stop helm from automatically updating repos in later stages.

I will give it a try, and maybe it will be easier than I expect, but please don't wait for me.