helm lint performance issues

kopasiak commented 4 years ago

Output of helm version: v2.14.2

Output of kubectl version: v1.15.2

Cloud Provider/Platform (AKS, GKE, Minikube etc.): Minikube

We are using helm to deploy fairly large Open Source project called ONAP.

ONAP consists of submodules that can be deployed individually and one "collective chart" called onap that allows you to deploy all components at once. Since we started using our custom templates to simplify our charts we noticed that the time required to lint onap chart started growing really fast. Currently it takes around 20-30 minutes depending on the machine which is couple of times more than the cumulative time required to lint all our other charts.

I've tried to figure out sth on my own but I'm not a go specialist so thanks to my colleague I managed only to collect the CPU profile (cpu_profile_onap.zip) from linting.

We are not sure if this is a bug or it's just a price that you need to pay for using templates so would appreciate your feedback on this.

If you are willing to reproduce the results just do:

$ git clone git@github.com:onap/oom.git
$ cd oom/kubernetes
$ make SKIP_LINT=TRUE
$ helm lint onap

mattfarina commented 4 years ago

@kopasiak Thanks for bringing this up. I tried to reproduce the problem but end up with the following output:

make SKIP_LINT=TRUE

[common]

[common]
==> Linting common
[INFO] Chart.yaml: icon is recommended

1 chart(s) linted, 0 chart(s) failed
Successfully packaged chart and saved it to: /Users/mfarina/Code/deleteme/oom/kubernetes/dist/packages/common-5.0.0.tgz

[cassandra]
Error: no repository definition for @local. Please add them via 'helm repo add'
make[3]: *** [dep-cassandra] Error 1
make[2]: *** [cassandra] Error 2
make[1]: *** [make-common] Error 2
make: *** [common] Error 2

So, I'm unable to reproduce. Any pointers would be helpful. Is there a place one can download the packaged charts without needing to generate them?

I am surprised linting is taking so long. I wonder if there is an issue with recursion somewhere. There was a bug in Helm v3 we recently fixed that was caused by recursion. It may exist in Helm v2.

kopasiak commented 4 years ago

Ahhhhhh.... Sorry I forgot to mention that you need to execute: helm serve & and then add local repo, before building the charts.

You may be right that there is some recursion issue and the computational complexity is simply exploding somewhere. I noticed that even when a chart has 5 subcharts it lints significantly longer than charts that don't have any subcharts.

kopasiak commented 4 years ago

Hi,

is there anything that we could do to help to resolve this issue? We are now close to 2 hours of linting time which is far from optimal for our CI system...

I'm happy to provide any help or any debug logs (just tell me how) to get this fixed

mattfarina commented 4 years ago

@kopasiak I wasn't able to entirely use your Makefile but I was able to replicate most of the commands. In my local experience most of the time was in the helm dep up workflow because Helm checks for updates to the repos each time this happens.

The reason for this is that by default the stable repo is included in Helm v2. That repo is taking 30-45 seconds for me to update each time which happens each time helm dep up is called. It has a very large number of charts and chart versions which takes a long time to update. It has to do with the nuances of YAML parsing.

If you have the stable repo and don't need it I would suggest deleting it from the Helm registries before running all you CI. This may give you a large performance improvement.

kopasiak commented 4 years ago

@mattfarina That part is already done by us:(

Maybe I'll just pack my .helm dir together with my local chart repository so that you can just run: helm lint onap and see what's the issue? Would it make sense to you?

kopasiak commented 4 years ago

@mattfarina any update on this? Our lint takes now more than 2 hours which is insane

bacongobbler commented 4 years ago

@kopasiak have you had a chance to compile Helm locally to try and determine the cause? It sounds like @mattfarina has been unable to reproduce the issue on his end. I know he's been quite busy trying to push Helm through the CNCF graduation process recently. It's not likely he's had the time to dive into your issue specifically.

This document may come in handy should you choose to determine the root cause yourself: https://helm.sh/docs/community/developers/

Let us know if you find anything else that may provide more insight into the cause.

kopasiak commented 4 years ago

@bacongobbler I have compiled it locally and even collected the CPU data (see link in my first post) but to be completely honest with you I have 0 experience with helm internals and in general with go programming (but I can learn quickly;) so if you could please give me some hints what I should check or sth it would be very helpful.

So personally I expect that this is a problem of some computation complexity which causes a single chart to be validated multiple times but I don't know what I could do to check if that's true

kopasiak commented 4 years ago

@bacongobbler @mattfarina I've made a little bit more investigation and I discovered that most of the time that we need to lint the chart is spend in engine.renderWithReferences().

Recently in our charts we started using templates more widely to reduce the amount of boilerplate that we have between components. Unfortunately it looks like templates has significant performance penalty in helm. In my measurements the typical file that uses only simple "references" to values takes only us ~10 us to render while a file that has a few includes, tpl in values etc. takes way over seconds to render.

Are you guys aware of such performance penalty for extensive use of templates? Actually we thought that this would be the desired way of doing that as it really reduces the amount of boilerplate that we have

BTW I've also noticed that CPU usage while doing this render is relatively low as it uses only a single core. Have you considered making this template rendering multi-core?

kopasiak commented 4 years ago

@bacongobbler @mattfarina Just letting you know that I've just reproduced this issue with helm v3 (tested with v3.1.2)

electrocucaracha commented 4 years ago

Hey there,

I was able to reproduce this issue using a Helm 3 client (I couldn't compile the latest stable version 2) but I just used the version 2 to bootstrap the Tiller server required for OOM.

brew link helm@2 --force
git clone --recurse-submodules https://gerrit.onap.org/r/oom
cd oom/kubernetes
make repo

The problem occurs when the lint process is executed against the onap folder.

~/go/src/k8s.io/helm/bin/helm lint onap

As @kopasiak mentioned, it takes a lot of time to execute the renderWithReferences function.

One thing that I noticed is that OOM's charts have several non-yaml files (.xml, .sql, *.sh, etc.) which are included as part of the lint process. So I'm not sure if the recAllTpls function needs to exclude whatever that is defined on the .helmignore files to reduce the process to create the parent template.

technosophos commented 4 years ago

Where is the chart that is causing this problem? I took a quick glance at the ONAP repo and couldn't find a chart that looked like the one described above.

Has anyone reproduced on any other charts? This is the only report I have seen of such excessively long processing time for a lint

kopasiak commented 4 years ago

Where is the chart that is causing this problem? I took a quick glance at the ONAP repo and couldn't find a chart that looked like the one described above.

it's in kubernetes directory:

https://github.com/onap/oom/tree/master/kubernetes/onap

Has anyone reproduced on any other charts? This is the only report I have seen of such excessively long processing time for a lint

I'm not sure how all other people are using helm but what I've noticed in chart library is that charts are usually quite small. In our case, ONAP chart with all "requirements" enabled is really large (thousands of files).

To me it looks like if some operation (search/insert or sth) was done using wrong datatype (for example list instead of hash map) but it's just my guess. The place that looks suspicious to me is Render() in engine.go as a lot if time is spent there...

The other thing is that it looks that everything is processed using a single CPU which is even under a full load as we spend a lot of time just allocating and releasing memory.

technosophos commented 4 years ago

I am looking at this problem now. Admittedly, this is a lower-priority issue because it does not seem to impact most users. But I am currently focusing my sprint on helm lint, so this fits in with my workload. I might not be able to fix it, but hopefully I will at least be able to diagnose it.

kopasiak commented 4 years ago

@technosophos Awesome! Let me know if you need any more details

electrocucaracha commented 4 years ago

@technosophos correct me if I'm wrong but AFAIK the .helmignore files are used to exclude files from the Helm Chart creation right? shouldn't be better to also exclude them from the lint process as well?

technosophos commented 4 years ago

It might make sense to extend .helmignore to linting, yes. I'll look into that.

technosophos commented 4 years ago

make SKIP_LINT=true is failing with:

[robot]
Makefile:54: *** Submodule robot needs to be retrieved from gerrit.  See https://wiki.onap.org/display/DW/OOM+-+Development+workflow+after+code+transfer+to+tech+teams .  Stop.
make[1]: *** [submod-robot] Error 2
make: *** [robot] Error 2

Is that really something I need in order to test the linting?

kopasiak commented 4 years ago

You just need to clone including submodules:

$ git clone --recurse-submodules https://gerrit.onap.org/r/oom

technosophos commented 4 years ago

Ah! Thanks. I missed that in the previous set of directions.

technosophos commented 4 years ago

Wow! This chart has 2,739 template files! That might actually be bigger than the openstack chart.

kopasiak commented 4 years ago

Told you:D are there any awards for having the biggest chart ever?xD

technosophos commented 4 years ago

So at the end of the day, my conclusion is not terribly surprising to me. There are two major contributing factors:

It appears that you are deeply nesting duplicates of the same charts over and over again. How come? If you can figure out a way to reduce the deep nesting, it will help tremendously. If the common thing is all the same chart, you can save gobs of time and space by just including that once in the umbrella chart.
The real performance killer, though, is the tpl function. It is absolutely horrible for performance because each instance of it has to set up a sandboxed environment and then load all of the necessary templates into that environment. Helm is literally spending hours during lint setting up and tearing down those sandboxes.

For fun, I replaced the body of tpl with a simple function that returned "hello" and a counter. helm lint ran in 2 seconds, and tpl was executed 2276 times. Worst case in your current code: because of tpl, you would be loading about 6.2 million copies of the templates by creating 2276 sandboxes and loading 2739 templates into each sandbox.

tpl is evil.

While there is probably some room for optimization in the way the tpl sandbox is created, it's not something i personally have time to do right now. Feel free to dive in there. The best place to start is the tpl function in engine.go. But I would guess that you could get fast results by doing the two things I suggested above.

github-actions[bot] commented 4 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

bridgetkromhout commented 4 years ago

Hi, @kopasiak - thanks again for reaching out. Along with the advice in https://github.com/helm/helm/issues/7656#issuecomment-620890762, you probably want to look at moving to Helm v3 now if you are still using v2 as implied in "used the version 2 to bootstrap the Tiller server required for OOM". The reason is that Helm v2 is going to be out of support very soon, so it will not be possible to get bugfixes or security patches. See https://helm.sh/blog/helm-v2-deprecation-timeline/ for more details.

efiacor commented 2 years ago

Looking to resurrect this thread as our linting time is now ~11hrs. Can anyone assist with a solution here as we are killing our CI environments with this.

ckfahad commented 2 years ago

sudo apt install make SKIP_LINT=TRUE

i am trying to add a local helm repo

Unable to locate package SKIP_LINT i am getting this error

please guide

helm / helm

helm lint performance issues #7656