kopasiak closed this issue 4 years ago
@kopasiak Thanks for bringing this up. I tried to reproduce the problem but ended up with the following output:
make SKIP_LINT=TRUE
[common]
[common]
==> Linting common
[INFO] Chart.yaml: icon is recommended
1 chart(s) linted, 0 chart(s) failed
Successfully packaged chart and saved it to: /Users/mfarina/Code/deleteme/oom/kubernetes/dist/packages/common-5.0.0.tgz
[cassandra]
Error: no repository definition for @local. Please add them via 'helm repo add'
make[3]: *** [dep-cassandra] Error 1
make[2]: *** [cassandra] Error 2
make[1]: *** [make-common] Error 2
make: *** [common] Error 2
So, I'm unable to reproduce. Any pointers would be helpful. Is there a place one can download the packaged charts without needing to generate them?
I am surprised linting is taking so long. I wonder if there is an issue with recursion somewhere. There was a bug in Helm v3 we recently fixed that was caused by recursion. It may exist in Helm v2.
Ahhhhhh....
Sorry, I forgot to mention that you need to run:
helm serve &
and then add the local repo (helm repo add local http://127.0.0.1:8879/charts) before building the charts.
You may be right that there is some recursion issue and the computational complexity is simply exploding somewhere. I noticed that even a chart with only 5 subcharts takes significantly longer to lint than charts that don't have any subcharts.
Hi,
is there anything we could do to help resolve this issue? We are now close to 2 hours of linting time, which is far from optimal for our CI system...
I'm happy to provide any help or debug logs (just tell me how) to get this fixed.
@kopasiak I wasn't able to use your Makefile in its entirety, but I was able to replicate most of the commands. In my local experience most of the time was spent in the helm dep up workflow, because Helm checks for updates to the repos each time it runs. The reason is that the stable repo is included by default in Helm v2. That repo takes 30-45 seconds for me to update, and the update happens every time helm dep up is called. The repo has a very large number of charts and chart versions, which takes a long time to process; it has to do with the nuances of YAML parsing.
If you have the stable repo and don't need it, I would suggest removing it (helm repo remove stable) before running your CI. This may give you a large performance improvement.
@mattfarina We already do that :(
Maybe I should just pack my .helm dir together with my local chart repository, so that you can run:
helm lint onap
and see the issue for yourself? Would that make sense to you?
@mattfarina any update on this? Our lint now takes more than 2 hours, which is insane.
@kopasiak have you had a chance to compile Helm locally to try and determine the cause? It sounds like @mattfarina has been unable to reproduce the issue on his end. I know he's been quite busy trying to push Helm through the CNCF graduation process recently. It's not likely he's had the time to dive into your issue specifically.
This document may come in handy should you choose to determine the root cause yourself: https://helm.sh/docs/community/developers/
Let us know if you find anything else that may provide more insight into the cause.
@bacongobbler I have compiled it locally and even collected the CPU data (see the link in my first post), but to be completely honest with you I have zero experience with Helm internals and with Go programming in general (though I can learn quickly ;)). If you could give me some hints about what to check, that would be very helpful.
Personally, I suspect this is a computational-complexity problem that causes a single chart to be validated multiple times, but I don't know how to check whether that's true.
@bacongobbler @mattfarina I've done a bit more investigation and discovered that most of the chart's lint time is spent in engine.renderWithReferences().
Recently we started using templates more widely in our charts to reduce the amount of boilerplate we have between components. Unfortunately, it looks like templates carry a significant performance penalty in Helm. In my measurements, a typical file that uses only simple "references" to values takes only ~10 µs to render, while a file that has a few includes, tpl in values, etc. takes well over a second to render.
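The gap reported here is consistent with a large fixed cost per tpl call. As a rough illustration (a hypothetical Python cost model, not Helm code; every number except the ~10 µs simple-render figure above is an assumption):

```python
# Illustrative cost model only -- not Helm's implementation.
TEMPLATES_IN_CHART = 2739   # assumed number of templates in the chart
LOAD_COST_US = 10           # assumed cost to load one template into a sandbox
RENDER_COST_US = 10         # ~10 us, as measured for a simple reference

def simple_render_us(n_files):
    # Plain value references: constant work per file.
    return n_files * RENDER_COST_US

def tpl_render_us(n_tpl_calls):
    # Each tpl call pays to rebuild a sandbox holding every template.
    return n_tpl_calls * (TEMPLATES_IN_CHART * LOAD_COST_US + RENDER_COST_US)

print(simple_render_us(5))      # 50 (microseconds)
print(tpl_render_us(5))         # 137000 (microseconds, i.e. ~0.14 s)
```

Under this model the per-call sandbox setup, not the rendering itself, dominates as the chart grows.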
Are you aware of such a performance penalty for extensive use of templates? We actually thought this would be the recommended approach, since it really reduces the amount of boilerplate we have.
BTW, I've also noticed that CPU usage during this render is relatively low, as it uses only a single core. Have you considered making template rendering multi-core?
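For what it's worth, the multi-core idea is easy to prototype outside Helm. A generic worker-pool sketch (Python for illustration only; render is a hypothetical stand-in for rendering one file, and this is only safe if files render independently):

```python
from concurrent.futures import ThreadPoolExecutor

def render(name):
    # Hypothetical stand-in for rendering a single template file.
    return f"rendered:{name}"

files = [f"templates/file{i}.yaml" for i in range(8)]

# Fan the files out across worker threads; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    rendered = list(pool.map(render, files))

print(rendered[0])  # rendered:templates/file0.yaml
```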
@bacongobbler @mattfarina Just letting you know that I've reproduced this issue with Helm v3 (tested with v3.1.2).
Hey there,
I was able to reproduce this issue using a Helm 3 client (I couldn't compile the latest stable version 2), but I used version 2 to bootstrap the Tiller server required for OOM.
brew link helm@2 --force
git clone --recurse-submodules https://gerrit.onap.org/r/oom
cd oom/kubernetes
make repo
The problem occurs when the lint process is executed against the onap folder.
~/go/src/k8s.io/helm/bin/helm lint onap
As @kopasiak mentioned, it takes a lot of time to execute the renderWithReferences function.
One thing I noticed is that OOM's charts contain several non-YAML files (*.xml, *.sql, *.sh, etc.) which are included in the lint process. So I'm wondering whether the recAllTpls function should exclude whatever is defined in the .helmignore files, to reduce the work of creating the parent template.
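The idea of honoring the ignore list during lint can be sketched generically (hypothetical Python, not Helm's recAllTpls; note that real .helmignore matching follows .gitignore-style rules, which are stricter than plain fnmatch):

```python
from fnmatch import fnmatch

def load_ignore_patterns(text):
    # Keep non-empty, non-comment lines from a .helmignore-style file.
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.startswith("#")]

def lintable(files, patterns):
    # Drop any file matching an ignore pattern before linting.
    return [f for f in files
            if not any(fnmatch(f, p) for p in patterns)]

patterns = load_ignore_patterns("# non-template payloads\n*.xml\n*.sql\n*.sh\n")
files = ["templates/deployment.yaml", "resources/schema.sql", "resources/init.sh"]
print(lintable(files, patterns))  # ['templates/deployment.yaml']
```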
Where is the chart that is causing this problem? I took a quick glance at the ONAP repo and couldn't find a chart that looked like the one described above.
Has anyone reproduced on any other charts? This is the only report I have seen of such excessively long processing time for a lint
> Where is the chart that is causing this problem? I took a quick glance at the ONAP repo and couldn't find a chart that looked like the one described above.
it's in kubernetes directory:
https://github.com/onap/oom/tree/master/kubernetes/onap
> Has anyone reproduced on any other charts? This is the only report I have seen of such excessively long processing time for a lint
I'm not sure how other people are using Helm, but what I've noticed in the chart library is that charts are usually quite small. In our case, the ONAP chart with all "requirements" enabled is really large (thousands of files).
To me it looks as if some operation (search/insert or similar) is being done with the wrong data type (for example a list instead of a hash map), but that's just my guess. The place that looks suspicious to me is Render() in engine.go, as a lot of time is spent there...
The other thing is that everything seems to be processed on a single CPU, which isn't even under full load, as we spend a lot of time just allocating and releasing memory.
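That list-versus-hash-map guess is easy to illustrate in isolation (a generic Python demonstration, not Helm's actual data structures; Helm itself is Go): probing a list compares against every element, while a hash set calls equality only when stored hashes collide.

```python
class Key:
    comparisons = 0  # counts every equality check performed

    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return hash(self.v)

    def __eq__(self, other):
        Key.comparisons += 1
        return self.v == other.v

keys = [Key(i) for i in range(10_000)]
probe = Key(1_000_000)  # not present in keys

Key.comparisons = 0
probe in keys                 # list: linear scan over all elements
list_cmps = Key.comparisons   # 10000 comparisons

key_set = set(keys)
Key.comparisons = 0
probe in key_set              # set: hash lookup finds no matching hash
set_cmps = Key.comparisons    # 0 comparisons

print(list_cmps, set_cmps)    # 10000 0
```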
I am looking at this problem now. Admittedly, this is a lower-priority issue because it does not seem to impact most users. But I am currently focusing my sprint on helm lint, so it fits in with my workload. I might not be able to fix it, but hopefully I will at least be able to diagnose it.
@technosophos Awesome! Let me know if you need any more details
@technosophos correct me if I'm wrong, but AFAIK the .helmignore files are used to exclude files from Helm chart creation, right? Wouldn't it be better to exclude them from the lint process as well?
It might make sense to extend .helmignore to linting, yes. I'll look into that.
make SKIP_LINT=true is failing with:
[robot]
Makefile:54: *** Submodule robot needs to be retrieved from gerrit. See https://wiki.onap.org/display/DW/OOM+-+Development+workflow+after+code+transfer+to+tech+teams . Stop.
make[1]: *** [submod-robot] Error 2
make: *** [robot] Error 2
Is that really something I need in order to test the linting?
You just need to clone including submodules:
$ git clone --recurse-submodules https://gerrit.onap.org/r/oom
Ah! Thanks. I missed that in the previous set of directions.
Wow! This chart has 2,739 template files! That might actually be bigger than the openstack chart.
Told you :D Are there any awards for having the biggest chart ever? xD
So at the end of the day, my conclusion is not terribly surprising to me. There are two major contributing factors:
1. Tons of copies of the common chart. If the common thing is all the same chart, you can save gobs of time and space by just including it once in the umbrella chart.
2. The tpl function. It is absolutely horrible for performance, because each instance of it has to set up a sandboxed environment and then load all of the necessary templates into that environment. Helm is literally spending hours during lint setting up and tearing down those sandboxes.
For fun, I replaced the body of tpl with a simple function that returned "hello" and a counter. helm lint ran in 2 seconds, and tpl was executed 2276 times. Worst case in your current code: because of tpl, you would be loading about 6.2 million copies of the templates by creating 2276 sandboxes and loading 2739 templates into each sandbox.
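The worst-case arithmetic quoted above checks out:

```python
# Verify the worst-case figure: every tpl invocation builds its own
# sandbox, and every sandbox loads every template in the chart.
sandboxes = 2276   # tpl invocations counted during the experiment
templates = 2739   # template files in the chart
copies = sandboxes * templates
print(copies)      # 6233964 -- about 6.2 million template loads
```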
tpl is evil.
While there is probably some room for optimization in the way the tpl sandbox is created, it's not something I personally have time to do right now. Feel free to dive in there. The best place to start is the tpl function in engine.go. But I would guess that you could get fast results by doing the two things I suggested above.
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.
Hi, @kopasiak - thanks again for reaching out. Along with the advice in https://github.com/helm/helm/issues/7656#issuecomment-620890762, you probably want to look at moving to Helm v3 now if you are still using v2 as implied in "used the version 2 to bootstrap the Tiller server required for OOM". The reason is that Helm v2 is going to be out of support very soon, so it will not be possible to get bugfixes or security patches. See https://helm.sh/blog/helm-v2-deprecation-timeline/ for more details.
Looking to resurrect this thread, as our linting time is now ~11 hrs. Can anyone assist with a solution here? We are killing our CI environments with this.
I am trying to add a local helm repo. Running:
sudo apt install make SKIP_LINT=TRUE
gives me this error:
Unable to locate package SKIP_LINT
Please guide.
Output of helm version: v2.14.2
Output of kubectl version: v1.15.2
Cloud Provider/Platform (AKS, GKE, Minikube etc.): Minikube
We are using Helm to deploy a fairly large open-source project called ONAP.
ONAP consists of submodules that can be deployed individually, plus one "collective chart" called onap that allows you to deploy all components at once. Since we started using our custom templates to simplify our charts, we noticed that the time required to lint the onap chart started growing really fast. Currently it takes around 20-30 minutes depending on the machine, which is several times more than the cumulative time required to lint all our other charts.
I've tried to figure out something on my own, but I'm not a Go specialist, so even with my colleague's help I only managed to collect the CPU profile (cpu_profile_onap.zip) from linting.
We are not sure if this is a bug or just the price you pay for using templates, so we would appreciate your feedback on this.
If you are willing to reproduce the results just do: