operate-first / apps

Operate-first application manifests
GNU General Public License v3.0
51 stars 137 forks source link

Operate-first prow / CI issues from `kubeval` check #2288

Closed Gregory-Pereira closed 1 year ago

Gregory-Pereira commented 2 years ago

Prow has been slow recently, and a few PRs are being blocked due to their kubeval check failing see example. My theory is these performance issues are related to that kubeval check itself, in that it is not currently checking the diff, but the entire repo as you can see in that check example log. It eventually values with an exit code 137 which means it is dying due to memory constraints. Can kubeval be set to run only against the diff produced by the PR @mimotej? If not we can consider upping the resources per PR (although that could become pretty intensive), or we should consider removing the check entirely.

/cc @mimotej

Gregory-Pereira commented 2 years ago

There was a similar issue with precommit a while back. Not sure if this solution is repeatable here but sharing it for reference: https://github.com/thoth-station/prescriptions/pull/27401/files. https://github.com/thoth-station/prescriptions-refresh-job/issues/52#issuecomment-1217942813

Gregory-Pereira commented 2 years ago

So it turns out it is checking for a diff as seen here. However running that against my example pr produces this:

$ PULL_BASE_SHA="281d08f5c6ec49eeecf16d70ee8cde57f7d881fe" 
$ PULL_PULL_SHA="28883326290e823552a955817d37af88ef31857d"
$ bases=$(git diff --name-status $PULL_BASE_SHA $PULL_PULL_SHA | grep -Po "^[^DR]\s+\K.*?(?=/)|^R\w*\s+.*\s+\K.*?(?=/)" | sort | uniq)
$ echo $bases
cluster-scope

As you can see, it returns the entire cluster-scope directory as the resulting diff, instead of the one file: cluster-scope/base/operators.coreos.com/subscriptions/acm-operator-subscription/subscription.yaml

mimotej commented 2 years ago

For now seems that upping resources fo kubeval is enough will have to investigate further why.

/assign

mimotej commented 2 years ago

Little update: Problem seems to be in regex which is used after because what it does it matches highest layer in directory path meaning in this case cluster-scope Screenshot from 2022-08-25 16-06-25

If we figure out how to change this regex to match lowest possible layer in directory path it will fix it... This issue also applies to kustomize-build test there it is only less visible because it takes less resources overall

tumido commented 2 years ago

As you can see, it returns the entire cluster-scope directory as the resulting diff, instead of the one file: cluster-scope/base/operators.coreos.com/subscriptions/acm-operator-subscription/subscription.yaml

Which is the correct outcome. Kubeval can't be run against partial manifests in base/overlays... Keep in mind that we modify and patch those manifests in overlays so for example subscription which requires a channel to be specified gains the value in overlay, etc.

The grep command is matching the root application folder intentionally because so far we don't have any heuristic to determine which base or overlay is including which modified file... And we need to run each change set through kustomize build before we can kubeval it, without it, partial manifests may not and won't fit the api spec resulting in false positives.

Gregory-Pereira commented 2 years ago

Then this seems to be logical and intended behaviour. I think this explanation coupled with your PR to exclude some pre-submit checks based on the files changed should resolve this issue. Would you agree @tumido, and if so can I close this?

sesheta commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

tumido commented 1 year ago

I don't think this is an issue anymore.