kata-containers / packaging

Kata Containers version 1.x packaging (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

static-build: build the real static binary for clh #1099

Closed likebreath closed 4 years ago

likebreath commented 4 years ago

This patch reuses the 'dev_cli.sh' script from the cloud-hypervisor repo to build a static binary of clh without creating our own Docker image or installing extra dependencies.

Fixes: #1033

Signed-off-by: Bo Chen chen.bo@intel.com

Note: As we don't have full CI coverage in this repo, I am using a dummy PR in the tests repo (https://github.com/kata-containers/tests/pull/2733) to verify that this change passes all CI jobs related to cloud-hypervisor.
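For context, a minimal sketch of what the static build amounts to under the hood; the actual dev_cli.sh run happens inside cloud-hypervisor's own build container, and the musl target below is an assumption for illustration:

```bash
# Illustrative only: cloud-hypervisor built as a statically linked binary
# against musl. dev_cli.sh performs the equivalent inside its own build
# container, so no extra dependencies are needed on the host.
git clone https://github.com/cloud-hypervisor/cloud-hypervisor.git
cd cloud-hypervisor

rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl

# Confirm the result is actually static.
file target/x86_64-unknown-linux-musl/release/cloud-hypervisor
```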

likebreath commented 4 years ago

/test-ubuntu

likebreath commented 4 years ago

/test-clh

likebreath commented 4 years ago

As pointed out by @jcvenegas, the clh CI is currently broken and is being fixed by two pending PRs (#1098 and https://github.com/kata-containers/runtime/pull/2843). We will need to land those two PRs before we can merge this one.

likebreath commented 4 years ago

/test-clh

likebreath commented 4 years ago

/test-clh

likebreath commented 4 years ago

/test

amshinde commented 4 years ago

/test-clh /test-ubuntu /AzurePipelines run

jcvenegas commented 4 years ago

/AzurePipelines run

azure-pipelines[bot] commented 4 years ago

Azure Pipelines successfully started running 2 pipeline(s).
likebreath commented 4 years ago

/test-ubuntu

likebreath commented 4 years ago

/test-clh

likebreath commented 4 years ago

/test-clh /test-ubuntu /AzurePipelines run

jcvenegas commented 4 years ago

/AzurePipelines run

azure-pipelines[bot] commented 4 years ago

Azure Pipelines successfully started running 2 pipeline(s).
likebreath commented 4 years ago

Although the PR itself passed all necessary checks, we are still facing a metrics CI failure, as shown by the dummy PR from the tests repo (https://github.com/kata-containers/tests/pull/2733). So I marked it as DNM for now.

The metrics CI is failing again on the check over the minimal memory footprint (and only for KSM-enabled instances): 11:35:00 time="2020-08-14T18:35:00Z" level=warning msg="Failed Minval ( 269742.458500(minimal mean expected) > 269741.400000(minimal mean on dataset)) for [memory-footprint-ksm]". The full log is here: http://jenkins.katacontainers.io/job/kata-metrics-tests-ubuntu-clh-18-04-PR/85/console.

One interesting observation is that the numbers are very close (a difference of about 1 part in 269,000). The question is: how do we want to adjust the metrics CI to make sure it works with the static build of cloud-hypervisor? We can either 1) remove the check over the minimal memory footprint, or 2) adjust the reference number in the metrics CI. WDYT? @GabyCT @jcvenegas @amshinde @egernst

egernst commented 4 years ago

Once we understand the reason for the increased footprint, it would make sense to update the reference value.

Do we have more details on why? @likebreath @sameo?

likebreath commented 4 years ago

> Once we understand the reason for the increased footprint, it would make sense to update the reference value.
>
> Do we have more details on why? @likebreath @sameo?

@egernst Thanks for the prompt response. My understanding is that the check is failing not because of an increased memory footprint, but the opposite: the actual minimal memory footprint is now smaller than the reference value. This is how I read the error message: 11:35:00 time="2020-08-14T18:35:00Z" level=warning msg="Failed Minval ( 269742.458500(minimal mean expected) > 269741.400000(minimal mean on dataset)) for [memory-footprint-ksm]". Please correct me if I am wrong.
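To make the comparison concrete, here is a minimal sketch of how I read that check (not the actual checkmetrics code; the numbers are the ones from the log above):

```bash
# Minimal sketch of the minval check as I understand it (not the real
# checkmetrics implementation). The job fails whenever the measured mean
# drops below the configured minimal mean.
expected_min=269742.458500   # reference value from the CI config
measured_mean=269741.400000  # mean computed from this run's dataset

if awk -v min="$expected_min" -v mean="$measured_mean" 'BEGIN { exit !(mean < min) }'; then
    echo "Failed Minval ( ${expected_min}(minimal mean expected) > ${measured_mean}(minimal mean on dataset) )"
fi
```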

likebreath commented 4 years ago

@kata-containers/packaging Please comment on the discussion above about how we want to proceed with the metrics CI failure, where the current minimal memory footprint is slightly smaller than the reference value.

Also, as we are preparing for the next release, do we want to include this as part of that release?

Thanks.

amshinde commented 4 years ago

@likebreath Is the decrease in memory footprint only in the case of KSM? I thought that you were seeing an increase in memory footprint for the static binary, which would make sense to me.

In any case, I feel we should remove the check on the min value; we shouldn't really trigger a failure when we are seeing lower memory overhead. I would also readjust the min limits. @egernst @kata-containers/packaging Any thoughts on this?

likebreath commented 4 years ago

> @likebreath Is the decrease in memory footprint only in the case of KSM?

Right.

> I thought that you were seeing an increase in memory footprint for the static binary, which would make sense to me.

That is also what I thought when we got the initial metrics CI failure with this PR. For now (after several changes across the metrics CI, the runtime, and packaging), the check on minval is the only remaining failure.

> In any case, I feel we should remove the check on the min value; we shouldn't really trigger a failure when we are seeing lower memory overhead. I would also readjust the min limits.

This makes sense to me. Any pointers/help to adjust this in the CI?

amshinde commented 4 years ago

> This makes sense to me. Any pointers/help to adjust this in the CI?

@GabyCT would be the best person to help you here.

chavafg commented 4 years ago

This is the file that needs to be updated: https://github.com/kata-containers/tests/blob/master/cmd/checkmetrics/ci_slaves/checkmetrics-json-clh-baremetal-kata-metric3.toml
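For anyone else touching this, a rough sketch of how the edited bounds could be sanity-checked locally; the checkmetrics flag names below are assumptions on my part, so please verify against the tests repo:

```bash
# Sketch only: rebuild checkmetrics and re-run it against the edited TOML.
# The flag names below are assumptions; verify with `checkmetrics --help`.
git clone https://github.com/kata-containers/tests.git
cd tests
go build -o checkmetrics ./cmd/checkmetrics

./checkmetrics \
    --basefile cmd/checkmetrics/ci_slaves/checkmetrics-json-clh-baremetal-kata-metric3.toml \
    --metricsdir /path/to/metrics/results  # JSON results from a local metrics run
```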

likebreath commented 4 years ago

FYI: the pending PR https://github.com/kata-containers/tests/pull/2809 loosens the check on minval, which should resolve the metrics CI failure we have with this PR.

likebreath commented 4 years ago

Please help re-run the AzurePipelines; I just rebased on master. I will run the CLH-related CI jobs from the dummy PR https://github.com/kata-containers/tests/pull/2733 to make sure all CI jobs are passing (especially the metrics CI). /test

likebreath commented 4 years ago

I am seeing some instability in our clh-docker CI, which can mostly be resolved by restarting the CI. After talking with @jcvenegas, we believe it might be related to the seccomp filters defined in clh.

The static build of cloud-hypervisor uses a different set of syscalls from its dynamic build, and the static build is less exercised by our Kata CI, so we may see more instability when using it. I am setting up a local environment to manually run the clh-docker CI tests, from which I hope to find any syscalls missing from the seccomp list in clh (see the sketch below).
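As a starting point for that local run, something along these lines should surface the syscalls the static binary actually makes (illustrative only; the kernel/rootfs paths are placeholders, and the real CI workload would be driven through kata-runtime rather than by launching clh directly):

```bash
# Illustrative only: collect a summary of the syscalls the static clh
# binary makes while booting a minimal guest (paths are placeholders).
strace -f -c -o clh-syscalls.txt \
    ./cloud-hypervisor-static \
        --kernel /path/to/vmlinux \
        --disk path=/path/to/rootfs.img \
        --cmdline "console=ttyS0 root=/dev/vda1"

# Any syscall reported here but missing from clh's seccomp allow list is a
# candidate cause for the instability we are seeing.
cat clh-syscalls.txt
```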

Meanwhile, for the sake of stability, we may need to keep using the dynamic build of clh in our current beta release of Kata.

Also, I am thinking we may want to disable the seccomp functionality of clh in Kata for now, while we complete the seccomp filter list based on Kata's CI jobs/workloads.

/cc @egernst @jcvenegas @amshinde @sameo @sboeuf

sboeuf commented 4 years ago

@likebreath thanks for the summary of your investigation. I agree that disabling seccomp while you track down the missing syscalls is acceptable.

likebreath commented 4 years ago

@egernst @amshinde Please run the AzurePipelines; I think the PR is ready to be merged.

With the seccomp option of clh now temporarily disabled in kata-runtime, I no longer see the instability in the clh-docker CI (tested in https://github.com/kata-containers/tests/pull/2733). So I removed the DNM from this PR.

jodh-intel commented 4 years ago

/AzurePipelines run

azure-pipelines[bot] commented 4 years ago

Azure Pipelines successfully started running 2 pipeline(s).