Closed likebreath closed 4 years ago
/test-ubuntu
/test-clh
As pointed out by @jcvenegas, the clh CI is now broken and is being fixed by two pending PRs (#1098 and https://github.com/kata-containers/runtime/pull/2843). We will need to land those two PRs before we can merge this one.
/test-clh
/test-clh
/test
/test-clh /test-ubuntu /AzurePipelines run
/AzurePipelines run
/test-ubuntu
/test-clh
/test-clh /test-ubuntu /AzurePipelines run
/AzurePipelines run
Although the PR itself passed all necessary checks, we are still facing a metrics CI failure, as shown by the dummy PR from the tests repo (https://github.com/kata-containers/tests/pull/2733). So I marked it as DNM for now.
The metrics CI is failing again on the check over the minimal memory footprint (and only for KSM-enabled instances): 11:35:00 time="2020-08-14T18:35:00Z" level=warning msg="Failed Minval ( 269742.458500(minimal mean expected) > 269741.400000(minimal mean on dataset)) for [memory-footprint-ksm]". Full log: http://jenkins.katacontainers.io/job/kata-metrics-tests-ubuntu-clh-18-04-PR/85/console.
One interesting observation is that the numbers are very close (a difference of about 1 part in 269,000). The question is: how do we want to adjust the metrics CI so it works with the static build of cloud-hypervisor? We can either 1) remove the check over the minimal memory footprint, or 2) adjust the reference value in the metrics CI. WDYT? @GabyCT @jcvenegas @amshinde @egernst
Once we understand the reason for the increased footprint, it would make sense to update the reference value.
We have more details on why? @likebreath @sameo?
Once we understand the reason for the increased footprint, it would make sense to update the reference value.
We have more details on why? @likebreath @sameo?
@egernst Thanks for the prompt response. My understanding is that the check is failing not because of an increased memory footprint, but the opposite: the actual minimal memory footprint is now smaller than the reference value. This is how I read the error message: 11:35:00 time="2020-08-14T18:35:00Z" level=warning msg="Failed Minval ( 269742.458500(minimal mean expected) > 269741.400000(minimal mean on dataset)) for [memory-footprint-ksm]". Please correct me if I am wrong.
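For context, the gate that is tripping reduces to a simple numeric comparison: the run fails whenever the measured minimal mean falls below the expected minimum. A minimal sketch of that logic (the function name and the awk approach are mine, not the actual checkmetrics code):

```shell
# Hypothetical sketch of the minval gate: fail when the measured
# minimal mean drops below the expected minimal mean.
check_minval() {
    # $1 = expected minimal mean, $2 = measured minimal mean on dataset
    awk -v want="$1" -v got="$2" 'BEGIN { exit !(got >= want) }'
}

# The values from the failing log above: 269741.40 < 269742.4585,
# so the check fails by roughly 1 part in 269,000.
check_minval 269742.458500 269741.400000 || echo "Failed Minval"
```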
@kata-containers/packaging Please comment on the above discussions about how we want to proceed with the metrics CI failure, where the current minimum memory footprint is slightly smaller than the reference value.
Also, as we are preparing for next release, do we want to have it as a part of that release?
Thanks.
@likebreath Is the decrease in memory footprint only in case of KSM? I thought that you were seeing an increase in memory footprint for static binary, which makes sense to me.
In any case, I feel we should remove the check on the min value; we shouldn't really trigger a failure when we are seeing lower memory overhead. I would also readjust the min limits. @egernst @kata-containers/packaging Any thoughts on this?
@likebreath Is the decrease in memory footprint only in case of KSM?
Right.
I thought that you were seeing an increase in memory footprint for static binary, which makes sense to me.
That is also what I thought when we got the initial metrics CI failure with this PR. For now (after several changes across the metrics CI/runtime/packaging), the check on minval is the only remaining failure.
In any case, I feel we should remove the check on the min value, we should'nt really trigger a failure when we are seeing lower memory overhead. I would also readjust the min limits.
This makes sense to me. Any pointers/help to adjust this in the CI?
This makes sense to me. Any pointers/help to adjust this in the CI?
@GabyCT would be the best person to help you here.
This is the file that needs to be updated: https://github.com/kata-containers/tests/blob/master/cmd/checkmetrics/ci_slaves/checkmetrics-json-clh-baremetal-kata-metric3.toml
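For reference, the entries in that file follow the checkmetrics TOML schema, where each metric declares a baseline value and an allowed band around it. A rough, illustrative entry is sketched below; the field names are from my reading of the checkmetrics format, and the values are made up, not the real baseline:

```toml
[[metric]]
# Illustrative only: the values here are NOT the real baseline.
name        = "memory-footprint-ksm"
type        = "json"
description = "minimal memory footprint with KSM enabled"
checkvar    = ".Results | .[] | .average.Result"
checktype   = "mean"
midval      = 270000.0
minpercent  = 20.0
maxpercent  = 20.0
```

Loosening the check would then amount to widening minpercent (or dropping the lower bound), rather than chasing an exact reference number.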
FYI: the pending PR https://github.com/kata-containers/tests/pull/2809 loosens the check on minval, which should resolve the metrics CI failure we have with this PR.
Please help re-run the AzurePipelines. Just rebased on master. I will run the CLH-related CI jobs from the dummy PR https://github.com/kata-containers/tests/pull/2733 to make sure all CI jobs are passing (especially the metrics CI). /test
I see some instability in our clh-docker CI, which mostly can be resolved by restarting the CI. After talking with @jcvenegas, we believe it might be related to the seccomp filters defined in clh.
The static build of cloud-hypervisor uses a different set of syscalls from its dynamic build. The static build of clh is less tested in our kata CI, so we may see more instability when using it. I am setting up a local environment to manually run the clh-docker CI tests, from which I hope to find any syscalls missing from the seccomp list in clh.
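As a rough idea (not part of this PR), one way to hunt for missing syscalls is to run with seccomp denials visible in the kernel log and scrape the audit messages for the denied syscall numbers. The helper below is a hypothetical sketch; it assumes the standard kernel seccomp audit line format (type=1326, containing `syscall=<nr>`), and decoding numbers to names (e.g. with `ausyscall` from the audit package) is left as a manual step:

```shell
# Hypothetical helper: pull the syscall numbers out of kernel seccomp
# audit lines (type=1326), e.g. from `dmesg`, so they can be compared
# against the allow-list in clh.
extract_seccomp_syscalls() {
    grep -o 'syscall=[0-9]*' | cut -d= -f2 | sort -nu
}

# Usage (assumes seccomp denials show up in the kernel log):
#   dmesg | extract_seccomp_syscalls
```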
Meanwhile, for the sake of stability, we may need to keep using the dynamic build of clh in our current beta release of Kata.
Also, I am thinking we may want to disable the seccomp functionality of clh in Kata for now, while we complete the seccomp filter list based on Kata's CI jobs/workloads.
/cc @egernst @jcvenegas @amshinde @sameo @sboeuf
@likebreath thanks for the summary about your investigations. I agree disabling seccomp while you try to find out the missing syscalls is acceptable.
@egernst @amshinde Please run the AzurePipeline and I think the PR is ready for being merged.
The seccomp option of clh is now temporarily disabled in kata-runtime, and I no longer see the instability of the clh-docker CI (tested in https://github.com/kata-containers/tests/pull/2733). So I removed the DNM from this PR.
/AzurePipelines run
This patch reuses the 'dev_cli.sh' script from the cloud-hypervisor repo to build a static binary of clh, without creating our own docker image or installing extra dependencies.
Fixes: #1033
Signed-off-by: Bo Chen chen.bo@intel.com
Note: As we don't have full CI coverage in this repo, I am using a dummy PR from the tests repo (https://github.com/kata-containers/tests/pull/2733) to verify this change is passing all CI jobs related to cloud-hypervisor.