kubevirt / containerized-data-importer

Data Import Service for kubernetes, designed with kubevirt in mind.
Apache License 2.0

Prlimit Unit Test fails on s390x #3341

Closed Davo911 closed 1 week ago

Davo911 commented 1 month ago

What happened: In #3226 it was discovered that prlimit_test.go fails on s390x when using a specific Go version. The Go version was then bumped to 1.22.3 and the issue was closed, but the problem is not yet resolved.

• [FAILED] [0.010 seconds]
Process Limits exec [It] command success with real limits
/root/go/src/kubevirt.io/containerized-data-importer/pkg/system/prlimit_test.go:62

  [FAILED] Expected
      <string>: unexpected fault address 0x180000
       fatal error: fault
       [signal SIGSEGV: segmentation violation code=0x2 addr=0x180000 pc=0x18002a]
       goroutine 1 gp=0xc0000021c0 m=0 mp=0x8241a0 [running, locked to thread]:
       runtime.throw({0x4edfa2...

There are currently 2 possible fixes for this:

  1. omitting the -coverprofile=.coverprofile flag from the go test command https://github.com/kubevirt/containerized-data-importer/blob/364d582f76d5f50b23ecc1321537908d07139767/hack/build/run-unit-tests.sh#L26
  2. increasing the memory limit from 1 GiB (1 << 30) to 2 GiB (1 << 31) https://github.com/kubevirt/containerized-data-importer/blob/364d582f76d5f50b23ecc1321537908d07139767/pkg/system/prlimit_test.go#L62

The second option seems to be the correct one, but I'm not sure, since the two fixes don't seem related to each other.
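For reference, the byte values behind the two shift expressions in option 2 (plain arithmetic, not CDI code; the constant names are made up):

```go
// Illustrative constants only; the test itself just uses the literal shifts.
const (
    currentLimit  = 1 << 30 // 1073741824 bytes = 1 GiB
    proposedLimit = 1 << 31 // 2147483648 bytes = 2 GiB
)
```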


What you expected to happen: Unit Tests Succeeding

How to reproduce it (as minimally and precisely as possible): On an s390x LPAR:

  1. clone main
  2. export BUILDER_IMAGE=<ported builder image for s390x>
  3. Either a) go test -coverprofile=.coverprofile ./pkg/system/... or b) ./hack/build/bazel-docker.sh ./hack/build/run-unit-tests.sh ./pkg/system/...

Environment:

akalenyu commented 1 month ago
  • CDI version: v1.58.1

So this branch of CDI is using older golang, maybe that is the problem?

Davo911 commented 1 month ago

This was tested with Go 1.22.3 and 1.22.4. The original issue (#3226) also needed the coverprofile flag omitted as a workaround.

akalenyu commented 1 month ago

The unit test run is containerized; this is why I'm implying the branch matters.

Davo911 commented 1 month ago

Yep, thanks for pointing that out. I noted the CDI version of the instance deployed on my cluster by mistake, but the branch I'm running the unit tests on here is the latest (1.59).

akalenyu commented 1 month ago

Go 1.22 only made it to main, so it's not in 1.59

Davo911 commented 1 month ago

Oh, I meant main. What would be the arguments against increasing that limit to 2GB, at least architecture-dependent? As far as I understand it, this 1GB stems from a recommendation made by OpenStack to prevent the allocation of an arbitrary amount of memory when using qemu-img.
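For context, a minimal sketch of the technique being discussed, assuming the limit is applied to a child qemu-img process via prlimit(2) through golang.org/x/sys/unix. This is not CDI's actual pkg/system code, and the helper name is made up:

```go
// Hypothetical illustration only: start qemu-img and cap its address space so
// a malformed image cannot make it allocate an arbitrary amount of memory.
package main

import (
    "fmt"
    "os/exec"

    "golang.org/x/sys/unix"
)

// runQemuImgWithAddressSpaceLimit starts qemu-img and then applies RLIMIT_AS
// to the child via prlimit(2). limitBytes would be 1 << 30 (1 GiB) today.
func runQemuImgWithAddressSpaceLimit(limitBytes uint64, args ...string) error {
    cmd := exec.Command("qemu-img", args...)
    if err := cmd.Start(); err != nil {
        return err
    }
    rlim := &unix.Rlimit{Cur: limitBytes, Max: limitBytes}
    if err := unix.Prlimit(cmd.Process.Pid, unix.RLIMIT_AS, rlim, nil); err != nil {
        _ = cmd.Process.Kill()
        return fmt.Errorf("setting address space limit: %w", err)
    }
    return cmd.Wait()
}

func main() {
    if err := runQemuImgWithAddressSpaceLimit(1<<30, "info", "disk.img"); err != nil {
        fmt.Println("qemu-img failed:", err)
    }
}
```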

akalenyu commented 1 month ago

Yeah, that sounds about right, but ideally I would like to understand what exactly it is that s390x doesn't like in this particular case. Is there an illegitimate request in there? It seems completely fine to me.

akalenyu commented 1 month ago

We discussed this in the sig-storage call; what we can do as a quick fix is bump the memory limit to 2Gi depending on the arch.

However, ideally, we would like to understand specifically why this fails on s390x. We have some ideas: maybe the coverage tracking is really expensive and badly optimized for s390x, or maybe we were always close to exhausting the memory in the unit tests.
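A rough sketch of what that arch-dependent quick fix could look like; the function name and placement are assumptions, not the merged change:

```go
package system

import "runtime"

// defaultTestAddressSpaceLimit is a hypothetical helper illustrating the
// quick fix discussed above: 2 GiB on s390x, 1 GiB on other architectures.
func defaultTestAddressSpaceLimit() uint64 {
    if runtime.GOARCH == "s390x" {
        return 1 << 31 // 2 GiB
    }
    return 1 << 30 // 1 GiB
}
```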

Davo911 commented 3 weeks ago

I haven't yet been successful in troubleshooting the issue through memory profiling or valgrind, but I found a value where the test gets flaky: at a limit of 1509000000 bytes (roughly 1.4 GiB), the test succeeds about 40-50% of the time.

Davo911 commented 2 weeks ago

Additionally, the same error can happen under amd64: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt%3A1206-1240

Further investigation is necessary here.

akalenyu commented 1 week ago

> Additionally, the same error can happen under amd64: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt%3A1206-1240
>
> Further investigation is necessary here.

Interesting... I would run the unit tests locally and follow the memory consumption of the container, but if I'm not mistaken this simply means that the 8Gi limit on the CI job for unit tests does not cut it, which is surprising to me even if this unit test alone is taking 1Gi.

Davo911 commented 1 week ago

> Additionally, the same error can happen under amd64: prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt:1206-1240
>
> Further investigation is necessary here.

Small correction here, to pull back a little: the error coming up on that flaky amd64 run is not the same as the original s390x one.

<string>: fatal error: runtime: cannot allocate memory

vs.

<string>: unexpected fault address 0x180000
fatal error: fault
[signal SIGSEGV: segmentation violation

The first one can easily be reproduced by lowering the limit to 1 << 29 (512 MiB) or below.
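If it helps, a rough way to provoke that first failure mode outside the test suite is to cap a process's own address space and allocate past it. This is a sketch under the assumption that the lowered limit hits the instrumented test binary the same way; the exact failure message can vary by Go version and platform:

```go
// Sketch: cap our own address space with setrlimit(2) and allocate until the
// Go runtime can no longer obtain memory from the OS and aborts.
package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    limit := uint64(1 << 29) // 512 MiB, the value mentioned above
    rlim := &unix.Rlimit{Cur: limit, Max: limit}
    if err := unix.Setrlimit(unix.RLIMIT_AS, rlim); err != nil {
        panic(err)
    }
    var bufs [][]byte
    for i := 0; ; i++ {
        bufs = append(bufs, make([]byte, 64<<20)) // 64 MiB per iteration
        for j := range bufs[i] {
            bufs[i][j] = 1 // touch the pages so they are actually mapped
        }
        fmt.Println("allocated", (i+1)*64, "MiB")
    }
}
```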

akalenyu commented 5 days ago

Yeah, tbh, today I noticed the unit test container spike over 4Gi just using docker stats. As I see it, runtime: cannot allocate memory is not really about exhausting the available memory; I think in that case we would have seen a segfault.