red-hat-storage / ocs-ci

https://ocs-ci.readthedocs.io/en/latest/
MIT License

Add capacity test is failing with msg OSError: [Errno 12] Cannot allocate memory #6489

Closed. prsurve closed this issue 1 year ago.

prsurve commented 2 years ago

    def finalizer():
        if not skipped:
            try:
                teardown = config.RUN["cli_params"]["teardown"]
                skip_ocs_deployment = config.ENV_DATA["skip_ocs_deployment"]
                ceph_cluster_installed = config.RUN.get("cephcluster")
                if not (
                    teardown
                    or skip_ocs_deployment
                    or mcg_only_deployment
                    or not ceph_cluster_installed
                ):
                    ceph_health_check_base()

tests/conftest.py:1422:

ocs_ci/utility/utils.py:2059: in ceph_health_check_base
    run_cmd(
ocs_ci/utility/utils.py:473: in run_cmd
    completed_process = exec_cmd(
ocs_ci/utility/utils.py:606: in exec_cmd
    completed_process = subprocess.run(
/usr/lib64/python3.8/subprocess.py:493: in run
    with Popen(*popenargs, **kwargs) as process:
/usr/lib64/python3.8/subprocess.py:858: in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,


self = <subprocess.Popen object at 0x7fc5b015a5b0>
args = ['oc', 'wait', '--for', 'condition=ready', 'pod', '-l', ...]
executable = b'oc', preexec_fn = None, close_fds = True, pass_fds = ()
cwd = None, env = None, startupinfo = None, creationflags = 0, shell = False
p2cread = 20, p2cwrite = 21, c2pread = 24, c2pwrite = 25, errread = 26
errwrite = 27, restore_signals = True, start_new_session = False

def _execute_child(self, args, executable, preexec_fn, close_fds,
                   pass_fds, cwd, env,
                   startupinfo, creationflags, shell,
                   p2cread, p2cwrite,
                   c2pread, c2pwrite,
                   errread, errwrite,
                   restore_signals, start_new_session):
    """Execute program (POSIX version)"""

    if isinstance(args, (str, bytes)):
        args = [args]
    elif isinstance(args, os.PathLike):
        if shell:
            raise TypeError('path-like args is not allowed when '
                            'shell is true')
        args = [args]
    else:
        args = list(args)

    if shell:
        # On Android the default shell is at '/system/bin/sh'.
        unix_shell = ('/system/bin/sh' if
                  hasattr(sys, 'getandroidapilevel') else '/bin/sh')
        args = [unix_shell, "-c"] + args
        if executable:
            args[0] = executable

    if executable is None:
        executable = args[0]

    sys.audit("subprocess.Popen", executable, args, cwd, env)

    if (_USE_POSIX_SPAWN
            and os.path.dirname(executable)
            and preexec_fn is None
            and not close_fds
            and not pass_fds
            and cwd is None
            and (p2cread == -1 or p2cread > 2)
            and (c2pwrite == -1 or c2pwrite > 2)
            and (errwrite == -1 or errwrite > 2)
            and not start_new_session):
        self._posix_spawn(args, executable, env, restore_signals,
                          p2cread, p2cwrite,
                          c2pread, c2pwrite,
                          errread, errwrite)
        return

    orig_executable = executable

    # For transferring possible exec failure from child to parent.
    # Data format: "exception name:hex errno:description"
    # Pickle is not used; it is complex and involves memory allocation.
    errpipe_read, errpipe_write = os.pipe()
    # errpipe_write must not be in the standard io 0, 1, or 2 fd range.
    low_fds_to_close = []
    while errpipe_write < 3:
        low_fds_to_close.append(errpipe_write)
        errpipe_write = os.dup(errpipe_write)
    for low_fd in low_fds_to_close:
        os.close(low_fd)
    try:
        try:
            # We must avoid complex work that could involve
            # malloc or free in the child process to avoid
            # potential deadlocks, thus we do all this here.
            # and pass it to fork_exec()

            if env is not None:
                env_list = []
                for k, v in env.items():
                    k = os.fsencode(k)
                    if b'=' in k:
                        raise ValueError("illegal environment variable name")
                    env_list.append(k + b'=' + os.fsencode(v))
            else:
                env_list = None  # Use execv instead of execve.
            executable = os.fsencode(executable)
            if os.path.dirname(executable):
                executable_list = (executable,)
            else:
                # This matches the behavior of os._execvpe().
                executable_list = tuple(
                    os.path.join(os.fsencode(dir), executable)
                    for dir in os.get_exec_path(env))
            fds_to_keep = set(pass_fds)
            fds_to_keep.add(errpipe_write)
            self.pid = _posixsubprocess.fork_exec(
                    args, executable_list,
                    close_fds, tuple(sorted(map(int, fds_to_keep))),
                    cwd, env_list,
                    p2cread, p2cwrite, c2pread, c2pwrite,
                    errread, errwrite,
                    errpipe_read, errpipe_write,
                    restore_signals, start_new_session, preexec_fn)
E           OSError: [Errno 12] Cannot allocate memory

/usr/lib64/python3.8/subprocess.py:1639: OSError

Test result:- https://url.corp.redhat.com/46968f2

logs:- https://url.corp.redhat.com/79e857c
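For context, the OSError above is raised when fork() inside _posixsubprocess.fork_exec fails: when the parent pytest process has grown large and the agent is low on memory or swap, the kernel can refuse to fork it, even though the child would immediately exec a small binary such as oc. A minimal sketch (not part of ocs-ci; the helper names are hypothetical) for logging the parent process memory from /proc/self/status before spawning commands, to check whether framework memory growth correlates with these failures:

import subprocess


def log_self_memory(log=print):
    # Print the VmRSS/VmSize/VmSwap lines of the current (parent) process.
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith(("VmRSS", "VmSize", "VmSwap")):
                log(line.strip())


def run_with_memory_log(cmd):
    # Hypothetical helper: log parent memory, then run the command.
    log_self_memory()
    return subprocess.run(cmd, capture_output=True, check=False)


if __name__ == "__main__":
    run_with_memory_log(["oc", "get", "pods", "-n", "openshift-storage"])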

petr-balogh commented 2 years ago

From what I've observed, it happens when we have an upgrade run, and this test is also run before the upgrade: test_add_capacity_pre_upgrade.

So I think we can suspect some issue in the add_capacity operations when they happen twice, once pre-upgrade and a second time post-upgrade in the tier execution.

I checked some past executions of the job you mentioned, @Pratik Surve. Previously we ran only the upgrade_ocp marker, but some time back we re-introduced the pre and post upgrade tests for the OCP upgrade job, and from that time I see the add capacity test failing with cannot allocate memory.

https://url.corp.redhat.com/c0a49d4

Here is the OCS-upgrade-only job, and I see such failures in the May executions.

am-agrawa commented 1 year ago

> From what I've observed, it happens when we have an upgrade run, and this test is also run before the upgrade: test_add_capacity_pre_upgrade.
>
> So I think we can suspect some issue in the add_capacity operations when they happen twice, once pre-upgrade and a second time post-upgrade in the tier execution.
>
> I checked some past executions of the job you mentioned, @Pratik Surve. Previously we ran only the upgrade_ocp marker, but some time back we re-introduced the pre and post upgrade tests for the OCP upgrade job, and from that time I see the add capacity test failing with cannot allocate memory.
>
> https://url.corp.redhat.com/c0a49d4
>
> Here is the OCS-upgrade-only job, and I see such failures in the May executions.

@petr-balogh What should be the fix? Shall we remove the pre_upgrade run of add capacity?

petr-balogh commented 1 year ago

@ebenahar can you please take a look?

This also runs for the ODF pre-upgrade, not only OCP. Do we see this only in OCP upgrade runs? Also, we are now running some jobs with both OCP and ODF upgrades in a single run. It will not be an easy task, but it is worth investing some time to find the root cause of what is consuming that much memory and causing these issues.

ebenahar commented 1 year ago

Just to make sure I get it: since we clubbed both ODF and OCP upgrades into the same run, we now do add capacity pre and post both the OCP and ODF upgrades, and also during tier1? That sums up to 5 add capacity operations. Please correct me if this is not the case. If it is, I think we had better disable the add capacity that is done pre and post the OCP upgrade.

petr-balogh commented 1 year ago

My understanding is that we do this twice: once as a pre-upgrade test, and a second time when we run tier1 after the upgrade.

ebenahar commented 1 year ago

@petr-balogh @prsurve how many ODF devicesets do we have at the end of the execution?

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6949/289361/289647/289648/log
Run ID: 1669941053
Test Case: test_add_capacity
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI KMS VAULT V1 1AZ RHCOS VSAN 3M 3W tier1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6276/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-078vukv11cs33-t1/j-078vukv11cs33-t1_20221201T233022/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6801/281551/281558/281562/log
Run ID: 1669747460
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: AWS IPI FIPS ENCRYPTION 3AZ RHCOS 3M 3W 3I tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6182/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002aife3c333-uba/j-002aife3c333-uba_20221129T050922/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6794/281166/281450/281451/log
Run ID: 1669738462
Test Case: test_add_capacity
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI KMS VAULT V1 1AZ RHCOS VSAN 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6207/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003vukv11cs33-uba/j-003vukv11cs33-uba_20221129T113107/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6790/280850/280857/280860/log
Run ID: 1669734217
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.13
Job name: AWS IPI 3AZ RHCOS 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6183/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-157ai3c33-uo/j-157ai3c33-uo_20221129T063027/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6773/279533/279817/279819/log
Run ID: 1669717720
Test Case: test_add_capacity_lso
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI ENCRYPTION 1AZ RHCOS VSAN LSO VMDK 3M 3W tier1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6203/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-092vue1cslv33-t1/j-092vue1cslv33-t1_20221129T092854/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6765/279195/279202/279205/log
Run ID: 1669705224
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: AWS IPI 3AZ RHCOS 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6178/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-011ai3c33-uba/j-011ai3c33-uba_20221128T175819/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"

ebenahar commented 1 year ago

This issue was encountered once again. Run details:

URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6545/269251/269425/269426/log
Run ID: 1669185621
Test Case: test_create_storageclass_rbd
ODF Build: 4.12.0-114
OCP Version: 4.12
Job name: VSPHERE6 UPI Disconnected 1AZ RHCOS VSAN 3M 3W tier2
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6119/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-001vud1cs33-t2/j-001vud1cs33-t2_20221123T044902/logs/
Error Message: OSError: [Errno 12] Cannot allocate memory

DanielOsypenko commented 1 year ago

It is now failing during wait_for_rebalance(), but with the same error: OSError: [Errno 12] Cannot allocate memory.

Reason: the cannot-allocate-memory error occurs when calling:

def get_rebalance_status(self):
    """
    This function gets the rebalance status

    Returns:
        bool: True if rebalance is completed, False otherwise

    """

    ceph_pod = pod.get_ceph_tools_pod()
    ceph_status = ceph_pod.exec_ceph_cmd(ceph_cmd="ceph status")
    ceph_health = ceph_pod.exec_ceph_cmd(ceph_cmd="ceph health")
    total_pg_count = ceph_status["pgmap"]["num_pgs"]
    pg_states = ceph_status["pgmap"]["pgs_by_state"]
    logger.info(ceph_health)
    logger.info(pg_states)
    for states in pg_states:
        return (
            states["state_name"] == "active+clean"
            and states["count"] == total_pg_count
        )

Too much memory is consumed by this command: each output of the procedure takes ~2 MB (more than 30k lines of text), and the command is invoked too aggressively, 53 times in 10 minutes. The most memory-consuming part is oc -n openshift-storage get Pod -n openshift-storage -o yaml.
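One possible mitigation (a sketch under assumptions, not the ocs-ci implementation): fetch only what the check needs instead of dumping every Pod as YAML, and derive both the health summary and the PG state from a single ceph status --format json call, dropping the extra ceph health call. The openshift-storage namespace and the app=rook-ceph-tools label are assumptions here:

import json
import subprocess

NAMESPACE = "openshift-storage"  # assumed ODF namespace


def get_tools_pod_name():
    # Ask only for the tools pod name instead of `get Pod -o yaml` for all pods.
    out = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pod", "-l", "app=rook-ceph-tools",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def rebalance_complete():
    # A single `ceph status` call provides both the health summary and PG state.
    out = subprocess.run(
        ["oc", "-n", NAMESPACE, "rsh", get_tools_pod_name(),
         "ceph", "status", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    status = json.loads(out.stdout)
    print(status["health"]["status"])
    total_pgs = status["pgmap"]["num_pgs"]
    return any(
        state["state_name"] == "active+clean" and state["count"] == total_pgs
        for state in status["pgmap"]["pgs_by_state"]
    )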

@ebenahar @petr-balogh

ebenahar commented 1 year ago

@DanielOsypenko would increasing the Jenkins agent memory resolve this issue? @shyRozen FYI, you also had the idea of increasing the swap on the agents.

DanielOsypenko commented 1 year ago

@ebenahar both things may help. Please take a look at https://serverfault.com/questions/317115/jenkins-ci-cannot-allocate-memory

Additionally, framework optimizations would help, e.g. PR https://github.com/red-hat-storage/ocs-ci/pull/6809. We need to replace shlex.split(cmd): I made 1000 calls of ceph_health_check with and without shlex.split(cmd), and the peaks of RSS and VSZ usage were reduced significantly.

@petr-balogh made a good start with the memory-leak tracking PR https://github.com/red-hat-storage/ocs-ci/pull/5622; I want to proceed with it to see how the framework performs against the tests. It would be nice to have graphs per process.
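For reference, a minimal sketch of the kind of comparison described above (a hypothetical harness, not the actual measurement): run the same command N times, either re-splitting the command string with shlex.split() on every call or reusing a pre-split argument list, and report the peak RSS. Each variant should run in a separate interpreter invocation, since ru_maxrss is a process-lifetime maximum:

import resource
import shlex
import subprocess
import sys


def run_many(cmd, iterations, use_shlex):
    pre_split = cmd.split()
    for _ in range(iterations):
        args = shlex.split(cmd) if use_shlex else pre_split
        subprocess.run(args, capture_output=True, check=False)


if __name__ == "__main__":
    variant = sys.argv[1] if len(sys.argv) > 1 else "shlex"
    run_many("oc version --client", iterations=1000, use_shlex=(variant == "shlex"))
    # ru_maxrss is reported in kilobytes on Linux.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{variant}: peak RSS {peak} KB")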

shyRozen commented 1 year ago

Hi Daniel, thanks for your analysis. @petr-balogh can we increase the swap memory on the agents to see if it solves the problem? We are hitting it in lots of UI tests as well.


am-agrawa commented 1 year ago

@DanielOsypenko Could you please check whether this issue is resolved after merging PR #6809? If not, do we have any ideas to address it?

DanielOsypenko commented 1 year ago

> @DanielOsypenko Could you please check whether this issue is resolved after merging PR #6809? If not, do we have any ideas to address it?

The add capacity tests with the post-upgrade configuration, and the rest of the tests from the acceptance suite, do not show the issue. The acceptance suite has been performing well without the cannot-allocate-memory error for almost 2 months. The issue can be closed for now.