Closed prsurve closed 1 year ago
From what I've observed, it happens on upgrade runs, where this test also ran before the upgrade: test_add_capacity_pre_upgrade.
So I think we can suspect an issue in the add_capacity operation when it happens twice - once pre-upgrade and a second time post-upgrade in the tier execution.
I checked some past executions of the job you mentioned, @Pratik Surve. Previously we ran only the upgrade_ocp marker, but some time back we re-introduced the pre- and post-upgrade tests for the OCP upgrade job,
and since then I see the add capacity test failing with "cannot allocate memory":
https://url.corp.redhat.com/c0a49d4
Here is the OCS upgrade job only, and I see such failures in the May executions as well.
@petr-balogh What should be the fix? Shall we remove the pre_upgrade run of add capacity?
@ebenahar can you please take a look?
This runs also for the ODF pre-upgrade, not only OCP. Do we see this only for OCP upgrade runs? Also, we are now running some jobs with both OCP and ODF upgrades in one run. It will not be an easy task, but it's worth investing some time to find the root cause: what is consuming that much memory and what is causing such issues.
Just to make sure I get it - since we clubbed both ODF and OCP upgrades into the same run, we now do add capacity pre and post both the OCP and ODF upgrades, and also during tier1? That sums up to 5 add capacity operations. Please correct me if this is not the case. If it is, I think we had better disable the add capacity that is done pre and post the OCP upgrade.
My understanding is that we do this twice: once as a pre-upgrade test, and a second time when we run tier1 after the upgrade.
@petr-balogh @prsurve how many ODF devicesets do we have at the end of the execution?
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6949/289361/289647/289648/log
Run ID: 1669941053
Test Case: test_add_capacity
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI KMS VAULT V1 1AZ RHCOS VSAN 3M 3W tier1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6276/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-078vukv11cs33-t1/j-078vukv11cs33-t1_20221201T233022/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6801/281551/281558/281562/log
Run ID: 1669747460
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: AWS IPI FIPS ENCRYPTION 3AZ RHCOS 3M 3W 3I tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6182/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002aife3c333-uba/j-002aife3c333-uba_20221129T050922/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6794/281166/281450/281451/log
Run ID: 1669738462
Test Case: test_add_capacity
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI KMS VAULT V1 1AZ RHCOS VSAN 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6207/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003vukv11cs33-uba/j-003vukv11cs33-uba_20221129T113107/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6790/280850/280857/280860/log
Run ID: 1669734217
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.13
Job name: AWS IPI 3AZ RHCOS 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6183/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-157ai3c33-uo/j-157ai3c33-uo_20221129T063027/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6773/279533/279817/279819/log
Run ID: 1669717720
Test Case: test_add_capacity_lso
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: VSPHERE6 UPI ENCRYPTION 1AZ RHCOS VSAN LSO VMDK 3M 3W tier1
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6203/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-092vue1cslv33-t1/j-092vue1cslv33-t1_20221129T092854/logs/
Error Message: failed on teardown with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6765/279195/279202/279205/log
Run ID: 1669705224
Test Case: test_ceph_default_values_check
ODF Build: 4.12.0-120
OCP Version: 4.12
Job name: AWS IPI 3AZ RHCOS 3M 3W tier1 or tier_after_upgrade post upgrade
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6178/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-011ai3c33-uba/j-011ai3c33-uba_20221128T175819/logs/
Error Message: failed on setup with "OSError: [Errno 12] Cannot allocate memory"
This issue was encountered once again. Run details:
URL: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#OCS/launches/362/6545/269251/269425/269426/log
Run ID: 1669185621
Test Case: test_create_storageclass_rbd
ODF Build: 4.12.0-114
OCP Version: 4.12
Job name: VSPHERE6 UPI Disconnected 1AZ RHCOS VSAN 3M 3W tier2
Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/6119/
Logs URL: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-001vud1cs33-t2/j-001vud1cs33-t2_20221123T044902/logs/
Error Message: OSError: [Errno 12] Cannot allocate memory
It subsequently failed during wait_for_rebalance(), but with the same error: OSError: [Errno 12] Cannot allocate memory.
Reason: the cannot-allocate-memory issue occurs when calling:
```python
def get_rebalance_status(self):
    """
    This function gets the rebalance status

    Returns:
        bool: True if rebalance is completed, False otherwise

    """
    ceph_pod = pod.get_ceph_tools_pod()
    ceph_status = ceph_pod.exec_ceph_cmd(ceph_cmd="ceph status")
    ceph_health = ceph_pod.exec_ceph_cmd(ceph_cmd="ceph health")
    total_pg_count = ceph_status["pgmap"]["num_pgs"]
    pg_states = ceph_status["pgmap"]["pgs_by_state"]
    logger.info(ceph_health)
    logger.info(pg_states)
    for states in pg_states:
        return (
            states["state_name"] == "active+clean"
            and states["count"] == total_pg_count
        )
```
This command consumes too much memory. Each output of the procedure takes ~2 MB - more than 30k lines of text - and the command is invoked too aggressively: 53 times in 10 minutes.
The most memory-consuming part is oc -n openshift-storage get Pod -n openshift-storage -o yaml
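One way to reduce how aggressively such an expensive command is invoked is to poll with an exponential backoff instead of a fixed short interval. A minimal sketch - the helper name and parameters are hypothetical, not the ocs-ci API:

```python
import time


def wait_for(condition, timeout=600, initial_interval=5, max_interval=60):
    """Poll ``condition()`` until it returns True or ``timeout`` seconds pass.

    The sleep interval doubles after each failed attempt (capped at
    ``max_interval``), so an expensive external command such as
    ``ceph status`` runs far fewer times than with a fixed 10-second poll.
    """
    deadline = time.monotonic() + timeout
    interval = initial_interval
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
        interval = min(interval * 2, max_interval)
    return False
```

With a 5 s initial interval and a 60 s cap, a 10-minute wait issues roughly a dozen calls instead of 53.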
@ebenahar @petr-balogh
@DanielOsypenko would increasing the Jenkins agent memory resolve this issue? @shyRozen FYI, you also had the idea of increasing the swap on the agents.
@ebenahar both things may help. Please take a look at https://serverfault.com/questions/317115/jenkins-ci-cannot-allocate-memory.
Additionally, framework optimizations can help - see PR https://github.com/red-hat-storage/ocs-ci/pull/6809: we need to replace shlex.split(cmd). I made 1000 calls of ceph_health_check with and without shlex.split(cmd), and the peaks of RSS and VSZ usage were reduced significantly.
@petr-balogh made a good start with tracking the memory leak in PR https://github.com/red-hat-storage/ocs-ci/pull/5622; I want to proceed with it to see framework performance against tests. It would be nice to have graphs per process.
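For experiments like the 1000-call comparison above, the standard library's resource module can report the process's peak RSS without extra dependencies. A rough sketch - the helper name, the iteration count, and the command string are illustrative, not taken from ocs-ci:

```python
import resource
import shlex
import sys


def peak_rss_kib():
    """Peak resident set size of this process, in KiB.

    ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss // 1024 if sys.platform == "darwin" else rss


before = peak_rss_kib()
# Re-parse the same command string on every call, as each run_cmd() did:
for _ in range(1000):
    shlex.split("oc -n openshift-storage get pod -o yaml")
after = peak_rss_kib()
print(f"peak RSS grew by {after - before} KiB")
```

Measuring before and after each variant of the loop gives comparable peak numbers for the with/without-shlex experiment.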
Hi Daniel, thanks for your analysis. @petr-balogh can we increase the swap memory on the agents to see if it solves the problem? We are hitting it in lots of UI tests also.
@DanielOsypenko Could you please check if this issue is resolved after merging PR #6809? If not, do we have any ideas to address this issue?
Add capacity tests with the post-upgrade configuration, and the rest of the tests from the acceptance suite, do not show the issue. The acceptance suite has performed well without the cannot allocate memory issue for almost 2 months. The issue can be closed for now.
```python
def finalizer():
    if not skipped:
        try:
            teardown = config.RUN["cli_params"]["teardown"]
            skip_ocs_deployment = config.ENV_DATA["skip_ocs_deployment"]
            ceph_cluster_installed = config.RUN.get("cephcluster")
            if not (
                teardown
                or skip_ocs_deployment
                or mcg_only_deployment
                or not ceph_cluster_installed
            ):
```
```
tests/conftest.py:1422:
ocs_ci/utility/utils.py:2059: in ceph_health_check_base
    run_cmd(
ocs_ci/utility/utils.py:473: in run_cmd
    completed_process = exec_cmd(
ocs_ci/utility/utils.py:606: in exec_cmd
    completed_process = subprocess.run(
/usr/lib64/python3.8/subprocess.py:493: in run
    with Popen(*popenargs, **kwargs) as process:
/usr/lib64/python3.8/subprocess.py:858: in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,

self = <subprocess.Popen object at 0x7fc5b015a5b0>
args = ['oc', 'wait', '--for', 'condition=ready', 'pod', '-l', ...]
executable = b'oc', preexec_fn = None, close_fds = True, pass_fds = ()
cwd = None, env = None, startupinfo = None, creationflags = 0, shell = False
p2cread = 20, p2cwrite = 21, c2pread = 24, c2pwrite = 25, errread = 26
errwrite = 27, restore_signals = True, start_new_session = False

/usr/lib64/python3.8/subprocess.py:1639: OSError
```
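For context on the traceback above: subprocess.Popen forks the Python process, and fork() fails with ENOMEM (errno 12) when the kernel cannot commit a copy of the parent's large address space - which is why adding swap or shrinking the test process's footprint helps. A hedged sketch of surfacing that case explicitly; the wrapper name is hypothetical and not part of ocs-ci:

```python
import errno
import subprocess


def run_checked(cmd):
    """Run ``cmd`` (a pre-split argument list), translating a fork-time
    ENOMEM into a clearer error instead of a bare OSError."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            # fork() could not duplicate this process's address space:
            # free memory in the test process, or add RAM/swap on the agent.
            raise RuntimeError("agent out of memory while spawning command") from exc
        raise
```

Usage would look like run_checked(["oc", "wait", "--for", "condition=ready", "pod", "-l", ...]), matching the args list shown in the traceback.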
Test result: https://url.corp.redhat.com/46968f2
Logs: https://url.corp.redhat.com/79e857c