red-hat-storage / ocs-ci

https://ocs-ci.readthedocs.io/en/latest/

stabilize test_stop_start_node_validate_topology #9519

Open DanielOsypenko opened 5 months ago

DanielOsypenko commented 5 months ago

These lines produce the failure:

```python
nodes.stop_nodes(nodes=[random_node_under_test], force=True)
api = prometheus.PrometheusAPI(threading_lock=threading_lock)
```

We need to allow more time after stopping the node, or exclude the node where Prometheus resides from being stopped.

tests/cross_functional/ui/test_odf_topology.py:278

https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19188/932102/932105/log?logParams=history%3D932105%26page.page%3D1
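One possible fix, as a minimal sketch: exclude the node hosting the Prometheus pod when choosing which node to stop. `pick_node_to_stop()` and node objects with a `.name` attribute are illustrative assumptions here, not existing ocs-ci helpers.

```python
# Sketch only: caller is assumed to already know which node hosts the
# Prometheus pod (prometheus_node_name).
import random


def pick_node_to_stop(worker_nodes, prometheus_node_name):
    # Exclude the Prometheus node so the PrometheusAPI calls issued
    # right after nodes.stop_nodes() still have a live endpoint.
    candidates = [n for n in worker_nodes if n.name != prometheus_node_name]
    return random.choice(candidates)
```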

DanielOsypenko commented 5 months ago

Another issue: before tearDown, wait for the Mon pods to finish redeploying after the node comes back online. tests/cross_functional/ui/test_odf_topology.py:52

```python
def ceph_health_check_base(namespace=None):
    """
    Exec ceph health cmd on tools pod to determine health of cluster.

    Args:
        namespace (str): Namespace of OCS
            (default: config.ENV_DATA['cluster_namespace'])

    Raises:
        CephHealthException: If the ceph health returned is not HEALTH_OK
        CommandFailed: If the command to retrieve the tools pod name or the
            command to get ceph health returns a non-zero exit code

    Returns:
        boolean: True if HEALTH_OK

    """
    health = run_ceph_health_cmd(namespace)

    if health.strip() == "HEALTH_OK":
        log.info("Ceph cluster health is HEALTH_OK.")
        return True
    else:
        raise CephHealthException(
            f"Ceph cluster health is not OK. Health: {health}"
        )
```

```
E   ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum a,b; 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 417/1251 objects degraded (33.333%), 79 pgs degraded, 113 pgs undersized
```
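A minimal sketch of the kind of wait that could run before tearDown. The `get_mon_pods` callable and the `.status` accessor are illustrative assumptions; ocs-ci ships its own pod helpers that would take their place.

```python
# Sketch only: poll until the expected number of mon pods are Running
# again, so the ceph health check in tearDown does not hit HEALTH_WARN.
import time


def wait_for_mons_redeployed(get_mon_pods, expected=3, timeout=600, interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        mons = get_mon_pods()
        if len(mons) == expected and all(p.status == "Running" for p in mons):
            return True
        time.sleep(interval)
    raise TimeoutError(f"mon pods not redeployed within {timeout}s")
```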

DanielOsypenko commented 5 months ago

Another issue, which has happened at least 4 times:

The "alerts_sidebar_tab" element (`//span[normalize-space()='Alerts']`) cannot be found; the open_alerts_tab() function needs added redundancy. Screenshot below. https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/19256/935160/935163/log

[screenshot]
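One way the redundancy could look, as a sketch using plain Selenium explicit waits; the `driver` handle and the timeout value are assumptions, not the actual ocs-ci BaseUI wiring.

```python
# Sketch only: wait until the tab is clickable instead of failing on a
# single immediate find_element() lookup.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ALERTS_TAB = (By.XPATH, "//span[normalize-space()='Alerts']")


def open_alerts_tab(driver, timeout=60):
    tab = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable(ALERTS_TAB)
    )
    tab.click()
```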

DanielOsypenko commented 5 months ago

With lower resource requirements, the ODF console fails after the node is stopped.

[screenshot]

```python
nodes.stop_nodes(nodes=[random_node_under_test], force=True)

api = prometheus.PrometheusAPI(threading_lock=threading_lock)
logger.info(f"Verifying whether {constants.ALERT_NODEDOWN} has been triggered")
alerts = api.wait_for_alert(name=constants.ALERT_NODEDOWN, state="firing")
test_checks = dict()
test_checks["prometheus_CephNodeDown_alert_fired"] = len(alerts) > 0
if not test_checks["prometheus_CephNodeDown_alert_fired"]:
    logger.error(
        f"Prometheus alert '{constants.ALERT_NODEDOWN}' is not triggered"
    )
else:
    logger.info(f"alerts found: {str(alerts)}")

min_wait_for_update = 3
logger.info(f"wait {min_wait_for_update}min to get UI updated with alert")
time.sleep(min_wait_for_update * 60)

topology_tab = PageNavigator().nav_odf_default_page().nav_topology_tab()
```
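A bounded retry around the final navigation could replace the fixed 3-minute sleep. This is a sketch only: `PageNavigator` and the `nav_*` calls come from the snippet above, and the broad `except` is an assumption about what the navigation raises.

```python
# Sketch only: keep retrying the navigation until the UI has caught up
# with the alert, instead of relying on a single fixed sleep.
import time


def open_topology_tab_with_retry(timeout=300, interval=30):
    deadline = time.time() + timeout
    while True:
        try:
            return PageNavigator().nav_odf_default_page().nav_topology_tab()
        except Exception:
            if time.time() >= deadline:
                raise
            time.sleep(interval)
```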


https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/35243/ (job restarted)

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

DanielOsypenko commented 1 month ago

Need to fix the order of the teardown functions: they must run in LIFO order.
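For reference, pytest runs finalizers registered through `request.addfinalizer` in LIFO order: the last one registered runs first. A sketch with hypothetical finalizer names:

```python
import pytest


@pytest.fixture
def node_down_teardown(request):
    # Hypothetical finalizers for illustration.
    def check_ceph_health():
        ...

    def start_stopped_node():
        ...

    # Registered first -> runs LAST.
    request.addfinalizer(check_ceph_health)
    # Registered last -> runs FIRST, so the node is back online before
    # the health check executes.
    request.addfinalizer(start_stopped_node)
```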

DanielOsypenko commented 1 day ago

Backport to 4.15.