drop and reimplement io_in_bg

mbukatov commented 3 years ago

Issues with current io_in_bg design:

workload is running in a thread of ocs-ci pytest run
log pollution (log messages of io_in_bg ends up scattered in logs of test cases)
prone to errors
increased cost of ci analysis

Proposed solution:

don't use threads to run the background io
remove all bg io code from ocs-ci
implement the bg io via fio container (with configuration provided via a config map) k8s deployment, which could be scaled up and down to control it's run

To be considered:

only test cases which will indicate that they can handle background io would be considered for bg io run

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

mbukatov commented 2 years ago

I"m running into this every CI run, should not be flagged as stale, unless ~io_in_bf~ is dropped.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

github-actions[bot] commented 2 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] commented 2 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] commented 2 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

ebenahar commented 2 years ago

@Akarsha-rai @mbukatov as of now, the logging of IO in BG is quite minimal. Should we still consider disabling it for the general use case?

mbukatov commented 2 years ago

I still believe it's worth to convert it to a deployment or a job, and then run it in selected test runs only while giving the tests a way to disable it temporarily if necessary.

This would address at least the interference issue, I usually see few cases of tests failed on stopping IO in BG.

ebenahar commented 2 years ago

Converting to a job would indeed improve the stabilization. We can keep this issue to track such conversion.

As for the option to pause the IO exists here - https://github.com/red-hat-storage/ocs-ci/blob/49d8a77774cc8568714d7502051e846f9ef247fe/tests/conftest.py#L1712

Here is a reference for its usage - https://github.com/red-hat-storage/ocs-ci/blob/3b04171d5cfaea8921419164f96517a098201354/tests/ecosystem/upgrade/test_resources.py#L51

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

DanielOsypenko commented 1 year ago

https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/493/12999/595182/595183/595185/log?logParams=history%3D595185%26page.page%3D1

DanielOsypenko commented 1 year ago

taking this for implementation

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

DanielOsypenko commented 10 months ago

test_monitoring_reporting_ok_when_idle

ALL THE FAILED TESTS WERE RUNNING ON EXTERNAL MODE CLUSTER

up to 10% of tests running on ODF 4.13 Fail with:

Message: failed on setup with "Exception: io_in_bf failed to stop after 600 timeout, bug in io_in_bf (of ocs-ci) prevents execution of test cases which uses this fixture, rerun the affected test cases in a dedicated run and consider ocs-ci fix" Type: None Text: measurement_dir = '/tmp/pytest-of-jenkins/pytest-1/measurement_results' threading_lock = <unlocked _thread.lock object at 0x7fb3ebd763c0>

reason - we need to turn On ceph metrics on external clsuter before test run PR #8457

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

github-actions[bot] commented 6 months ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

DanielOsypenko commented 2 months ago

fio pod fail to stop on setUp of test_monitoring_reporting_ok_when_idle-> https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/25003/1212518/1212581/log

DanielOsypenko commented 2 months ago

fio pod fail to stop on setUp of test_monitoring_reporting_ok_when_idle-> https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/24984/1211235/1211298/log

DanielOsypenko commented 2 months ago

as a previous 2 comments, the failure to stop IO running on background occurs with post-upgrade deployment https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/24978/1211090/1211153/log

DanielOsypenko commented 2 months ago

same as 3 prev failures: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/13170/consoleText

DanielOsypenko commented 2 months ago

test that uses the fixture workload_idle in the pre-upgrade, post-upgrade scenarios will always fail. So this test fails 2 times during one pipeline execution. The reason, we are using it along with Fio running in the background which is being set by Jenkins job.

I see two ERRORS/BUGS, that may be not related and happen during one execution

The workload_idle fixture is not able to stop io running in a background, probably because it is being multiple time reduced and increased when we are reaching the load.
The second time when we start cluster_workload fixture it fails to make the fio pod run.

The existing notification suggests rerunning this test separately.

my observation: I agree with issue creator that we need to stick with fio container along with ConfigMap to setup and modify the load


[2024-09-04T07:21:15.320Z] 03:21:14 - MainThread - tests.conftest - [32mINFO[0m  - Start running IO in the background. The amount of IO that will be written is going to be determined by the cluster capabilities according to its limit

[2024-09-04T09:19:44.590Z] tests/functional/upgrade/test_upgrade_ocp.py::TestUpgradeOCP::test_upgrade_ocp 
[2024-09-04T09:19:44.590Z] [1m-------------------------------- live log setup --------------------------------[0m
...

[2024-09-04T12:56:28.245Z] tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::test_monitoring_reporting_ok_when_idle 
[2024-09-04T12:56:28.245Z] [1m-------------------------------- live log setup --------------------------------[0m
...

[2024-09-04T13:40:06.101Z] [1m[31mE               Exception: io_in_bf failed to stop after 600 timeout, bug in io_in_bf (of ocs-ci) prevents execution of test cases which uses this fixture, rerun the affected test cases in a dedicated run and consider ocs-ci fix[0m[2024-09-04T13:40:06.101Z] 
[2024-09-04T13:40:06.101Z] [1m[31mtests/functional/monitoring/conftest.py[0m:856: Exception
...

[2024-09-04T14:27:31.032Z] 10:27:30 - MainThread - tests.conftest - [32mINFO[0m  - Start running IO in the background. The amount of IO that will be written is going to be determined by the cluster capabilities according to its limit
..

[2024-09-04T14:27:52.288Z] 10:27:52 - MainThread - ocs_ci.ocs.ocp - [32mINFO[0m  - status of  at column STATUS - item(s) were ['ContainerCreating'], but we were waiting for all of them to be Running

[2024-09-04T14:29:30.196Z] 10:29:29 - MainThread - tests.conftest - [31m[1mERROR[0m  - Cluster load might not work correctly during this run, because it failed with an exception: list index out of range

[2024-09-04T17:23:25.597Z] tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::test_monitoring_reporting_ok_when_idle 
[2024-09-04T17:23:25.597Z] [1m-------------------------------- live log setup --------------------------------[0m

DanielOsypenko commented 1 month ago

update: https://url.corp.redhat.com/f0285b2

DanielOsypenko commented 1 month ago

@nehaberry , @ebenahar, @hnallurv can I allocate time to resolve this after 4.18? The scale of job should be comparable to M-size feature. It affects all regression runs with io-on-background, affects a large portion of blue_squad tests and more.

Reimplementing this we should pay attention to:

upgrade scenarios with io-on-background
multicluster runs
Provider mode Deployments and test runs (define strategy and customization, e.g. to run on Clients first)
test execution time. Workload tests take most of the execution time, stretching tier1 to 1 day or more on large clusters
should be stable
should be informative reporting the utilization stats

DanielOsypenko commented 1 month ago

to consider https://github.com/kube-burner/kube-burner

DanielOsypenko commented 1 month ago

test_cephfs_capacity_workload_alerts: fill-up stuck on 78.9%, test executed 11 hours : 31 minutes : 12 seconds,

[2024-10-18T16:24:45.205Z] 12:24:45 - MainThread - ocs_ci.ocs.fiojob - [31m[1mERROR[0m  - Job fio failed to write 405 Gi data on OCS backed volume in expected time 41472.0 seconds. If the fio pod were still runing (see 'last actual status was' in some previous log message), this is caused either by severe product performance regression or by a misconfiguration of the clusterr, ping infra team.

https://url.corp.redhat.com/9a72ab1

DanielOsypenko commented 5 days ago

another tool that is more robust than dd - "stress-ng" this works together with benchmarking tool "sysbench"

stress-ng --hdd 1 --hdd-bytes 10G --timeout 60s

sysbench fileio --file-total-size=10G --file-test-mode=seqwr prepare
sysbench fileio --file-total-size=10G --file-test-mode=seqwr run

red-hat-storage / ocs-ci

drop and reimplement io_in_bg #5177