Open mbukatov opened 3 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.
I"m running into this every CI run, should not be flagged as stale, unless ~io_in_bf~ is dropped.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
@Akarsha-rai @mbukatov as of now, the logging of IO in BG is quite minimal. Should we still consider disabling it for the general use case?
I still believe it's worth converting it to a Deployment or a Job, running it only in selected test runs, and giving tests a way to disable it temporarily if necessary.
This would at least address the interference issue; I usually see a few cases of tests failing on stopping IO in BG.
Converting it to a Job would indeed improve stability. We can keep this issue open to track that conversion.
As for pausing the IO, the option exists here - https://github.com/red-hat-storage/ocs-ci/blob/49d8a77774cc8568714d7502051e846f9ef247fe/tests/conftest.py#L1712
Here is a reference for its usage - https://github.com/red-hat-storage/ocs-ci/blob/3b04171d5cfaea8921419164f96517a098201354/tests/ecosystem/upgrade/test_resources.py#L51
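For illustration only, here is a minimal sketch of the "pause background IO for the duration of a test" idea, using a plain threading.Event and a pytest fixture. The names (`pause_io_in_bg`, `_PAUSE_IO`) are hypothetical and this is not the actual ocs-ci fixture linked above:

```python
# Hypothetical sketch -- not the ocs-ci API. It only illustrates the idea of
# letting a test temporarily pause a background IO loop via a shared flag.
import contextlib
import threading

import pytest

# Flag that a (hypothetical) background IO loop checks between iterations;
# while it is set, the loop skips submitting new IO.
_PAUSE_IO = threading.Event()


@pytest.fixture
def pause_io_in_bg():
    """Yield a context manager that pauses background IO temporarily."""

    @contextlib.contextmanager
    def _pause():
        _PAUSE_IO.set()          # ask the IO loop to idle
        try:
            yield
        finally:
            _PAUSE_IO.clear()    # resume background IO after the test block

    return _pause


# Usage inside a test that cannot tolerate background IO interference:
# def test_something_sensitive(pause_io_in_bg):
#     with pause_io_in_bg():
#         ...run the measurement that needs an idle cluster...
```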
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
taking this for implementation
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.
test_monitoring_reporting_ok_when_idle
All the failed tests were running on an external mode cluster.
Up to 10% of tests running on ODF 4.13 fail with:
Message: failed on setup with "Exception: io_in_bf failed to stop after 600 timeout, bug in io_in_bf (of ocs-ci) prevents execution of test cases which uses this fixture, rerun the affected test cases in a dedicated run and consider ocs-ci fix" Type: None Text: measurement_dir = '/tmp/pytest-of-jenkins/pytest-1/measurement_results' threading_lock = <unlocked _thread.lock object at 0x7fb3ebd763c0>
Reason: we need to turn on Ceph metrics on the external cluster before the test run; see PR #8457.
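For reference, a hedged sketch of what "turn on Ceph metrics" can look like when done by hand against the external cluster, assuming the `ceph` CLI is reachable (e.g. from a toolbox pod or cephadm shell); this is not necessarily what PR #8457 does:

```python
# Sketch only: enable the Ceph mgr prometheus module manually on the external
# cluster. Assumes shell access to the `ceph` CLI; not the ocs-ci/PR approach.
import subprocess


def enable_ceph_metrics():
    # Turn on the prometheus exporter module of the Ceph manager.
    subprocess.run(["ceph", "mgr", "module", "enable", "prometheus"], check=True)
    # Print the module list so the enabled state can be verified.
    out = subprocess.run(
        ["ceph", "mgr", "module", "ls"], check=True, capture_output=True, text=True
    )
    print(out.stdout)


if __name__ == "__main__":
    enable_ceph_metrics()
```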
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
fio pod fails to stop on setup of test_monitoring_reporting_ok_when_idle -> https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/25003/1212518/1212581/log
fio pod fails to stop on setup of test_monitoring_reporting_ok_when_idle -> https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/24984/1211235/1211298/log
As in the previous 2 comments, the failure to stop IO running in the background occurs with a post-upgrade deployment: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/678/24978/1211090/1211153/log
A test that uses the workload_idle fixture in the pre-upgrade and post-upgrade scenarios will always fail, so this test fails twice during one pipeline execution. The reason: we use it along with fio running in the background, which is set by the Jenkins job.
I see two errors/bugs that may not be related and that happen during one execution:
1. The workload_idle fixture is not able to stop IO running in the background, probably because the load is reduced and increased multiple times while we are reaching the target load (see the sketch after this list).
2. The cluster_workload fixture fails to make the fio pod run. The existing notification suggests rerunning this test separately.
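A generic illustration of point 1 (not ocs-ci code, all names hypothetical): a background IO worker that re-checks a stop event between short bursts, so a stop call with a bounded join returns well inside the timeout. If the worker only checks the flag after one long-running burst, the 600 s stop timeout in the exception above is easy to hit.

```python
# Sketch of a stoppable background IO worker; the IO burst is a placeholder.
import threading
import time


class BackgroundIO:
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Keep each burst short so the stop event is re-checked frequently.
        while not self._stop.is_set():
            self._do_short_io_burst()

    def _do_short_io_burst(self):
        time.sleep(1)  # placeholder for one bounded chunk of IO

    def start(self):
        self._thread.start()

    def stop(self, timeout=600):
        self._stop.set()
        self._thread.join(timeout)
        if self._thread.is_alive():
            raise TimeoutError(f"background IO did not stop within {timeout}s")
```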
My observation: I agree with the issue creator that we need to stick with a fio container along with a ConfigMap to set up and modify the load.
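As a rough sketch of that direction (a fio container configured through a ConfigMap and run as a Job), using the upstream kubernetes Python client. The namespace, image, PVC name, and fio options below are illustrative assumptions, not the ocs-ci implementation:

```python
# Sketch only: create a ConfigMap holding a fio job file and a Job that mounts
# it, so the load can be tuned by editing the ConfigMap.
from kubernetes import client, config

FIO_JOBFILE = """
[global]
rw=randrw
bs=4k
runtime=3600
time_based=1
[workload]
directory=/data
size=1G
"""


def create_fio_load(namespace="fio-load"):
    config.load_kube_config()
    core = client.CoreV1Api()
    batch = client.BatchV1Api()

    # ConfigMap with the fio job file.
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="fio-config"),
        data={"workload.fio": FIO_JOBFILE},
    )
    core.create_namespaced_config_map(namespace=namespace, body=cm)

    container = client.V1Container(
        name="fio",
        image="quay.io/example/fio:latest",  # illustrative image
        command=["fio", "/etc/fio/workload.fio"],
        volume_mounts=[
            client.V1VolumeMount(name="fio-config", mount_path="/etc/fio"),
            client.V1VolumeMount(name="data", mount_path="/data"),
        ],
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        volumes=[
            client.V1Volume(
                name="fio-config",
                config_map=client.V1ConfigMapVolumeSource(name="fio-config"),
            ),
            client.V1Volume(
                name="data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="fio-pvc"  # assumed pre-created OCS-backed PVC
                ),
            ),
        ],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="io-in-bg"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
        ),
    )
    batch.create_namespaced_job(namespace=namespace, body=job)
```

Deleting the Job (with foreground propagation) would then be the single, well-defined way to stop the background IO, instead of signalling an in-process thread.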
[2024-09-04T07:21:15.320Z] 03:21:14 - MainThread - tests.conftest - INFO - Start running IO in the background. The amount of IO that will be written is going to be determined by the cluster capabilities according to its limit
[2024-09-04T09:19:44.590Z] tests/functional/upgrade/test_upgrade_ocp.py::TestUpgradeOCP::test_upgrade_ocp
[2024-09-04T09:19:44.590Z] -------------------------------- live log setup --------------------------------
...
[2024-09-04T12:56:28.245Z] tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::test_monitoring_reporting_ok_when_idle
[2024-09-04T12:56:28.245Z] -------------------------------- live log setup --------------------------------
...
[2024-09-04T13:40:06.101Z] E Exception: io_in_bf failed to stop after 600 timeout, bug in io_in_bf (of ocs-ci) prevents execution of test cases which uses this fixture, rerun the affected test cases in a dedicated run and consider ocs-ci fix
[2024-09-04T13:40:06.101Z] tests/functional/monitoring/conftest.py:856: Exception
...
[2024-09-04T14:27:31.032Z] 10:27:30 - MainThread - tests.conftest - INFO - Start running IO in the background. The amount of IO that will be written is going to be determined by the cluster capabilities according to its limit
...
[2024-09-04T14:27:52.288Z] 10:27:52 - MainThread - ocs_ci.ocs.ocp - INFO - status of at column STATUS - item(s) were ['ContainerCreating'], but we were waiting for all of them to be Running
[2024-09-04T14:29:30.196Z] 10:29:29 - MainThread - tests.conftest - ERROR - Cluster load might not work correctly during this run, because it failed with an exception: list index out of range
[2024-09-04T17:23:25.597Z] tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::test_monitoring_reporting_ok_when_idle
[2024-09-04T17:23:25.597Z] -------------------------------- live log setup --------------------------------
@nehaberry, @ebenahar, @hnallurv can I allocate time to resolve this after 4.18? The scale of the job should be comparable to an M-size feature. It affects all regression runs with io-on-background, a large portion of blue_squad tests, and more.
When reimplementing this, we should pay attention to:
- io-on-background
- considering https://github.com/kube-burner/kube-burner
test_cephfs_capacity_workload_alerts: fill-up stuck at 78.9%, test executed 11 hours : 31 minutes : 12 seconds:
[2024-10-18T16:24:45.205Z] 12:24:45 - MainThread - ocs_ci.ocs.fiojob - ERROR - Job fio failed to write 405 Gi data on OCS backed volume in expected time 41472.0 seconds. If the fio pod were still runing (see 'last actual status was' in some previous log message), this is caused either by severe product performance regression or by a misconfiguration of the clusterr, ping infra team.
Another tool that is more robust than dd is "stress-ng"; it works together with the benchmarking tool "sysbench":
stress-ng --hdd 1 --hdd-bytes 10G --timeout 60s
sysbench fileio --file-total-size=10G --file-test-mode=seqwr prepare
sysbench fileio --file-total-size=10G --file-test-mode=seqwr run
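A small sketch of driving those commands from Python with a hard timeout, so a stuck fill-up is killed instead of hanging the run for hours. The timeout values are illustrative:

```python
# Sketch: run the stress-ng / sysbench commands above with a bounded runtime.
import subprocess


def run_load(cmd, timeout):
    """Run one load command; kill it if it exceeds `timeout` seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run() already killed the child on timeout; just report it.
        raise RuntimeError(f"load command {cmd!r} exceeded {timeout}s")


# stress-ng: one hdd worker writing 10G, bounded by its own --timeout as well
run_load(["stress-ng", "--hdd", "1", "--hdd-bytes", "10G", "--timeout", "60s"], timeout=120)

# sysbench: prepare the files first, then run the sequential-write test
run_load(["sysbench", "fileio", "--file-total-size=10G", "--file-test-mode=seqwr", "prepare"], timeout=600)
run_load(["sysbench", "fileio", "--file-total-size=10G", "--file-test-mode=seqwr", "run"], timeout=600)
```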
Issues with current io_in_bg design:
Proposed solution:
To be considered: