kruize / autotune

Autonomous Performance Tuning for Kubernetes!
Apache License 2.0

Kruize remote monitoring functional failures due to 502 from listRecommendations #1281

Open chandrams opened 1 week ago

chandrams commented 1 week ago

Describe the bug

Kruize remote monitoring functional tests are failing with different issues on OpenShift with the latest Kruize 0.0.24_mvp image.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_release_tests/139/ - Kruize scalelab
https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/ - Kruize scalelab

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_profile_notifications_cpu_zero_test_1_True_update_metrics0_323002_CPU_usage_is_zero__No_CPU_Recommendations_can_be_generated_/

      data = response.json()

test_list_recommendations.py:2867: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.6/site-packages/requests/models.py:897: in json
    return complexjson.loads(self.text, **kwargs)
/usr/lib64/python3.6/json/__init__.py:354: in loads
    return _default_decoder.decode(s)
/usr/lib64/python3.6/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <json.decoder.JSONDecoder object at 0x7f45e9154160>
s = '<html>\r\n  <head>\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n\r\n    <style type...t least one pod running.\r\n          </li>\r\n        </ul>\r\n      </div>\r\n    </div>\r\n  </body>\r\n</html>\r\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
        a JSON document) and return a 2-tuple of the Python
        representation and the index in ``s`` where the document ended.

        This can be used to decode a JSON document from a string that may
        have extraneous data at the end.

        """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/usr/lib64/python3.6/json/decoder.py:357: JSONDecodeError
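The traceback above comes from calling response.json() on an HTML error page, so the real problem (the HTTP status) is hidden behind a JSONDecodeError. A minimal sketch of a guard that reports the status code and the start of the body instead; `parse_json_response` is a hypothetical helper, not something that exists in the Kruize test suite, and it only assumes the response object has `status_code` and `text` attributes like a `requests.Response`:

```python
import json


def parse_json_response(response):
    """Decode the body as JSON, or fail with the HTTP status and a body snippet.

    Hypothetical helper: `response` is any object with `status_code` and
    `text` attributes (for example a `requests.Response`).
    """
    try:
        return json.loads(response.text)
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        snippet = response.text[:200]
        raise AssertionError(
            f"HTTP {response.status_code}: body is not JSON: {snippet!r}"
        ) from err
```

With something like this in place of the bare response.json() call, a failure of this kind would report the 502 and the gateway page directly in the assertion message.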

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

                 response = list_recommendations(experiment_name)
>                   assert response.status_code == SUCCESS_200_STATUS_CODE
E                   assert 502 == 200
E                    +  where 502 = <Response [502]>.status_code

test_list_recommendations.py:2740: AssertionError
chandrams commented 1 week ago

I commented out test_list_recommendations_cpu_mem_optimised, which failed with the 502 error, and ran the sanity test suite manually; all the tests pass now. I will run the entire test suite and check again.

msvinaykumar commented 1 week ago

I see an error occurring while creating the experiment. It could be related to the state, such as whether Kruize and its related pods, including the database service, are ready to handle the request.
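If readiness is the cause, one way to rule it out would be to poll before the first request. A rough sketch, assuming a caller-supplied `probe` callable that returns True once Kruize and its database answer; the function name, the default timings, and whichever endpoint the probe hits are all assumptions here, not existing test-suite code:

```python
import time


def wait_until_ready(probe, timeout=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe()` until it returns True or `timeout` seconds elapse.

    Hypothetical helper. `sleep` and `clock` are injectable so the loop
    can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Treat connection errors and similar failures as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The probe could be as simple as a GET against any endpoint that only succeeds when the service and database are up, returning True on a 200.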

chandrams commented 1 week ago

Yes, the create experiment call failed with a 502 error in this job, so I commented out the test below; the other tests work fine.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

We need to check why 502 occurs when we run the entire sanity bucket.

chandrams commented 1 week ago

After commenting out test_list_recommendations_cpu_mem_optimised and running the entire functional test suite manually, two new tests failed with a 502 response from listRecommendations:

Listing the recommendations...
URL =  http://kruize-openshift-tuning.apps.kruize-scalelab.h0b5.p1.openshiftapps.com/listRecommendations
PARAMS =  {'experiment_name': 'quarkus-resteasy-kruize-min-http-response-time-db_0'}
Response status code =  502

************************************************************
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>
.
.
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_true-15-reco_json_schema4-360.0-True-False]
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_false-15-reco_json_schema5-360.0-False-False]
========== 17 failed, 10 passed, 334 deselected in 2597.64s (0:43:17) ==========

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2988 seconds
Number of tests performed 358
Number of tests passed 313
Number of tests failed 45

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
                  negative
                  extended
chandrams commented 1 week ago

Executed the test suite again and the above 2 failures were not seen; the 502 error seems to be intermittent.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 5843 seconds
Number of tests performed 358
Number of tests passed 315
Number of tests failed 43

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
          negative
          extended

Check Log Directory: /home/jenkins/test_res_alltests_0.0.24_skip_cpu_mem_optimized/kruize_test_results/kruize_20240904:07:45:37/remote_monitoring_tests for failed cases 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************

*********************************************************************************
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Overall summary of the tests ~~~~~~~~~~~~~~~~~~~~~~~
Total time taken to perform the test 5843 seconds
Total Number of test suites performed 1
Total Number of tests performed 358
Total Number of tests passed 315
Total Number of tests failed 43

These 43 failures are due to known issues.

Executed only the sanity bucket with the previously skipped test, test_list_recommendations_cpu_mem_optimised, re-enabled; it passed and the 502 error did not occur.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2051 seconds
Number of tests performed 155
Number of tests passed 155
Number of tests failed 0

~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests passed ~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************
chandrams commented 1 week ago

Logs of another sanity run that failed with a Kruize pod restart: test_res_sanity_functional_0.0.24.zip

chandrams commented 4 days ago

I ran one of the failing tests on its own against the builds below. Here are the results:

pytest -s test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised --cluster_type openshift

Executed this test 5 times:

With 0.0.22_mvp, did not see the failure (it could be very intermittent, though I did not see it in 5 runs)
With 0.0.23_mvp, the test failed in 2 out of 5 runs
With 0.0.24_mvp, the test failed in 2 out of 5 runs
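For anyone repeating this comparison, the repeated runs can be scripted. A small sketch that wraps the pytest command above; `run_pytest_once` and `count_failures` are illustrative names, and the only assumption about pytest is that its exit code is 0 when the selected test passes:

```python
import subprocess

TEST_ID = "test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised"


def run_pytest_once():
    """Run the single test with the command from this comment; True on pass."""
    result = subprocess.run(
        ["pytest", "-s", TEST_ID, "--cluster_type", "openshift"]
    )
    return result.returncode == 0


def count_failures(runs=5, run_once=run_pytest_once):
    """Run the test `runs` times and return how many runs failed."""
    return sum(0 if run_once() else 1 for _ in range(runs))
```

Calling count_failures() once per image tag gives the failed-runs-out-of-5 figure for each build.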

Note: When the test fails, the Kruize pod is restarted.
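To narrow down why the pod restarts, it may help to capture the restart count and the last termination reason right after a failing run. A sketch that summarizes the output of `oc get pods -o json` (or `kubectl get pods -o json`); the helper name is made up, the namespace to query is left to the caller, and the fields read are the standard pod status fields `restartCount` and `lastState.terminated.reason`:

```python
import json


def container_restarts(pods_json):
    """Summarize restarts from `oc get pods -o json` output.

    Hypothetical helper. Returns a list of tuples:
    (pod name, container name, restart count, last termination reason or None).
    """
    rows = []
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # lastState is empty for containers that have never restarted.
            terminated = status.get("lastState", {}).get("terminated") or {}
            rows.append((
                pod_name,
                status["name"],
                status.get("restartCount", 0),
                terminated.get("reason"),
            ))
    return rows
```

A reason such as OOMKilled versus Error in that last column would point to different causes for the 502s.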

@msvinaykumar @khansaad - Can you please take a look at this issue?