oracle / oracle-database-operator

The Oracle Database Operator for Kubernetes (a.k.a. OraOperator) helps developers, DBAs, DevOps and GitOps teams reduce the time and complexity of deploying and managing Oracle Databases. It eliminates the dependency on a human operator or administrator for the majority of database operations.
Universal Permissive License v1.0

AutonomousDatabase CR stuck at Provisioning or Terminating #112

Closed laguvard closed 1 month ago

laguvard commented 5 months ago

During a series of about a dozen AutonomousDatabase create/delete cycles, the CR remained stuck three times in either the Provisioning or the Terminating state, even after the OCI ADB had transitioned to Available or Terminated. The cause was a transient HTTP 500 error returned by the OCI Database service.

1) It appears that there was no attempt to recover. The logs show:

DEBUG 2024/06/14 04:14:56.145788 asm_amd64.s:1650: Retry policy to use: {MaximumNumberAttempts=1, MinSleepBetween=0, MaxSleepBetween=0, ExponentialBackoffBase=0, NonEventuallyConsistentPolicy=<nil>}

2) There was no indication on the CR (e.g., a condition) to show the error.

Here's the error:

INFO 2024/06/14 04:14:57.577023 client.go:463: Dump Response HTTP/1.1 500 Internal Server Error
Connection: close
Content-Length: 502
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=iso-8859-1
Date: Fri, 14 Jun 2024 04:14:56 GMT
Opc-Request-Id: 527875f6252c7e89ba40f484d3b47ef9/8077408D57D9A0CABBEC59A796F17A7F/C0B75A80CCCA38741AFB0B4820DD1D99
Strict-Transport-Security: max-age=31536000; includeSubDomains;
X-Content-Type-Options: nosniff

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 Request failed.</title>
</head>
<body><h2>HTTP ERROR 500 Request failed.</h2>
<table>
<tr><th>URI:</th><td>/20160918/autonomousDatabases/ocid1.autonomousdatabase.oc1.iad.anuwcljs3lzhqhyadm6k5rdkjelgvsbp3k3c4ym4jqaivgemcsp6mgoqgqeq</td></tr>
<tr><th>STATUS:</th><td>500</td></tr>
<tr><th>MESSAGE:</th><td>Request failed.</td></tr>
<tr><th>SERVLET:</th><td>jersey</td></tr>
</table>

</body>
</html>
DEBUG 2024/06/14 04:14:57.577078 client.go:465: Error response could not be parsed due to: invalid character '<' looking for beginning of value
2024-06-14T04:14:57Z    ERROR   controllers.database.AutonomousDatabase.validateOperation.manageError   UpdateFailed    {"Namespace/Name": {"name":"oraclepdb-152abed7-367a-4631-b94e-156cf7b3ae59-adb","namespace":"test"}, "error": "[Error returned by Database Service. Http Status Code: 500. Error Code: BadErrorResponse. Opc request id: c35ed8b3f385c5d8191cfce81e1e6ee0/27F2B9A32C934160A70414F24047B67E/F7C056F133118AEEBC3207A1A6CD87FA. Message: Failed to parse json from response body due to: invalid character '<' looking for beginning of value. With response body <html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"/>\n<title>Error 500 Request failed.</title>\n</head>\n<body><h2>HTTP ERROR 500 Request failed.</h2>\n<table>\n<tr><th>URI:</th><td>/20160918/autonomousDatabases/ocid1.autonomousdatabase.oc1.iad.anuwcljs3lzhqhyadm6k5rdkjelgvsbp3k3c4ym4jqaivgemcsp6mgoqgqeq</td></tr>\n<tr><th>STATUS:</th><td>500</td></tr>\n<tr><th>MESSAGE:</th><td>Request failed.</td></tr>\n<tr><th>SERVLET:</th><td>jersey</td></tr>\n</table>\n\n</body>\n</html>\n.\nOperation Name: GetAutonomousDatabase\nTimestamp: 2024-06-14 04:14:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.49.3\nRequest Endpoint: GET https://database.us-ashburn-1.oraclecloud.com/20160918/autonomousDatabases/ocid1.autonomousdatabase.oc1.iad.anuwcljs3lzhqhyadm6k5rdkjelgvsbp3k3c4ym4jqaivgemcsp6mgoqgqeq
laguvard commented 5 months ago

Version 1.1.0

ting-lan-wang commented 1 month ago

Hi @laguvard, thanks for reporting the issues. The first one can be fixed by increasing the retry count for each request and by making the controller requeue the request when the OCI ADB is in an intermediate state such as Provisioning or Terminating. Both issues will be fixed in the next release.

laguvard commented 1 month ago

Thanks @ting-lan-wang. Does Closed mean that you've merged the fixes?

ting-lan-wang commented 1 month ago

Hi @laguvard, the fixes were delivered internally. They will go public in the next version (v1.2.0).