risingwavelabs / risingwave-operator

RisingWave Kubernetes Operator
https://www.risingwave.com/cloud
Apache License 2.0

chore: bump risingwave to v1.8.2 #642

Closed · arkbriar closed this 2 months ago

arkbriar commented 2 months ago

What's changed and what's your intention?

PLEASE DO NOT LEAVE THIS EMPTY !!!

Please explain IN DETAIL what the changes are in this PR and why they are needed:

Checklist

Refer to a related PR or issue link (optional)

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.98%. Comparing base (c940c86) to head (c3d98b4). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #642   +/-   ##
=======================================
  Coverage   55.98%   55.98%
=======================================
  Files          38       38
  Lines        6436     6436
=======================================
  Hits         3603     3603
  Misses       2725     2725
  Partials      108      108
```

| [Flag](https://app.codecov.io/gh/risingwavelabs/risingwave-operator/pull/642/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=risingwavelabs) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/risingwavelabs/risingwave-operator/pull/642/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=risingwavelabs) | `55.98% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=risingwavelabs#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.

arkbriar commented 2 months ago

Due to a recent change in v1.8.1, compute nodes won't be available for executing queries until the meta node has successfully connected to them. That means the Kubernetes DNS setup adds a delay to their availability.

It won't affect the existing operator behavior, but the current readiness status (a.k.a. Running==true) is no longer a good signal for the status of RW. The E2E tests failed for exactly this reason.

A typical symptom is that queries fail when run immediately after RW comes up (according to the condition above). The client-side error message looks like:

Caused by these errors (recent errors listed first):
  1: gRPC request to meta service failed: The service is currently unavailable
  2: Service unavailable: No available parallel units to schedule

cc @wjf3121 @lukeraphael for awareness.

The PR will be held until the problem is resolved. One candidate solution would be to update the readiness condition of the compute nodes so that they only become ready after the meta recognizes them. However, that won't work: it creates a deadlock between readiness and the DNS setup, since by default the headless Service only publishes DNS records for ready pods, while the meta can't recognize a compute node whose address it cannot resolve.
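For reference, a common Kubernetes-level escape hatch for this kind of readiness/DNS deadlock is to publish DNS records even for pods that are not yet ready. A minimal sketch using the client-go types follows; it is illustrative only (hypothetical labels and port), not necessarily what the operator generates:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildHeadlessService sketches a headless Service whose DNS records are
// published even for pods that have not passed their readiness probes yet,
// breaking the "ready only after meta connects" <-> "DNS only after ready" cycle.
// The selector labels are hypothetical.
func buildHeadlessService(ns, name string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name},
		Spec: corev1.ServiceSpec{
			ClusterIP:                corev1.ClusterIPNone, // headless
			PublishNotReadyAddresses: true,                 // expose not-ready pods in DNS
			Selector:                 map[string]string{"app": name},
			Ports:                    []corev1.ServicePort{{Name: "service", Port: 5688}},
		},
	}
}

func main() {
	svc := buildHeadlessService("e2e-0", "e2e-compute")
	fmt.Println(svc.Name, svc.Spec.PublishNotReadyAddresses)
}
```

Whether that is acceptable here depends on whether clients can tolerate occasionally connecting to a compute node that hasn't finished starting up.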

arkbriar commented 2 months ago

Pushed a commit that tries to fix the e2e tests by waiting for the compute nodes to become ready in the meta node.
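For illustration, such a wait can be implemented by polling the frontend over the Postgres wire protocol until a statement that requires compute scheduling succeeds. This is only a sketch under assumptions: the service name, the use of a throwaway table, and the timings are hypothetical and not the actual e2e change.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // RisingWave speaks the Postgres wire protocol
)

// waitForCompute polls until a DDL statement that needs compute scheduling
// succeeds, i.e. until the meta node has registered at least one compute node.
// The DSN and the probe statement are assumptions for illustration.
func waitForCompute(dsn string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		db, err := sql.Open("postgres", dsn)
		if err == nil {
			_, err = db.Exec("CREATE TABLE IF NOT EXISTS __readiness_probe (v INT)")
			if err == nil {
				_, _ = db.Exec("DROP TABLE IF EXISTS __readiness_probe")
				db.Close()
				return nil
			}
			db.Close()
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("compute nodes not schedulable after %s", timeout)
}

func main() {
	// Hypothetical frontend address inside the cluster.
	dsn := "postgres://root@e2e-frontend:4566/dev?sslmode=disable"
	if err := waitForCompute(dsn, 2*time.Minute); err != nil {
		log.Fatal(err)
	}
	fmt.Println("compute ready")
}
```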

arkbriar commented 2 months ago

The problem is caused by slow compute readiness, which requires the meta node to be ready first and Kubernetes to have set up the DNS records; only after the compute node connects to the meta does it start its own server. I think the best solution is to use a separate port for the health check of compute/compactor nodes.
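As a sketch of that idea (not the actual RisingWave implementation, which is in Rust): the standard gRPC health checking protocol can be served on a dedicated port as soon as the process starts, and Kubernetes' native gRPC probes (readinessProbe.grpc) could then target that port. The port number below is purely hypothetical.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Serve the standard gRPC health service on a dedicated, hypothetical port
	// so the kubelet's probes can reach it even before the main service port is up.
	lis, err := net.Listen("tcp", ":6788")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	h := health.NewServer()
	healthpb.RegisterHealthServer(srv, h)

	// Flip to SERVING once local initialization is done; the meta connection
	// is no longer part of the readiness signal.
	h.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	log.Fatal(srv.Serve(lis))
}
```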

arkbriar commented 2 months ago

Merge queue failed

CREATE_TABLE
CREATE_MATERIALIZED_VIEW
ERROR:  Failed to run the query

Caused by:
  internal error: failed to create RPC client to e2e-compute-0.e2e-compute:5688

command terminated with exit code 3

arkbriar commented 2 months ago

I'm pretty sure the problem is caused by sync latency between the CoreDNS pods. I reran the risingwave::multi_meta test case many times and finally captured the DNS resolution failure: the lookup issued by the meta node resolved the compute node's address successfully, but the one issued by the frontend while handling the query failed.

The latency looked pretty high (>10s) in the logs, which I presume is abnormal. Why deployments with v1.7.3 don't hit this problem is still unknown; I have to dive deeper.
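The output below was captured by resolving the compute node's headless-service name every couple of seconds from a pod. A roughly equivalent sketch in Go (the hostname and interval are taken from this test, the rest is an assumption):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	// Repeatedly resolve the compute pod's headless-service name and log the
	// result, to catch intermittent NXDOMAIN answers from individual CoreDNS
	// replicas (the in-cluster resolver load-balances between them).
	const host = "e2e-compute-0.e2e-compute.e2e-0.svc.cluster.local"
	r := &net.Resolver{}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		addrs, err := r.LookupHost(ctx, host)
		cancel()
		if err != nil {
			log.Printf("lookup failed: %v", err)
		} else {
			log.Printf("resolved %s -> %v", host, addrs)
		}
		time.Sleep(2 * time.Second)
	}
}
```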

Mon May  6 14:59:45 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   e2e-compute-0.e2e-compute.e2e-0.svc.cluster.local
Address: 10.1.5.4

Mon May  6 14:59:47 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

** server can't find e2e-compute-0.e2e-compute: NXDOMAIN

command terminated with exit code 1
Mon May  6 14:59:49 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   e2e-compute-0.e2e-compute.e2e-0.svc.cluster.local
Address: 10.1.5.4

Mon May  6 14:59:52 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   e2e-compute-0.e2e-compute.e2e-0.svc.cluster.local
Address: 10.1.5.4

Mon May  6 14:59:54 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   e2e-compute-0.e2e-compute.e2e-0.svc.cluster.local
Address: 10.1.5.4

Mon May  6 14:59:56 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

** server can't find e2e-compute-0.e2e-compute: NXDOMAIN

command terminated with exit code 1
Mon May  6 14:59:59 +08 2024
Server:         10.96.0.10
Address:        10.96.0.10#53

** server can't find e2e-compute-0.e2e-compute: NXDOMAIN

arkbriar commented 2 months ago

Found a probable explanation in another issue:

If you queried the headless service before it was created, then the negative response would be cached. However, the TTL for records served by the kubernetes plugin defaults to 5 seconds, so those negative responses should only be cached by the cache plugin for 5 seconds.

The latency was probably caused by negative caching of the failed lookups. The delay can now be explained: the meta node tries to resolve the compute node's address periodically, and before the record exists it will likely have sent requests to both DNS servers behind the service, so both cache the failure. In the local (kind) setup, the record TTL and the cache plugin's TTL are both 30s, which means the first failure on each server is cached for 30s. The gap between the servers' first failures then becomes the gap between their first successful resolutions.
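To make that reasoning concrete, here is a toy model under assumed timings (two CoreDNS replicas, 30s negative-cache TTL, the record appearing a few seconds after the first lookups); none of the timestamps come from the real logs:

```go
package main

import "fmt"

func main() {
	// Toy model: each CoreDNS replica caches its first NXDOMAIN answer for
	// negTTL seconds. The record actually exists from recordAt onwards, but a
	// replica keeps answering NXDOMAIN until its cached failure expires.
	const negTTL = 30.0
	const recordAt = 5.0 // headless-service record becomes resolvable (seconds)

	firstMiss := map[string]float64{"dns-a": 0.0, "dns-b": 2.0} // first failed query per replica

	answer := func(server string, t float64) string {
		if t < firstMiss[server]+negTTL {
			return "NXDOMAIN (cached failure)"
		}
		if t >= recordAt {
			return "10.1.5.4"
		}
		return "NXDOMAIN"
	}

	// A client (e.g. the frontend) querying at t=10s still fails on both
	// replicas even though the record has existed for 5 seconds; by t=35s
	// both cached failures have expired.
	for _, srv := range []string{"dns-a", "dns-b"} {
		fmt.Printf("t=10s via %-5s -> %s\n", srv, answer(srv, 10))
		fmt.Printf("t=35s via %-5s -> %s\n", srv, answer(srv, 35))
	}
}
```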

arkbriar commented 2 months ago

Increased the wait period to 30s to make sure the cached negative responses expire.