chaos mesh daily test: ch-benchmark-pg-cdc data verification failed

xuefengze commented 7 months ago

Chaos-mesh test failed(ch-benchmark-pg-cdc). The experiment made the meta unavailable for 20 seconds.

================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly ch-benchmark-pg-cdc
Namespace            longcmkf-20240227-190312
TestBed              medium-arm-all-affinity
RW Version           nightly-20240227    
Test Start time      2024-02-27 19:07:23 
Test End time        2024-02-27 19:50:13 
Test Queries         q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q17,q18,q19,q20,q21,q22
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-190312&from=1709060843000&to=1709063413000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-190312&from=1709060843000&to=1709063413000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240227-190312/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/629

Data verification fails:

revenue1 table/mv data checksum inconsistent
ch_benchmark_q4 table/mv data checksum inconsistent
order_line table/mv data checksum inconsistent

================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly ch-benchmark-pg-cdc
Namespace            longcmkf-20240227-195152
TestBed              medium-arm-all-affinity
RW Version           nightly-20240227    
Test Start time      2024-02-27 19:54:13 
Test End time        2024-02-27 21:53:46 
Test Queries         q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q17,q18,q19,q20,q21,q22
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-195152&from=1709063653000&to=1709070826000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-195152&from=1709063653000&to=1709070826000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240227-195152/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/629

Data verification fails:

revenue1 table/mv data checksum inconsistent
ch_benchmark_q10 table/mv data checksum inconsistent
order_line table/mv data checksum inconsistent
stock table/mv data checksum inconsistent

xuefengze commented 7 months ago

link https://github.com/risingwavelabs/risingwave/issues/15245 https://github.com/risingwavelabs/risingwave/issues/15190 https://github.com/risingwavelabs/risingwave/issues/15279

StrikeW commented 6 months ago

There is a null ptr issue in the log, which has been fixed in release-1.8, I suggest we rerun the pipeline with new image.

And I notice the pipeline is using the dedicated source, it is problematic in chaos-mesh test as explained in https://github.com/risingwavelabs/risingwave/issues/15141#issuecomment-1980429130. Please refactor the pipeline to ensure the historical data is empty or use share source instead. cc @cyliu0

xuefengze commented 6 months ago

BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q1,q2,q3,q4"
RW_VERSION="nightly-20240331"
BENCHMARK_TYPE="tpc"
TPC_TARGET="postgres"
duration="10m"
experiments="[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"

test failed with share source. https://buildkite.com/risingwave-test/chaos-mesh/builds/716

StrikeW commented 6 months ago

I found same error logs as in the #15141. We are working a fix patch to try to fix the issue, tracked in #15464

StrikeW commented 5 months ago

Hi @xuefengze, please rerun the test with RW_VERSION=cdc-commit-offset, which is the image of https://github.com/risingwavelabs/risingwave/pull/16058

xuefengze commented 5 months ago

https://buildkite.com/risingwave-test/chaos-mesh/builds/730#018eb902-b9e8-44f1-83a4-058cc8232d99 compute-0 got panicked when running q3

StrikeW commented 5 months ago

I run the chaos test with the image built by https://github.com/risingwavelabs/risingwave/pull/16058, the Cannot seek to the last known offset WARN logs are gone, which means the persistent source offset are always valid with the PR. But q4 still encounter source table inconsistent with upstream. Still investigating.

hzxa21 commented 5 months ago

This time there are more rows in the RW tables:

<html>
<body>
<!--StartFragment-->
{
--
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "history",
  | "table-checksum": 8546249372229983623,
  | "table-rows": 46500
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "history",
  | "table-checksum": 1784133604083324500,
  | "table-rows": 46541
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "new_order",
  | "table-checksum": -758467578028615202,
  | "table-rows": 17
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "new_order",
  | "table-checksum": -5196837531638814068,
  | "table-rows": 23
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "orders",
  | "table-checksum": 5892817120355204331,
  | "table-rows": 140849
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "orders",
  | "table-checksum": 9047804123778707166,
  | "table-rows": 141184
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "order_line",
  | "table-checksum": 7067257756083074964,
  | "table-rows": 1408500
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "order_line",
  | "table-checksum": 8647488982357384687,
  | "table-rows": 1412341
  | }
  | ]
  | },

<!--EndFragment-->
</body>
</html>

StrikeW commented 5 months ago

I also run against the nightly image (750), the symptom is same: RW rows is more than upstream pg.

stdrc commented 4 months ago

During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.

I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏

StrikeW commented 4 months ago

During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.

I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏

I am building an image for the branch, let's see whether the issue can be fixed.

StrikeW commented 4 months ago

I run the chaos mesh test against the PR, but it fails in q3. https://buildkite.com/risingwave-test/chaos-mesh/builds/816#018f5610-37ba-450c-8c59-067fa74e648e

But the nightly image can pass the test.

BENCH_TESTBED=medium-arm-all-affinity
CH_BENCHMARK_QUERY=q1,q2,q3,q4
RW_VERSION=nightly-20240507
BENCHMARK_TYPE=tpc
TPC_TARGET=postgres
duration=10m
experiments=[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]

I think we should take a look before merge the PR.

risingwavelabs / risingwave

chaos mesh daily test: ch-benchmark-pg-cdc data verification failed #15312