Open xuefengze opened 7 months ago
There is a null ptr issue in the log, which has been fixed in release-1.8, I suggest we rerun the pipeline with new image.
And I notice the pipeline is using the dedicated source, it is problematic in chaos-mesh test as explained in https://github.com/risingwavelabs/risingwave/issues/15141#issuecomment-1980429130. Please refactor the pipeline to ensure the historical data is empty or use share source instead. cc @cyliu0
BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q1,q2,q3,q4"
RW_VERSION="nightly-20240331"
BENCHMARK_TYPE="tpc"
TPC_TARGET="postgres"
duration="10m"
experiments="[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"
test failed with share source. https://buildkite.com/risingwave-test/chaos-mesh/builds/716
I found same error logs as in the #15141. We are working a fix patch to try to fix the issue, tracked in #15464
Hi @xuefengze, please rerun the test with RW_VERSION=cdc-commit-offset
, which is the image of https://github.com/risingwavelabs/risingwave/pull/16058
https://buildkite.com/risingwave-test/chaos-mesh/builds/730#018eb902-b9e8-44f1-83a4-058cc8232d99 compute-0 got panicked when running q3
I run the chaos test with the image built by https://github.com/risingwavelabs/risingwave/pull/16058, the Cannot seek to the last known offset
WARN logs are gone, which means the persistent source offset are always valid with the PR. But q4 still encounter source table inconsistent with upstream. Still investigating.
This time there are more rows in the RW tables:
<html>
<body>
<!--StartFragment-->
{
--
| "consistent": false,
| "table-checksums": [
| {
| "url": "postgres://postgres:postgres@localhost:5432/postgres",
| "table-name": "history",
| "table-checksum": 8546249372229983623,
| "table-rows": 46500
| },
| {
| "url": "postgres://root:@127.0.0.1:4567/dev",
| "table-name": "history",
| "table-checksum": 1784133604083324500,
| "table-rows": 46541
| }
| ]
| },
| {
| "consistent": false,
| "table-checksums": [
| {
| "url": "postgres://postgres:postgres@localhost:5432/postgres",
| "table-name": "new_order",
| "table-checksum": -758467578028615202,
| "table-rows": 17
| },
| {
| "url": "postgres://root:@127.0.0.1:4567/dev",
| "table-name": "new_order",
| "table-checksum": -5196837531638814068,
| "table-rows": 23
| }
| ]
| },
| {
| "consistent": false,
| "table-checksums": [
| {
| "url": "postgres://postgres:postgres@localhost:5432/postgres",
| "table-name": "orders",
| "table-checksum": 5892817120355204331,
| "table-rows": 140849
| },
| {
| "url": "postgres://root:@127.0.0.1:4567/dev",
| "table-name": "orders",
| "table-checksum": 9047804123778707166,
| "table-rows": 141184
| }
| ]
| },
| {
| "consistent": false,
| "table-checksums": [
| {
| "url": "postgres://postgres:postgres@localhost:5432/postgres",
| "table-name": "order_line",
| "table-checksum": 7067257756083074964,
| "table-rows": 1408500
| },
| {
| "url": "postgres://root:@127.0.0.1:4567/dev",
| "table-name": "order_line",
| "table-checksum": 8647488982357384687,
| "table-rows": 1412341
| }
| ]
| },
<!--EndFragment-->
</body>
</html>
I also run against the nightly image (750), the symptom is same: RW rows is more than upstream pg.
During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.
I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏
During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.
I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏
I am building an image for the branch, let's see whether the issue can be fixed.
I run the chaos mesh test against the PR, but it fails in q3. https://buildkite.com/risingwave-test/chaos-mesh/builds/816#018f5610-37ba-450c-8c59-067fa74e648e
But the nightly image can pass the test.
BENCH_TESTBED=medium-arm-all-affinity
CH_BENCHMARK_QUERY=q1,q2,q3,q4
RW_VERSION=nightly-20240507
BENCHMARK_TYPE=tpc
TPC_TARGET=postgres
duration=10m
experiments=[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]
I think we should take a look before merge the PR.
Chaos-mesh test failed(ch-benchmark-pg-cdc). The experiment made the meta unavailable for 20 seconds.
Data verification fails:
Data verification fails: