risingwavelabs / risingwave

SQL stream processing, analytics, and management. We decouple storage and compute to offer efficient joins, instant failover, dynamic scaling, speedy bootstrapping, and concurrent query serving.

Persistent Storage (S3) grows indefinitely #15963

Open · rotten opened this issue 3 months ago

rotten commented 3 months ago

Describe the bug

We are observing that the file storage in S3 continues to grow indefinitely, even on an idle database.

Error message/log

There are no errors associated with this issue.

To Reproduce

Graph your S3 BucketSizeBytes CloudWatch metric for idle databases to observe this.

Expected behavior

We kind of expected the "compactor" to actually compact the data and the size to decrease when there aren't many changes happening. At the least we'd expect the size to remain relatively constant on a database that is mostly idle. In our load testing environment, which is not idle, the rate of growth seems to be much faster and larger than the rate of data change. It never recovers, even when changes are undone and left to "rest" for a bit.

How did you deploy RisingWave?

We are deployed in ECS.

The version of RisingWave

PostgreSQL 9.5.0-RisingWave-1.7.2 (6ecc083c72b83a96e7e3b58ca3a19b28cd5f74bf)

Additional context

S3 data in & out can be pricey. What is it doing that it just keeps writing "stuff" into S3 and burning money?

hzxa21 commented 3 months ago

This is strange. We have built-in compaction and vacuum processes in RisingWave to compact and delete stale data on the object store. Can you share the meta node and compactor node logs?

In our load testing environment, which is not idle, the rate of growth seems to be much faster and larger than the rate of data change. It never recovers, even when changes are undone and left to "rest" for a bit.

Can you briefly describe the workload? By undoing the changes, do you mean deleting the rows?

It may contain some internal concepts but I will try my best to describe the compaction and vacuum process in RisingWave to help you understand the expected behavior:

  1. Compaction will be triggered based on a rule based strategy. The triggering rules are:

    • The LSM tree is imbalanced. You can check the "Compaction - Lsm Compact Pending Bytes" panel in the dev Grafana dashboard to get a rough sense of how imbalanced the LSM tree is.
    • Tombstones in the data files exceed tombstone_reclaim_ratio. You can use select * from rw_hummock_compaction_group_configs to check the tombstone_reclaim_ratio for the LSM tree (we have one LSM tree per compaction group). For the tombstone ratio of the files in your cluster, you can use the following SQL to list the files with the top 10 tombstone ratios and see whether any file's tombstone ratio is larger than tombstone_reclaim_ratio:
      SELECT compaction_group_id,
             level_id,
             sstable_id,
             stale_key_count * 1.0 / total_key_count AS tombstone_ratio
      FROM   rw_catalog.rw_hummock_sstables
      ORDER  BY tombstone_ratio DESC
      LIMIT  10;
  2. Compaction will read the old files and write new files with unused and stale entries removed. Note that if an entry is referenced by a snapshot held by queries, it won't be cleaned up by the compaction process. You can check and share the panels "Hummock Manager - Object Total Size" and "Hummock Manager - Epoch". We can help you identify whether there is any snapshot held for too long that could potentially block data clean-up (see the SQL sketch after this list for one way to check for pinned versions/snapshots).

  3. The compacted old file will be deleted from object store asynchronously during the so-called vacuum process. The vacuum process will be triggered every 30s by default so that the storage space should be released relatively quickly after compaction is done.
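
To make item 2 above concrete, here is a minimal sketch for checking whether anything is pinning old versions or snapshots. It assumes the rw_catalog views rw_hummock_pinned_versions and rw_hummock_pinned_snapshots are available in your release; names and columns may vary across versions:

-- Workers pinning an old Hummock version can delay the vacuum of replaced objects.
SELECT * FROM rw_catalog.rw_hummock_pinned_versions;

-- Workers pinning an old snapshot/epoch can keep compaction from dropping stale entries.
SELECT * FROM rw_catalog.rw_hummock_pinned_snapshots;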

rotten commented 3 months ago

Thank you. We will examine those metrics to see if anything interesting reveals itself as to why our S3 volumes grow in size indefinitely, and will report back here soon.

rotten commented 3 months ago

Digging deeper, I found these log entries get emitted on a regular basis:

WARN risingwave_storage::hummock::compactor::compactor_runner: Not enough core parallelism to serve the task 11636179 task_parallelism 1 running_task_parallelism 10 max_task_parallelism 10

I am currently running 9 compactor nodes, each with 4 CPUs and 8 GB of memory. They appear to be about 50% CPU utilized and have plenty of spare memory.
[Screenshot from 2024-04-09 13-30-22] This Grafana chart shows the total number of errors constantly increasing. I'm guessing it is related.

  1. If this is the root cause, should this be an ERROR rather than a WARN? I'll need to configure an alarm for this if it really is an ERROR.
  2. The documentation for compactors indicated that scaling the individual nodes larger wouldn't help beyond a certain size. I can force more nodes in the cluster. I've bumped it to 18 nodes to see if it will help.

Meanwhile the size is now 42 terabytes. If forcing more compactors doesn't help, I will need to rebuild it. I'm getting grief from the finance guys about this.

rotten commented 3 months ago

If the compactors scale up to 10 nodes, then later autoscale back down to, say, 4 nodes, could something still think it has 10 nodes and start throwing these errors? At first glance, about 30 minutes after making the change, it appears that scaling up to 18 nodes has stopped these errors. Not sure if it has stopped the incessant S3 data growth yet.

rotten commented 3 months ago

It looks like storage usage has turned the corner and is starting to come down. I think what happens in our environments is that the initial database load (from postgresql cdc) triggers autoscaling on the compactors. Then, once the load is complete, we autoscale them back down to some number less than the peak. Subsequent data changes are never large enough to push the compactor scaling back up to the peak levels. Vacuum/compaction starts failing with that warning. S3 grows indefinitely.

Manually making the cluster have more nodes than "peak", enables compaction to start working again - even if many of the extra nodes are mostly idle.

There ought to be another way to tell the cluster to stop trying to use nodes that are no longer there.

rotten commented 3 months ago

Bumping the number of compactors definitely removed almost all of those insufficient parallelism warnings. But not all. There were 3 or 4 emitted overnight.

However...

The S3 storage space started coming down for the first few hours. We freed up a few hundred GB. Then it stopped going down. And now it is going back up again. This was not the root cause.

PostgreSQL storage for the tables being replicated into RisingWave is at most 200G. The materialized views and indexes in RisingWave may explode that a little, but I would not expect it to exceed even 200G more, let alone 1T, especially because we are using very narrow materialized views consisting almost entirely of integers. Instead we are back to over 42T of S3 storage that doesn't seem reclaimable.

rotten commented 3 months ago

Huh. Drag and drop didn't work for those images (above). Here they are: [Screenshot from 2024-04-10 08-41-20] [Screenshot from 2024-04-10 08-38-34]

rotten commented 3 months ago

I think we'll go ahead and upgrade from 1.7.2 to 1.8.0 today and rebuild the database from scratch. I'm going to pin the total number of compactors at 3 and disable autoscaling for them. We'll see if the endless storage growth is still an issue.

hzxa21 commented 3 months ago

Huh. Drag and drop didn't work for those images (above). Here they are: [Screenshot from 2024-04-10 08-41-20] [Screenshot from 2024-04-10 08-38-34]

Hmm... Based on the panels you shared, I think there are stale versions/epochs being referenced somewhere, preventing the storage space from being freed. You can see that "Object Total Size - referenced by non-current versions" is huge, while normally it should be small. You can also see that the gap between the "safe epoch" (the minimal epoch that is pinned by the frontend node) and the max committed epoch is large, while most of the time there should be no gap.

I suspect there may be some zombie queries running in frontend. Can you share the panels under "Frontend" and "Batch Metrics" as well? You can also try restarting all nodes in the cluster including Compute/Frontend/Meta nodes to see whether "Object Total Size - referenced by non-current versions" drops.
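
As a quick check from the SQL side, the following sketch lists active sessions and queries; I'm assuming your RisingWave version supports SHOW PROCESSLIST (older releases may not):

-- List frontend sessions and their currently running queries; a long-running
-- batch query here could be what is pinning an old snapshot.
SHOW PROCESSLIST;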

rotten commented 3 months ago

I just tore the whole cluster down and rebuilt it, so it is too late to see if anything was still open in the frontend. (It is unlikely: everything connects via an API, and the API would run out of threads if it were holding queries open. It also has a statement timeout configured for each connection.) It will take a few hours for the initial data to load, and then we can run some tests again. I'll follow up on whether S3 continues to grow in size indefinitely on version 1.8.0 the way it did on 1.7.2.

I have the compactors pinned to 3 nodes with no autoscaling this time as well. (That is a dynamic setting, so we could easily re-enable autoscaling on the fly if we decide it isn't going to cause problems anymore.)

hzxa21 commented 3 months ago

Okay, let's wait and see. Keep us posted.

rotten commented 3 months ago

After rebuilding the database and pinning exactly 3 compactors I'm still seeing the occasional message in the compactor logs:

WARN risingwave_storage::hummock::compactor::compactor_runner: Not enough core parallelism to serve the task 54965 task_parallelism 1 running_task_parallelism 10 max_task_parallelism 10

That can't be the result of downscaling because we haven't changed the number of compactors; they are fixed at 3.

With most of the initial load complete, the S3 storage size is 42GB rather than the 42TB we saw before we rebuilt. I expect that to climb slightly as we start running tests. We'll keep tracking it and report back soon.

hzxa21 commented 3 months ago
WARN risingwave_storage::hummock::compactor::compactor_runner: Not enough core parallelism to serve the task 54965 task_parallelism 1 running_task_parallelism 10 max_task_parallelism 10

To improve CPU usage, our compactor will try to over-pull compaction tasks, so it is expected to occasionally see this WARN log on the compactor. If the compactor's CPU usage is not constantly full, you don't need to worry about it.

rotten commented 3 months ago

I wanted to note this slack thread on the risingwave slack that was started by someone else: https://risingwave-community.slack.com/archives/C06R8DA3DGT/p1712846571100339

At first the AI insisted that disk storage is never recovered, but then I jumped in and asked the AI if the compactors would recover it, and the AI seemed to think they should. I wonder if the person posting is having the same issues we are?

Since the rebuild, we have seen S3 usage steadily climbing as we insert/update/delete rows under modest load, while the compactors are now mostly idle (so they aren't trying to reclaim anything). We will run some larger scale changes through the database over the weekend and into early next week and continue to monitor this.

rotten commented 3 months ago

The total storage after the initial load finished processing and the database idled came to about 42G (not much more than when I reported earlier). After a light load yesterday, adding and removing a few thousand rows, the storage jumped to 62G. A 50% increase.

In our other, very lightly loaded environment, the storage jumped from 50G to over 1T overnight, stayed that way for several days, then fell back to 50G again. I have no idea what was going on there. It certainly wasn't related to queries or CDC.

hzxa21 commented 3 months ago

Can you check and share the panels under the Hummock Manager, Batch Metrics, and Frontend sections? As mentioned in the previous comment https://github.com/risingwavelabs/risingwave/issues/15963#issuecomment-2047944293, I suspect that there are some versions being pinned, preventing the storage data from being GCed.

gshmu commented 1 month ago

@rotten Could you give me some advice?

rotten commented 1 month ago

The only time I've seen the storage come down is if we leave the database idle and don't run any queries for a few days. I don't have any other advice. It is always much larger than the postgresql storage. We gave up on RisingWave for now because of this and other stability issues. It is cool technology that I hope to get back to in another year or after it matures a little more.

hzxa21 commented 1 month ago

Can you check and share the panels under the Hummock Manager, Batch Metrics, and Frontend sections? As mentioned in previous comment #15963 (comment), I suspect that there are some versions being pinned, preventing the storage data from being GCed.

As I mentioned here, data cannot be reclaimed if there are long-running batch queries in the cluster, so you might want to check whether there are any. We have a query timeout mechanism in place, so it is strange that the storage space is not reclaimed even after hours.
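
For reference, a minimal sketch of how a session could check or tighten its own query timeout, assuming the Postgres-style statement_timeout session variable is honored by your RisingWave version (the exact variable name and default may differ across releases):

-- Inspect the current per-session query timeout (variable name assumed).
SHOW statement_timeout;
-- Cap batch queries for this session at five minutes.
SET statement_timeout = '5min';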

The only time I've seen the storage come down is if we leave the database idle and don't run any queries for a few days. I don't have any other advice. It is always much larger than the postgresql storage. We gave up on RisingWave for now because of this and other stability issues. It is cool technology that I hope to get back to in another year or after it matures a little more.

We have many production clusters with various user workloads running at the moment and haven't seen this issue. You can try out v1.8.2 or v1.9.1-rc.2 to see whether the issue persists.

gshmu commented 1 month ago

@rotten thank you.

@hzxa21 the image in the previous comment https://github.com/risingwavelabs/risingwave/issues/15963#issuecomment-2047944293 returns 404 not found.

All my pg_data is less than 30 GB, but MinIO disk usage exceeds 3 TB and continues to grow.

CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    -- (other connection properties omitted by the poster)
    slot.name = 'my_sync'
);

CREATE TABLE pg_statement (
    id BIGINT primary key
    -- (remaining columns omitted by the poster)
) FROM pg_source TABLE 'public.statement';

CREATE MATERIALIZED VIEW distinct_bs AS 
SELECT *
FROM (
    SELECT *, 
            ROW_NUMBER() OVER (PARTITION BY my_distinct_fields_length_11 ORDER BY id DESC) AS rn
    FROM pg_statement where deleted_at is null
) t
WHERE rn = 1;

CREATE SINK pg_distinct_bs_sink FROM distinct_bs with(
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql:my_connect',
    table.name = 'distinct_bs',
    type = 'upsert',
    primary_key = 'id'
);

hzxa21 commented 1 month ago

The distinct_bs MV will maintain some internal state, but it should not be significantly larger than the state in your source table. Can you provide the following information?

  1. Are all the nodes alive and healthy in the cluster? If you are deploying with k8s, you can use kubectl get pods to see the state of the pods.
  2. Can you provide the result of the following SQL queries:

    select version();

    SELECT compaction_group_id,
           level_id,
           sstable_id,
           stale_key_count * 1.0 / total_key_count AS tombstone_ratio
    FROM   rw_catalog.rw_hummock_sstables
    ORDER  BY tombstone_ratio DESC
    LIMIT  20;

    SELECT compaction_group_id,
           level_id,
           level_type,
           sub_level_id,
           count(*)                       AS total_file_count,
           round(sum(file_size) / 1024.0) AS total_file_size_kb
    FROM   rw_catalog.rw_hummock_sstables
    GROUP  BY compaction_group_id,
              level_type,
              level_id,
              sub_level_id
    ORDER  BY compaction_group_id,
              level_id,
              sub_level_id DESC,
              level_type DESC;

    select * from rw_hummock_compaction_group_configs;

    select * from rw_table_stats;
    -- (see the sketch after this list for one way to rank tables by state size)
  3. Can you check and provide a screenshot of the following panels in Grafana?
    • grafana risingwave-dev-dashboard:
      • All panels under "Cluster Node" section
      • All panels under "Batch Metrics" section
      • All panels under "Frontend" section
      • Panels "Version Id", "Epoch" and "Object Total Size" under "Hummock Manager" section
      • Panels "Compaction Success Count", "Lsm Compact Pending Bytes" under "Compaction" section.
    • grafana risingwave-user-dashboard:
      • "Alerts" under "Overview" section.

gshmu commented 1 month ago

@hzxa21 we are using docker-compose to run RisingWave. Because the disk kept growing, we shut the containers down.

More info: MinIO is using a mounted NFS disk.

hzxa21 commented 1 month ago

@hzxa21 we are using docker-compose to run RisingWave. Because the disk kept growing, we shut the containers down.

More info: MinIO is using a mounted NFS disk.

My guess is that either compaction is severely lagging or there is a long-running batch query pinning the stale objects. Which version are you using? Please kindly provide the above information when the issue occurs next time.

gshmu commented 1 month ago

I am using RisingWave to sync data on my PC: https://github.com/risingwavelabs/risingwave/issues/15963#issuecomment-2151242044

[Three screenshots attached] @hzxa21