scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

System.paxos table is unusually large, equivalent to GSI capacity #7839

Open Felix-zhoux opened 3 years ago

Felix-zhoux commented 3 years ago

Installation details
Scylla version (or git commit hash): 4.2.1-0.20201108.4fb8ebccff
Cluster size: 9
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): physical
Hardware: sockets= cores=10 hyperthreading= memory=50G
Disks: (SSD/HDD, count) 3.2T NVMe * 1

I created a table testgsi_btb2 with a GSI testgsi_itb2 through the Alternator interface, and continuously wrote data to this table (PutItem with LWT).

cqlsh> desc KEYSPACE alternator_testgsi_btb2 

CREATE KEYSPACE alternator_testgsi_btb2 WITH replication = {'class': 'NetworkTopologyStrategy', 'Beijing_data_center': '3', 'Hangzhou_data_center': '3', 'Nanjing_data_center': '3'}  AND durable_writes = true;

CREATE TABLE alternator_testgsi_btb2.testgsi_btb2 (
    obj text,
    bi text,
    ":attrs" map<text, blob>,
    PRIMARY KEY (obj, bi)
) WITH CLUSTERING ORDER BY (bi ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE MATERIALIZED VIEW alternator_testgsi_btb2."testgsi_btb2:testgsi_itb2" AS
    SELECT *
    FROM alternator_testgsi_btb2.testgsi_btb2
    WHERE "bi" IS NOT NULL AND "bi" IS NOT NULL
    PRIMARY KEY (bi, obj)
    WITH CLUSTERING ORDER BY (obj ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE'

As the test progressed, I found that the system.paxos table also kept growing, and its size was almost the same as that of the base table and the GSI.

Before test

[root@node13 data]# du -sh *
0   alternator_testgsi_btb2
728K    system

After test

[root@node13 data]# du -sh *
224G    alternator_testgsi_btb2
82G system
[root@node13 alternator_testgsi_btb2]# du -sh *
94G testgsi_btb2-b548a3e048b111eb9fc9000000000004
96G testgsi_btb2:testgsi_itb2-b548a3e248b111eb9fc9000000000004
[root@node13 system]# du -sh *|grep paxos
**84G** paxos-b7b7f0c2fd0a34108c053ef614bb7c2d

The disk usage of each node is roughly the same:

# nodetool status
Datacenter: data_center_1
===============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.10.10.111  263.57 GB  256          ?       50d20787-4c68-447f-9fc8-4ea43f56c7b7  rack1
UN  10.10.10.112  263.5 GB   256          ?       72c7e6da-840c-45df-b745-d0459a7e0644  rack1
UN  10.10.10.113  263.72 GB  256          ?       9476ea95-fa28-4433-9df2-b452cf1a2737  rack1
Datacenter: data_center_2
================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.10.10.131  263.62 GB  256          ?       c21baf36-e385-4b18-b79c-cde7abc7c678  rack1
UN  10.10.10.132  263.49 GB  256          ?       625a6b9a-57aa-4f28-9aad-911cd3236fd8  rack1
UN  10.10.10.133  263.37 GB  256          ?       a3a76eda-3577-42f8-87a9-c6035916edf0  rack1
Datacenter: data_center_3
===============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.10.10.122  263.89 GB  256          ?       57e62f1e-5086-4bfb-af86-1614c688b56c  rack1
UN  10.10.10.123  264.46 GB  256          ?       97bef44e-8db9-4415-a60f-fb2b93e93e78  rack1
UN  10.10.10.121  264.06 GB  256          ?       58733550-524f-440d-a20a-9bff0d58d2b6  rack1

So what is stored in system.paxos? I read on the official website that Scylla does not store hints for lightweight-transaction writes, since this would be redundant: all such writes are already present in the system.paxos table. During the test all nodes were in UN status, so what do those hints refer to?

gleb-cloudius commented 3 years ago

All writes go to system.paxos before they end up in their final destination. Scylla tries to remove the actual data from system.paxos after a write completes, but metadata (such as the Paxos ballot of the write) remains. If the size of the data you are writing is comparable to the size of the metadata Paxos leaves in system.paxos after each write, the result is not surprising. Another reason for the large size may be that the deletes Paxos issues end up in a separate sstable from the writes. Try running a major compaction on system.paxos and see if there is a substantial size difference before and after.

-- Gleb.
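For reference, a per-node major compaction of just the system.paxos table can be triggered with nodetool; this is the same invocation used later in this thread (run as root or with equivalent privileges):

# nodetool compact system paxos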

kostja commented 3 years ago

If it is of any consolation, data in system.paxos naturally expires after 10 days. You can speed this up by reducing gc_grace_seconds for system.paxos. Starting from 4.3 there is a paxos_grace_seconds configuration parameter settable per table (not once for all tables, like gc_grace_seconds on system.paxos).
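As a minimal sketch of what that looks like on 4.3 or later (the keyspace/table name and the one-day value here are placeholders, not taken from this issue; the actual commands run against the tables in this issue appear further down the thread):

cqlsh> ALTER TABLE my_keyspace.my_table WITH paxos_grace_seconds = 86400;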

nyh commented 3 years ago

If it is of any consolation, data in system.paxos naturally expires after 10 days. You can speed this up by reducing gc_grace_seconds for system.paxos. Starting from 4.3 there is a paxos_grace_seconds configuration parameter settable per table (not once for all tables, like gc_grace_seconds on system.paxos).

@kostja if I understand correctly, reducing gc_grace_seconds is not enough, and you also need to reduce the TTL of the items in this table? How does one do that?

kostja commented 3 years ago

The TTL of individual items cannot be changed. It is a derived value: it is taken from paxos_grace_seconds of the base table, or, if that is not set, from gc_grace_seconds of system.paxos.

nyh commented 3 years ago

Thanks. I guess the next step is to decide whether this issue should be closed, or we still need to do something:

  1. Do we consider the fact that the paxos table grows even after fully successful requests a bug we eventually want to fix, or a fact of life we will probably never fix until we stop using Paxos?
  2. Is this overhead, and what you explained above (the specific way in which gc_grace_seconds of system.paxos and paxos_grace_seconds of the table control both TTL and GC) documented anywhere in the CQL documentation?
  3. We also need to document it in the Alternator documentation, not just the CQL documentation (because Alternator users are not expected to read CQL documentation).
Felix-zhoux commented 3 years ago

Is this phenomenon related to GSI? There is no such problem when writing data to the base table without GSI.

gleb-cloudius commented 3 years ago

On Mon, Dec 28, 2020 at 03:09:16AM -0800, Xiang Zhou wrote:

Is this phenomenon related to GSI? There is no such problem when writing data to the base table without GSI.

What size of the paxos table do you see without GSI? If it is much smaller, it may be that the system is overloaded and the paxos round that is supposed to prune the data from the paxos table fails. Do you see any write failures?

-- Gleb.

slivne commented 3 years ago

@Felix-zhoux if you can share the info above we will be happy to further analyze this

Felix-zhoux commented 3 years ago

I hit a problem when setting the gc_grace_seconds parameter to 0:

cqlsh> ALTER TABLE system.paxos WITH gc_grace_seconds = 0;
Unauthorized: Error from server: code=2100 [Unauthorized] message="system keyspace is not user-modifiable."

I used the superuser to log in to cqlsh, but still can't set it:

cassandra@cqlsh> ALTER TABLE system.paxos WITH gc_grace_seconds = 0;
Unauthorized: Error from server: code=2100 [Unauthorized] message="system keyspace is not user-modifiable."
Felix-zhoux commented 3 years ago

Is this phenomenon related to GSI? There is no such problem when writing data to the base table without GSI.

This may be wrong. In the same cluster (in other clusters such an obvious problem has not been observed), when I write data to a table without a GSI, I still see the space used by system.paxos continue to grow. The growth of the paxos table is roughly equal to the amount of data written.

[root@node13 data]# du -sh *
244G    alternator_testgsi_btb2
524G    alternator_testnogsi_btb3
578G    system
gleb-cloudius commented 3 years ago

You cannot alter a system table. And even if you could, setting gc_grace_seconds to zero would make queries non-linearizable and defeat the purpose of paxos.

Did you try running a major compaction on system.paxos to see if it reduces the amount of data there? What is the size of individual writes into the base table? Do you see any errors in the Scylla logs?

-- Gleb.

nyh commented 3 years ago

You cannot alter a system table. And even if you could, setting gc_grace_seconds to zero would make queries non-linearizable and defeat the purpose of paxos.

The suggestion above was to reduce the TTL, which @kostja said can be done by reducing gc_grace_seconds - since the TTL is set to exactly gc_grace_seconds. Yes, it shouldn't be set to 0 for this reason (you don't want a TTL of 0), but @Felix-zhoux now says it cannot be set at all, so we need a different way to set this TTL and gc_grace_seconds, no?

Did you try running a major compaction on system.paxos to see if it reduces the amount of data there?

Good question. If the data in the Paxos table was already deleted, then what takes up the space isn't the data but a lot of tombstones, and only a major compaction (nodetool compact) is guaranteed to clear them. But, as pointed out above, they can only be cleared after the gc period has passed - which defaults to 10 days - which is probably far too long.

The other option is that what remains in the Paxos table is actual undeleted data. In this case, a major compaction won't help at all. Only expiring the data will help - but again this takes 10 days by default.

gleb-cloudius commented 3 years ago

On Mon, Jan 04, 2021 at 12:09:33AM -0800, nyh wrote:

You cannot alter a system table. And even if you could, setting gc_grace_seconds to zero would make queries non-linearizable and defeat the purpose of paxos.

The suggestion above was to reduce the TTL, which @kostja said can be done by reducing gc_grace_seconds - since the TTL is set to exactly gc_grace_seconds. Yes, it shouldn't be set to 0 for this reason (you don't want a TTL of 0), but @Felix-zhoux now says it cannot be set at all, so we need a different way to set this TTL and gc_grace_seconds, no?

The value is derived from gc_grace_seconds of the table the LWT operation is performed on, not from the system.paxos table itself.

Did you try running a major compaction on system.paxos to see if it reduces the amount of data there?

Good question. If the data in the Paxos table was already deleted, then what takes up the space isn't the data but a lot of tombstones, and only a major compaction (nodetool compact) is guaranteed to clear them. But, as pointed out above, they can only be cleared after the gc period has passed - which defaults to 10 days - which is probably far too long.

system.paxos gc_grace_seconds is zero.

-- Gleb.

gleb-cloudius commented 3 years ago

The value is derived from gc_grace_seconds of the table the LWT operation is performed on, not from the system.paxos table itself.

Even that is not correct. It is derived from paxos_grace_seconds of the base table, or set to 864000 if the former is not present.

nyh commented 3 years ago

So @Felix-zhoux, to summarize what @gleb-cloudius said, you should set paxos_grace_seconds on the base table (not the paxos table). This determines the TTL in the paxos table, and gc_grace_seconds on that paxos table is zero anyway. According to @kostja you need Scylla 4.3 or above for this. @gleb-cloudius is this documented somewhere - so the next user won't confuse all the different settings like we did here?

gleb-cloudius commented 3 years ago

@gleb-cloudius is this documented somewhere - so the next user won't confuse all the different settings like we did here?

Not sure. Need to check.

But please run a major compaction on system.paxos as well. I suspect we may not delete data in the memtable or during the memtable->sstable flush, and so end up with both the data and the tombstone on disk waiting to be compacted.

-- Gleb.

Felix-zhoux commented 3 years ago

I upgraded the cluster to version 4.3.rc3, then set the gc_grace_seconds and paxos_grace_seconds parameters for the two tables respectively:

Table with GSI: gc_grace_seconds=1, paxos_grace_seconds=0
Another table: gc_grace_seconds=0, paxos_grace_seconds=0

cqlsh> ALTER TABLE alternator_testgsi_btb2.testgsi_btb2 WITH gc_grace_seconds = 1;
cqlsh> ALTER TABLE alternator_testgsi_btb2.testgsi_btb2 WITH paxos_grace_seconds = 0;

cqlsh> ALTER TABLE alternator_testnogsi_btb3.testnogsi_btb3 WITH gc_grace_seconds = 0;
cqlsh> ALTER TABLE alternator_testnogsi_btb3.testnogsi_btb3 WITH paxos_grace_seconds = 0;

Then I ran nodetool compact on these two tables and the paxos table.

# nodetool compact system paxos
# nodetool compact alternator_testgsi_btb2
# nodetool compact alternator_testnogsi_btb3.testnogsi_btb3

Finally, the paxos table still takes up more than 500 GB.

Before:

247G    alternator_testgsi_btb2
520G    alternator_testnogsi_btb3
593G    system

After:

61G alternator_testgsi_btb2
520G    alternator_testnogsi_btb3
534G    system

The parameter setting is effective: the capacity of the table with the GSI dropped from more than 200 GB to 61 GB, because I recently deleted some data from that table. The remaining 500 GB of data in the paxos table should therefore not be tombstones.

gleb-cloudius commented 3 years ago

Changing paxos_grace_seconds after data has already been inserted does not affect the TTL of existing entries.

If you change paxos_grace_seconds to zero you may as well not use paxos at all.

How many distinct partitions do you have and how much data you have in each one?

-- Gleb.

Felix-zhoux commented 3 years ago

Changing paxos_grace_seconds after data has already been inserted does not affect the TTL of existing entries.

Okay, I will continue to write some data into these two tables and then see whether the paxos table keeps growing; if it grows, I will run a major compaction on the paxos table again.

How many distinct partitions do you have and how much data you have in each one?

Table with GSI (testgsi_btb2): the remaining 29 million rows of data are in the base table, and each row is a separate partition. The GSI has only two partitions, with more than 10 million rows of data in each partition.
Another table (testnogsi_btb3): 500 million rows of data, each row is a separate partition.

gleb-cloudius commented 3 years ago

On Tue, Jan 05, 2021 at 02:19:20AM -0800, Xiang Zhou wrote:

How many distinct partitions do you have and how much data you have in each one?

Table with GSI(testnogsi_btb3): The remaining 25 million rows of data in the base table, each row of data is a separate partition GSI has only two partitions, there are more than 10 million rows of data in each partition Another table(testnogsi_btb3): 500 million rows of data, each row of data is a separate partition.

The GSI is not interesting since you do not write to it with paxos. Which of the remaining two tables do you write to with paxos? What is the size of those partitions?

-- Gleb.

gleb-cloudius commented 3 years ago

On Tue, Jan 05, 2021 at 12:37:58PM +0200, Gleb Natapov wrote:

On Tue, Jan 05, 2021 at 02:19:20AM -0800, Xiang Zhou wrote:

How many distinct partitions do you have and how much data you have in each one?

Table with GSI(testnogsi_btb3): The remaining 25 million rows of data in the base table, each row of data is a separate partition GSI has only two partitions, there are more than 10 million rows of data in each partition Another table(testnogsi_btb3): 500 million rows of data, each row of data is a separate partition.

The GSI is not interesting since you do not write to it with paxos. Which of the remaining two tables do you write to with paxos? What is the size of those partitions?

If you write to both tables with paxos, we can do some calculations based on the on-disk sizes:

table with gsi: 64G/25M = 2620B
table without gsi: 520G/500M = 1117B
paxos table: 536G/525M = 1096B

So the average paxos entry is smaller than the table writes, although it is very close to the size of a write to the non-GSI table. A pruned paxos table entry contains the key of a row of the base table + uuid_type + timeuuid_type + timeuuid_type + timeuuid_type. Only the uuids give us 512B, which is half of the average entry size. What is the length of the key in those two tables? If it is around 400B, the size of the paxos table makes sense.

-- Gleb.

Felix-zhoux commented 3 years ago

GSI takes up half of the storage space of testgsi_btb2:

[root@node14 alternator_testgsi_btb2]# du -sh *
31G testgsi_btb2-b548a3e048b111eb9fc9000000000004
31G testgsi_btb2:testgsi_itb2-b548a3e248b111eb9fc9000000000004

What I saw is that the GSI created by Alternator is essentially a MATERIALIZED VIEW; is that right?

Back to the topic: the exact number of remaining rows in the base table is 29 million. Both tables are written with paxos. So:

table with gsi: 31G/29M = 1120
table without gsi: 520G/500M = 1090
paxos table: 534G/529M = 1058

The following are the keys of my two tables:

cqlsh> select obj,bi FROM alternator_testgsi_btb2.testgsi_btb2 LIMIT 1;

 obj                | bi
--------------------+-----
 my50n5obj880267903 | bi1

I think that the data in paxos and the total amount of data I write to the two tables are basically the same size, because they grow linearly together.

gleb-cloudius commented 3 years ago

On Tue, Jan 05, 2021 at 03:33:47AM -0800, Xiang Zhou wrote:

GSI takes up half of the storage space of testgsi_btb2:

[root@node14 alternator_testgsi_btb2]# du -sh *
31G   testgsi_btb2-b548a3e048b111eb9fc9000000000004
31G   testgsi_btb2:testgsi_itb2-b548a3e248b111eb9fc9000000000004

What I saw is that the GSI created by Alternator is essentially a MATERIALIZED VIEW; is that right?

As far as I know yes.

Back to the topic: the exact number of remaining rows in the base table is 29 million. Both tables are written with paxos. So:

table with gsi: 31G/29M = 1120
table without gsi: 520G/500M = 1090
paxos table: 534G/529M = 1058

The following are the keys of my two tables:

cqlsh> select obj,bi FROM alternator_testgsi_btb2.testgsi_btb2 LIMIT 1;

 obj                | bi
--------------------+-----
 my50n5obj880267903 | bi1

So the key is around 20 bytes long. That means we can explain ~550B of each entry in the paxos table. We have ~500B more of unknown origin. Nadav, do you know what overhead of the sstable format is expected for the paxos table schema?

OTOH it is very strange that a base table record like the one you show above would take ~1000B, as our calculations show.

I think that the data in paxos and the total amount of data I write to the two tables are basically the same size, because they grow linearly together.

This is because the size of the data you are writing into the base table is close to the size of the record paxos saves for each write.

-- Gleb.

nyh commented 3 years ago

So the key is around 20 bytes long. That means we can explain ~550B of each entry in the paxos table. We have ~500B more of unknown origin. Nadav, do you know what overhead of the sstable format is expected for the paxos table schema?

I don't have any such estimates off-hand, but obviously the sstables have some overhead - all sorts of size fields, indexes, etc. The overhead would be particularly large if the Paxos table has small partitions - is that the case? By the way, did you do a "nodetool compact" (major compaction) before estimating the size?

gleb-cloudius commented 3 years ago

On Tue, Jan 05, 2021 at 05:36:58AM -0800, nyh wrote:

So the key is around 20 bytes long. That means we can explain ~550B of each entry in the paxos table. We have ~500B more of unknown origin. Nadav, do you know what overhead of the sstable format is expected for the paxos table schema?

I don't have any such estimates off-hand, but obviously the sstables have some overhead - all sorts of size fields, indexes, etc. The overhead would be particularly large if the Paxos table has small partitions - is that the case?

It is.

By the way, did you do a "nodetool compact" (major compaction) before estimating the size?

That is my understanding from previous messages.

-- Gleb.

Felix-zhoux commented 3 years ago

By the way, did you do a "nodetool compact" (major compaction) before estimating the size?

Yes, before getting these numbers I performed nodetool compact on all three tables (testgsi_btb2, testnogsi_btb3, paxos). And the gc_grace_seconds and paxos_grace_seconds parameters for the two tables had been set, as I mentioned above.

Then set the gc_grace_seconds and paxos_grace_seconds parameters for the two tables respectively:

Table with GSI: gc_grace_seconds=1, paxos_grace_seconds=0
Another table: gc_grace_seconds=0, paxos_grace_seconds=0

cqlsh> ALTER TABLE alternator_testgsi_btb2.testgsi_btb2 WITH gc_grace_seconds = 1;
cqlsh> ALTER TABLE alternator_testgsi_btb2.testgsi_btb2 WITH paxos_grace_seconds = 0;

cqlsh> ALTER TABLE alternator_testnogsi_btb3.testnogsi_btb3 WITH gc_grace_seconds = 0;
cqlsh> ALTER TABLE alternator_testnogsi_btb3.testnogsi_btb3 WITH paxos_grace_seconds = 0;
nyh commented 3 years ago

I haven't worked on this issue. Unassigning myself for now.

kostja commented 7 months ago

This works as designed / is a known limitation. It could stand as a placeholder for immediate consistency or LWT over Raft, but we have enough of those.