Open gmilescu opened 1 year ago
From the Slack thread:
Eduardo Ohe 1 day ago
Hi @Guilherme Nogueira / @khorolets, I had a meeting with @Pavel Kudinov today and I need to prioritize the work on the access_key table that we had to drop recently. We need to figure out a better schema design. @khorolets mentioned that we could rearchitect this table; I think we could do a quick meeting (or continue to chat here async) to define what schema would work and find common ground that could satisfy:
Replies:
Guilherme Nogueira 1 day ago
Hi @Eduardo Ohe. Let's gather all the info async, then we can have a meeting if we still need to discuss it further. @khorolets, could you please share the access patterns (all queries) that this table needs to satisfy?
Pavel Kudinov 1 day ago
I believe @khorolets shared all the details in our internal confluence, and I shared them with you Eduardo
Eduardo Ohe 24 hours ago
Sounds good @Guilherme Nogueira, and thanks @Pavel Kudinov. I’ll get the queries from it and share them here soon
Eduardo Ohe 23 hours ago
account_access_keys definition from the DB design doc: Screenshot 2023-06-08 at 10.41.05 AM.png
Eduardo Ohe 23 hours ago
My analysis on Mar 1st about hot partitions: Screenshot 2023-06-08 at 10.42.07 AM.png
Eduardo Ohe 23 hours ago
How people use the API on top of the access keys data:
https://docs.near.org/api/rpc/access-keys#view-access-key-list
Access Keys | NEAR Documentation (docs.near.org): The RPC API enables you to retrieve information about an account's access keys.
Eduardo Ohe 23 hours ago
TL;DR: we need to reconcile how we query the data (which fields we can use in the WHERE clause) with the table partitioning.
Eduardo Ohe 23 hours ago
Screenshot 2023-06-08 at 10.48.00 AM.png
Eduardo Ohe 23 hours ago
I hope this helps to start this async brainstorm
Guilherme Nogueira 4 hours ago
The proposed design did work in the sense that it brought back all the account_id entries in a single query. It should be okay-ish if the partitions stay up to hundreds of thousands of entries (and you page the query), but it gets really messy when a single account_id crosses the multi-gigabyte level, which I understand is an outlier.
In a general sense, you need to think about how to break the partitions down into smaller pieces, then query them all to fetch the data. I'm not sure how the data is serialized, but maybe you can create buckets related to block_height and use those to scan through all possible block_height ranges. This means the application has to understand the block_height sequence and scan through the buckets sequentially.
One other way, which I'm not sure is possible, is to bucket by time, such as YYYY-MM-DD. This usually makes the scanning logic easier.
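To make the time-bucket idea concrete, here is a minimal Python sketch; the nanosecond block-timestamp assumption follows NEAR block headers, and the daily granularity is illustrative, not an agreed value. The block_height-range variant is sketched together with the concrete proposal further below.

# Minimal sketch of bucketing by calendar day (YYYY-MM-DD).
# Assumes block timestamps are in nanoseconds, as in NEAR block headers.
from datetime import datetime, timezone

def time_bucket(block_timestamp_ns: int) -> str:
    seconds = block_timestamp_ns / 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")

# A partition key of (account_id, time_bucket) would then cap partition
# growth at one day's worth of changes per account.
print(time_bucket(1_686_240_000_000_000_000))  # "2023-06-08"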
Eduardo Ohe 27 minutes ago
Yes, the load test was to check whether ScyllaDB could handle these hot partitions or not.
I think the next step is for @khorolets, @Pavel Kudinov and me to see if we can change the query pattern and, consequently, the partitioning of this table.
Eduardo Ohe 19 minutes ago
@Guilherme Nogueira what if we create:
Eduardo Ohe 18 minutes ago
^ in case we can't change the query pattern
Guilherme Nogueira 14 minutes ago
@Eduardo Ohe preferably you should have mechanisms in place to keep partitions from growing unbounded. Today the issue was with a handful of partitions that crossed the limit (the ones reported previously), but as your application grows, other partitions will become large themselves.
Eduardo Ohe 12 minutes ago
Yes, I know what you mean, but by removing the blob field the partitions should stay small. And this is only in case we can't change the query pattern.
Eduardo Ohe 8 minutes ago
unless you have another solution where we can keep the same query pattern:
CREATE TABLE IF NOT EXISTS state_changes_access_key (
    account_id varchar,
    block_height varint,
    block_hash varchar,
    data_key varchar,
    data_value BLOB,
    PRIMARY KEY ((account_id, data_key), block_height)
) WITH CLUSTERING ORDER BY (block_height DESC);

-- retrieve the list of access keys by the query
SELECT data_key, data_value
FROM state_changes_access_key
WHERE account_id = ? AND block_height <= ?;
Guilherme Nogueira 6 minutes ago
Scylla will issue warnings when a partition crosses compaction_rows_count_warning_threshold, which is 1 million rows. In the largest case you have 42x that value, way above the best-practice limit. There are also warnings about partition size (which caused issues on the cluster) above 1 GB per partition.
So, in a general sense, in order to store the 42 million items of the largest partition and still stay below 1 GB, each item could be at most ~16 bytes, which is too small to store anything significant.
The idea of splitting it into a metadata/data column makes sense from a size perspective, but if the items are still allowed to grow unbounded you will eventually hit the sizing limits anyway.
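For reference, Scylla also records the partitions it has flagged in the system.large_partitions table, so the offenders can be listed directly. A minimal sketch, assuming a reachable node at 127.0.0.1 and the Python cassandra/scylla driver:

# List partitions Scylla has flagged as too large (there are similar
# system.large_rows / system.large_cells tables).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect()
for row in session.execute("SELECT * FROM system.large_partitions"):
    print(row)
cluster.shutdown()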
Eduardo Ohe 3 minutes ago
Got it, so the limit is not 100 GB.
It's actually 1M rows and 1 GB max per partition.
Guilherme Nogueira 2 minutes ago
Correct. Those are the best-practices limits for efficiency.
Eduardo Ohe < 1 minute ago
OK, in that case we don't have any alternative besides changing the query itself and the corresponding partitioning; we will evaluate that internally.
Eduardo Ohe < 1 minute ago
Thanks @Guilherme Nogueira appreciate the support
Guilherme Nogueira < 1 minute ago
Sure Eduardo. Let me know if the team has further questions
by 63b75612741248746bf8a243
TLDR of the thread below:
The actual best-practices limits for efficiency are 1M rows and 1 GB max per partition. In this case we don't have any alternative besides changing the query itself and the corresponding partitioning.
by 63b75612741248746bf8a243
Given the ScyllaDB best-practices limits for efficiency of 1M rows and 1GB max per partition, we could have a helper table to serve as a partition map.
We could define a helper_partition_id column that could be calculated based on ranges of block_height.
There are some block_height outliers that can contain more records per block. e.g.
but the majority contain 100 or fewer records per block_height
The boundaries that we currently have are:
min block_height = 10,018,925
max block_height = 94,050,975
If we consider ~100 rows per block_height and the limit of 1M rows per partition, each partition can hold a range of 10K block_heights (10K blocks × ~100 rows ≈ 1M rows). In other words:
each range of 10K block_heights becomes one partition, e.g.
helper_partition_id 10,010,000 → includes block_heights from 10,010,000 to 10,020,000
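A minimal sketch of the helper_partition_id derivation implied by this example (the 10K bucket size is the value discussed above, still open for adjustment):

def helper_partition_id(block_height: int) -> int:
    # 10K blocks per bucket; at ~100 rows per block this keeps a bucket
    # around the 1M-row per-partition guideline.
    blocks_per_partition = 10_000
    return (block_height // blocks_per_partition) * blocks_per_partition

assert helper_partition_id(10_018_925) == 10_010_000  # min block_height
assert helper_partition_id(94_050_975) == 94_050_000  # max block_height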
Here is a suggestion of schema/partition/queries:
CREATE TABLE IF NOT EXISTS state_changes_access_key_partition_map (
    account_id varchar,
    helper_partition_id varint,
    min_block_date date,      -- optional
    max_block_date date,      -- optional
    min_block_height varint,  -- optional
    max_block_height varint,  -- optional
    PRIMARY KEY (account_id, helper_partition_id)
) WITH CLUSTERING ORDER BY (helper_partition_id DESC);

CREATE TABLE IF NOT EXISTS state_changes_access_key (
    account_id varchar,
    helper_partition_id varint,
    block_height varint,
    block_hash varchar,
    data_key varchar,
    data_value BLOB,
    PRIMARY KEY ((account_id, helper_partition_id), block_height)
) WITH CLUSTERING ORDER BY (block_height DESC);

-- retrieve the list of partition ids
-- (here helper_partition_id is just a clustering column)
SELECT helper_partition_id
FROM state_changes_access_key_partition_map
WHERE account_id = ? AND helper_partition_id <= ?;

-- retrieve the list of access keys
-- (here helper_partition_id is a real partition column)
SELECT data_key, data_value
FROM state_changes_access_key
WHERE account_id = ? AND helper_partition_id IN (?);
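A hedged sketch of the resulting two-step read path. The keyspace name, node address and the Python cassandra/scylla driver are assumptions; it queries each bucket separately instead of one big IN, which keeps every read bounded to a single partition (the IN form above works as well):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])            # assumed contact point
session = cluster.connect("state_indexer")  # hypothetical keyspace

map_query = session.prepare(
    "SELECT helper_partition_id FROM state_changes_access_key_partition_map "
    "WHERE account_id = ? AND helper_partition_id <= ?"
)
data_query = session.prepare(
    "SELECT data_key, data_value FROM state_changes_access_key "
    "WHERE account_id = ? AND helper_partition_id = ? AND block_height <= ?"
)

def access_keys(account_id: str, block_height: int):
    bucket = (block_height // 10_000) * 10_000
    # Step 1: one single-partition read to learn which buckets exist for
    # this account at or below the requested height (newest first).
    buckets = [r.helper_partition_id
               for r in session.execute(map_query, (account_id, bucket))]
    # Step 2: one bounded single-partition read per bucket.
    for b in buckets:
        for row in session.execute(data_query, (account_id, b, block_height)):
            yield row.data_key, row.data_value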
by 63b75612741248746bf8a243
Eduardo Ohe, thanks for the great work you did! Could you please elaborate on a few questions I have:
Appreciate your answers!
by 61c24882bce5e000697cf541
Thanks Bohdan Khorolets for the kind words and feedback, sure no problem:
by 63b75612741248746bf8a243
Bohdan Khorolets
Pavel Kudinov
I’ll close this ticket since we have the new schema/design but please feel free to reopen in case it doesn’t work.
by 63b75612741248746bf8a243
Eduardo Ohe let me keep it open a bit. I can't dive deep into it at the moment. I prefer to keep it open until I'm sure we're good here.
by 61c24882bce5e000697cf541
Sure, no problem Bohdan Khorolets
by 63b75612741248746bf8a243
Eduardo Ohe have we considered a partition key with a bigger number of columns?
CREATE TABLE IF NOT EXISTS account_access_keys (
    account_id varchar,
    block_height varint,
    active_access_keys map,
    PRIMARY KEY ((account_id, block_height))
)
With a quick check (I may be wrong here), we don’t need sorting in our read queries.
Also, as I understand Scylla, small partitions should not be a problem either.
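One thing to keep in mind with this alternative: both partition-key components must be supplied with equality, so the read becomes a point lookup that requires knowing the exact block_height up front. A minimal sketch (hypothetical keyspace and node, Python cassandra/scylla driver assumed):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("state_indexer")  # hypothetical
lookup = session.prepare(
    "SELECT active_access_keys FROM account_access_keys "
    "WHERE account_id = ? AND block_height = ?"
)
# Both partition-key columns are bound exactly; no range scan is possible here.
row = session.execute(lookup, ("example.near", 94_050_975)).one()
active_keys = row.active_access_keys if row else {}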
cc Bohdan Khorolets
by 61c2497ca54af90069b78020
While brainstorming this issue, I actually realized that I'm trying to design our RocksDB solution.
Instead of doing this, we should investigate how Nearcore storage is implemented and maybe use an existing approach. My next plan is to read this https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/155287563/How+state+is+stored+in+DB
by 61c2497ca54af90069b78020
Hi Olga Telezhnaya, sounds good, no problem. About “have we considered a partition key with a bigger number of columns?” — this is a good question. The partition key is also defined by how we are going to query the table, so if we can add more columns to the query then we could add those columns to the partition key.
by 63b75612741248746bf8a243
Intro & recap
This task is a result of investigating the problem with the indexing speed of the state-indexer.
We handled the problem with the state-indexer speed; running it with a concurrency level of 100, we got a speed of around 80 BPS, which allowed us to collect the history in a couple of days.
The problem
As stated in the task https://pagodaplatform.atlassian.net/browse/DPLT-996, the design of the account_access_keys table and the logic around it didn’t prove itself.
References:
The goal/task of this ticket
We need to design a completely different approach to how we collect and store the data to serve the needs of the query.view_access_key_list RPC method.
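For reference, a request to the RPC method this table has to serve looks roughly like this (endpoint and account_id are just examples; the request shape follows the docs linked earlier in the thread):

# Example call to the view_access_key_list RPC method that this table backs.
import requests

payload = {
    "jsonrpc": "2.0",
    "id": "dontcare",
    "method": "query",
    "params": {
        "request_type": "view_access_key_list",
        "finality": "final",
        "account_id": "example.near",
    },
}
response = requests.post("https://rpc.mainnet.near.org", json=payload, timeout=10)
print(response.json()["result"]["keys"])  # list of {public_key, access_key} entries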
In the comment section of the Database design doc, I made a call for help with the design. For convenience, I am copying my original cry for help to this ticket next:
Cry for help
With the account_access_keys table [Database design | account_access_keys|https://pagodaplatform.atlassian.net/wiki/spaces/DAT/pages/318013899/Database+design#account_access_keys] we have a problem with the design.
The design was invented in the ticket https://pagodaplatform.atlassian.net/browse/DPLT-822
Logic explanation:
Thus, for each block we have a snapshot of the specific account with the list of its active access keys. That is necessary to serve the query.view_access_key_list RPC method queries.
The problems
Size and speed
The data in that table is too big. The Scylla team reached out to us to let us know that the size of some partitions is too big. See large items near the cluster.
This also affects the speed, since:
Expensive operation for the unworthy reason
In addition to the size and speed problem, we have a few others.
First of all, most of the time the slowest update-access-key operations we observe are for relay.aurora. I can confirm that this action has nothing to do with adding a new AccessKey to the account; it is just updating the nonce of the AccessKey after the transaction it signed.
I am even afraid that we always assume a new AccessKey is added every time, so we might have a bunch of duplicates, especially for the fattest accounts from the spreadsheet I’ve shared above. Anyway, it seems far too expensive to perform such an operation just to update the nonce of an AccessKey. This entire table and approach should be rethought and redesigned, in my opinion. And I need your brains, guys.
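A rough illustration of the cost (hypothetical data, not the real nearcore types): because each block stores the full snapshot of active access keys, a pure nonce bump still re-writes the whole map.

# Hypothetical snapshot rows; keys and values are illustrative only.
snapshot_block_100 = {
    "ed25519:AAAA": {"nonce": 41, "permission": "FullAccess"},
    "ed25519:BBBB": {"nonce": 7, "permission": "FunctionCall"},
}

# Block 101: a transaction signed with the first key only bumps its nonce...
snapshot_block_101 = dict(snapshot_block_100)
snapshot_block_101["ed25519:AAAA"] = {"nonce": 42, "permission": "FullAccess"}

# ...yet the row written for block 101 carries every key again, so accounts
# with many keys re-write large snapshots on nearly every transaction.
assert len(snapshot_block_101) == len(snapshot_block_100)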
Please ask me questions, and share your thoughts.
ND-536 created by None