Redesign the approach to serve the `query.access_key_list`

gmilescu commented 1 year ago

Intro & recap

This task is a result of investigating the problem with the indexing speed of the state-indexer.

We had to disable the logic around the table account_access_keys
We had to drop the table from the Scylla cluster completely because the partition size grew to an enormous size (>100GB)
Then we were able to enable concurrency for the state-indexer thus improving its speed.

Thus we handled the problem with the state-indexer speed and ran it with the concurrency level of 100 we got a speed of around 80 BPS which allowed us to collect the history in a couple of days.

The problem

As stated in the task https://pagodaplatform.atlassian.net/browse/DPLT-996 the design of the table account_access_keya and the logic around it didn’t prove.

Partitioning by account only leads to unacceptable big partition sizes (over 100GB for some of them like relay.aurora). Mainly the cause is in the data duplication we cannot avoid following this design.
Indexer logic for filling that table involves sequential operations: read at first, update, and then write a new record, and cannot be parallelized. Thus slowing down the indexer dramatically

References:

Description of the account_access_keys table in database design doc https://pagodaplatform.atlassian.net/wiki/spaces/DAT/pages/318013899/Database+design#account_access_keys
In the original ticket, we designed the table and logic in https://pagodaplatform.atlassian.net/browse/DPLT-822
The description and comment section of the https://pagodaplatform.atlassian.net/browse/DPLT-996 might be useful to get all the context

The goal/task of this ticket

We need to design a completely different approach to how we collect and store the data to serve the needs of the query.view_access_key_list RPC method.

In the comment section of the Database design doc, I made a call for help with the design. For convenience, I am copying my original cry for help to this ticket next:

Cry for help

With the account_access_keys table [Database design | account_access_keys|https://pagodaplatform.atlassian.net/wiki/spaces/DAT/pages/318013899/Database+design#account_access_keys] we have a problem with the design.

The design was invented in the ticket https://pagodaplatform.atlassian.net/browse/DPLT-822

Logic explanation:

We observe the StateChangeCauseValueView::AccessKeyUpdate or StateChangeCauseValueView::AccessKeyDeletion
We get the record from account_access_keys for the account and block_height <= ? current block height
We insert/delete the AccessKey to the retrieved account_access_keys.active_access_key
We store the adjusted data as a new record in the account_access_keys table setting the block_height value to the current one

Thus we have snapshots for the specific account with a list of its active access keys for each block. That is necessary to serve the query.view_access_key_list RPC method queries.

The problems

Size and speed

The data is too big in that table. The Scylla team reached out to us letting us know that the size of some partitions is too big. See large items near the cluster

This also affects the speed, since:

we need to index sequentially with a concurrency set to 1 to have this data consistent
also, since we cannot parallelize read-and-write operations for the access key update its speed depends on the data size and network latency. Thus sometimes it takes too much time and speeds down the entire indexer. https://pagodaplatform.atlassian.net/browse/DPLT-996

Expensive operation for the unworthy reason

In addition to the size and speed problem, we have a few others.

First of all, for the most times, we can observe the slowest update access key operation, with relay.aurora. I can confirm that this action has nothing to do with adding a new AccessKey to the account, but just updating the nonce for the AccessKey after the transaction it has been signed with.

~~I am even afraid that we always assume it’s a new AccessKey added every time, so we might have a bunch of duplicates, especially for the fattest accounts from the spreadsheet I’ve shared above.~~

Anyway, it seems quite too expensive to perform such an operation just to update the nonce for the AccessKeys. This entire table and approach should be rethought and redesigned in my opinion. And I need your brains, guys.

Please ask me questions, and share your thoughts.

ND-536 created by None

gmilescu commented 1 year ago

From the Slack thread:

Eduardo Ohe 1 day ago

Hi @Guilherme Nogueira / @khoroletsI had a meeting with @Pavel Kudinov today and I need to prioritize the work about the access_key table that we had to drop recently.We need to figure out a better schema design, @khorolets mentioned that we could rearchitect this table, I think we could do a quick meeting (or continue to chat here async) to define what schema would work and find a common ground that could satisfy:

Required Read query pattern from the API ( @khorolets can help us here )
ScyllaDB clusters Limitation + Table Schema Redesign ( @Guilherme Nogueira and @Eduardo Ohe )

Replies:

Guilherme Nogueira 1 day ago
Hi @Eduardo Ohe. Let's gather all the info async, then we can have a meeting if we still need to discuss it further.@khorolets could you please share the access patterns (all queries) that this table needs to satisfy?

Pavel Kudinov 1 day ago
I believe @khorolets shared all the details in our internal confluence, and I shared them with you Eduardo

Eduardo Ohe 24 hours ago
Sounds good @Guilherme Nogueira, and thanks @Pavel Kudinov I’ll get the queries from it and share here soon

Eduardo Ohe 23 hours ago
account_access_keys definition rom DB design doc:Screenshot 2023-06-08 at 10.41.05 AM.png

Eduardo Ohe 23 hours ago
My analysis on Mar 1st about hot partitions:Screenshot 2023-06-08 at 10.42.07 AM.png

Eduardo Ohe 23 hours ago
How people uses the API on top on the access keys data:
https://docs.near.org/api/rpc/access-keys#view-access-key-list

http://docs.near.org [*Access Keys | NEAR Documentation*|https://docs.near.org/api/rpc/access-keys#view-access-key-list]The RPC API enables you to retrieve information about an account's access keys.

Eduardo Ohe 23 hours ago
TLDR; We need to accommodate how we query the data (what fields we can use in the WHERE clause) vs Table partition

Eduardo Ohe 23 hours ago
Screenshot 2023-06-08 at 10.48.00 AM.png

Eduardo Ohe 23 hours ago
I hope this helps to start this async brainstorm

Guilherme Nogueira 4 hours ago
The proposed design did work in the sense it brought all the account_id entries in a single query. It should be okay-ish if the partitions are up to hundreds of thousands of entries (and you page the query). But it get really messy when the account_id is crossing the multi-gigabyte levels, which I understand is an http://outlier.In a general sense, you need to think how to break down the partitions into smaller bits, then query it all to fetch the data.I'm not sure how is the data serialized, but maybe you can create buckets that has relations with block_height and use that to scan through all possible ranges of block_height. This means that the application has to understand the sequence of block_height and sequentially scan through them.One other way, which I'm not sure is possible, is to bucket by time, such as YYYY-MM-DD. This usually makes scanning logic easier.

Eduardo Ohe 27 minutes ago
Yes, the load test was to make sure ScyllaDB could handle these hot partitions or not.
I think the next step is @khorolets, @Pavel Kudinov and I see if we can change the query pattern and consequently the partition of this table1

Eduardo Ohe 19 minutes ago
@Guilherme Nogueira what if we create:

a table without the blob field so even with hot partitions we won't cross the 100GB limit to work as a lookup with the same current partition key with account_id and cluster column block_height to be able to use the order desc
+ another table with the blob field with a partition key with account_id+ block_height

Eduardo Ohe 18 minutes ago
^ in case we can't change the query pattern

Guilherme Nogueira 14 minutes ago
@Eduardo Ohe preferrably you should have mechanisms in place to avoid partitions growing unbound. Today, the issue was with a handful of partitions that crossed the limit (the ones reported previously) but as your application grows, other partitions will become large themselves

Eduardo Ohe 12 minutes ago
Yes I know what you mean, but by removing the blob field, it should be small. And this is in case we can't change the query pattern

Eduardo Ohe 8 minutes ago
unless you have another solution where we can keep the same queries pattern:

CREATE TABLE IF NOT EXISTS state_changes_access_key (
    account_id varchar,
    block_height varint,
    block_hash varchar,
    data_key varchar,
    data_value BLOB,
    PRIMARY KEY ((account_id, data_key), block_height)
) WITH CLUSTERING ORDER BY (block_height DESC)

--retrieve the list of access keys by the query

SELECT 
  data_key, data_value 
FROM 
  state_changes_access_key
WHERE
  account_id = ? 
  AND block_height <= ?;

Guilherme Nogueira 6 minutes ago
Scylla will issue warnings when a partition crosses compaction_rows_count_warning_threshold which is 1 million rows. In the largest case, you have 42x that value. Way above the best practice limit.There are also warnings about the partition size (which caused issues on the cluster) above 1Gb per partition.So in a general sense, in order to store the 42mi items on the largest partition and still be below 1Gb, each item would be a maximum of ~16 bytes, which is too small to store anything significant.The idea of splitting it into a metadata/data column makes sense from a size perspective, but if the items are still allowed to continue to grow unbound you will still hit sizing limits eventually.

Eduardo Ohe 3 minutes ago
Got it, so the limit is not 100GB.
It's actually:

Row: 1 million
Size: 1GB

per partition

Guilherme Nogueira 2 minutes ago
Correct. Those are the best-practices limits for efficiency.

Eduardo Ohe < 1 minute ago
Ok in that case we don't have any alternative besides change the query itself and the correspondent partition, we will evaluate that internally

Eduardo Ohe < 1 minute ago
Thanks @Guilherme Nogueira appreciate the support

Guilherme Nogueira < 1 minute ago
Sure Eduardo. Let me know if the team has further questions

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

TLDR of the thread below:
Actual best-practices limits for efficiency are 1M rows and 1GB max per partition. In this case we don't have any alternative besides change the query itself and the correspondent partition.

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

Given the ScyllaDB best-practices limits for efficiency of 1M rows and 1GB max per partition, we could have a helper table to serve as a partition map.

We could define a helper_partition_id column that could be calculated based on ranges of block_height.

There are some block_height outliers that can contain more records per block. e.g.

but the majority contains 100 or less records per block_height

The boundaries that we have currently is:

min block_height = 10,018,925

max block_height = 94,050,975

if we consider ~100 rows per block_height , and the limit of 1M rows per partition we can have 10K partitions to hold 1M blocks, in other words:

ranges on 10K block_height will be a partition. e.g.

helper_partition_id 10,010,000 → includes block_heights from 10,010,000 to 10,020,000

Here is a suggestion of schema/partition/queries:

CREATE TABLE IF NOT EXISTS state_changes_access_key_partition_map (
    account_id varchar,
    helper_partition_id varint, 
    min_block_date date, -- optional
    max_block_date date, -- optional
    min_block_height varint, -- optional
    max_block_height varint, -- optional
    PRIMARY KEY (account_id, helper_partition_id)
) WITH CLUSTERING ORDER BY (helper_partition_id DESC)

CREATE TABLE IF NOT EXISTS state_changes_access_key (
    account_id varchar,
    helper_partition_id varint,
    block_height varint,
    block_hash varchar,
    data_key varchar,
    data_value BLOB,
    PRIMARY KEY ((account_id, helper_partition_id), block_height)
) WITH CLUSTERING ORDER BY (block_height DESC)

--retrieve the list of partition_ids
SELECT 
  helper_partition_id
FROM 
  state_changes_access_key_partition_map
WHERE
  account_id = ?
  AND helper_partition_id <= ; -- here helper_partition_id is just a cluster column

-- retrieve the list of access keys 
SELECT 
  data_key, data_value 
FROM 
  state_changes_access_key
WHERE
  account_id = ? 
  AND helper_partition_id IN (); - here helper_partition_id is a real partition
 column

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

Eduardo Ohe thanks for such great work you did! Could you please elaborate more on a few questions I have:

What are the assumptions on the writing to that table?
What about the performance with ScyllaDB and us trying to get the list of records instead of a single one?
From what I see we’re going to have a lot of partitions for that “fat” accounts and presumably only 1 partition for the small ones. So in order to get the entire list of the access keys for a “fat” account we would provide the complete list of partitions (even if it is an enormous number). Did I read you correctly? If so, my question #2 is raised again.

Appreciate your answers!

by 61c24882bce5e000697cf541

gmilescu commented 1 year ago

Thanks Bohdan Khorolets for the kind words and feedback, sure no problem:

While reading the StreamerMessage I assume that it will be ordered by the block_height, we can write as we were writing before normally into the state_changes_access_key table and any time we cross the 1K boundaries of block_height for an account_id we can insert into the state_changes_access_key_partition_map table.
Performance wise in the writing phase it will do an extra writing in the map table, in the reading phase it will increase based on the number of partitions that the account_id has data e.g. account_id has data in 10 helper_partition_id, this means 1 read in the map table + 10 reads in the data table.
We can have a lot of partitions for both “fat” and small accounts if they have data spread in different helper_partition_ids the mechanism is the same for both, it will require these round trips to get the data. According to Guilherme we also can NOT use the IN clause and he suggested that we do the reads and handle/merge the data in the app side. https://pagodaplatform.slack.com/archives/C04V55V3YQJ/p1686607171667509?thread_ts=1686606615.333029&cid=C04V55V3YQJ each round trip cost is on average ~0.37 mili
seconds and 3 milliseconds for 99% percentile

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

Bohdan Khorolets
Pavel Kudinov
I’ll close this ticket since we have the new schema/design but please feel free to reopen in case it doesn’t work.

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

Eduardo Ohe let me keep it open a bit. I can't drive deep into it at the moment. I prefer keep it open unless I'm sure we're good here.

by 61c24882bce5e000697cf541

gmilescu commented 1 year ago

Sure, no problem Bohdan Khorolets

by 63b75612741248746bf8a243

gmilescu commented 1 year ago

Eduardo Ohe have we considered a partition key with a bigger number of columns?

CREATE TABLE IF NOT EXISTS account_access_keys (
    account_id varchar,
    block_height varint,
    active_access_keys map,
    PRIMARY KEY ((account_id, block_height), )
)

With a quick check (I may be wrong here), we don’t need sorting in our read queries.
Also, as I get the idea of Scylla, small partitions should not be a problem as well.
cc Bohdan Khorolets

by 61c2497ca54af90069b78020

gmilescu commented 1 year ago

While brainstorming this issue, I actually realized that I'm trying to design our Rocks DB solution.
Instead of doing this, we should investigate how Nearcore storage is implemented and maybe use an existing approach. My next plan is to read this https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/155287563/How+state+is+stored+in+DB

by 61c2497ca54af90069b78020

gmilescu commented 1 year ago

Hi Olga Telezhnaya sounds good. no problem. About “have we considered a partition key with a bigger number of columns?”, this is a good question, the partition key is also defined by how we are going to query the table, so if we can add more columns in the query then we could add those column in the partition key.

by 63b75612741248746bf8a243

near / nearcore