paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0
4.02k stars 1.24k forks source link

Temperature check: support address_getAppearances #4054

Open perama-v opened 1 year ago

perama-v commented 1 year ago

Describe the feature

This issue is to introduce the concept of "address appearances" and assess appetite for inclusion in reth.

Specifications

See also the following specifications:

Motivation

A user (or local application) running reth can start with an address (e.g., 0x30a4...1382) important to them. They query reth and get the set of transactions that are relevant to that address.

>> {"jsonrpc":"2.0","id":1,
    "method":"address_getAppearances",
    "params":["0x30a4639850b3ddeaaca4f06280aa751682f11382"]}
<< {"id":1,"jsonrpc":"2.0","result":[
    {"blockNumber":"0xd63f51","transactionIndex":"0xe6"},
    {"blockNumber":"0xd63f5a","transactionIndex":"0x11b"},
    {"blockNumber":"0xd68154","transactionIndex":"0x6"},
    ...

This kicks off the journey, which can now proceed in many ways:

As addresses are the foundation of user and protocol activity, the feature is general purpose.

Why "in reth" and not separate

The presence of the address_getAppearances natively within reth would mean first class support for user-focused applications. This could lead to:

As the method is general purpose it always supports new wallets/protocols that a user is suddenly interested in.

Prior work

The UnchainedIndex, accessible via trueblocks-core (https://github.com/TrueBlocks/trueblocks-core) is an example of the utility of the information that address_getAppearances would provide.

The method is equivalent to the following command in trueblocks-core:

> chifra list 0x30a4639850b3ddeaaca4f06280aa751682f11382
address blockNumber transactionIndex
0x30a4639850b3ddeaaca4f06280aa751682f11382  14040913    230
0x30a4639850b3ddeaaca4f06280aa751682f11382  14040922    283
0x30a4639850b3ddeaaca4f06280aa751682f11382  14057812    6
...

Disk impact

The support for address_getAppearances creates additional disk burden. An upper ceiling estimate is 80GB.

The estimate is based on the UnchainedIndex (80GB), which is specially designed for distribution/sharding on IPFS. Such requirements mean that the index adds immutable pieces of data every 12 hours. An address that appears every 12 hours is therefore duplicated.

There are likely means to significantly reduce the disk footprint of a native reth version of this data.

Additional context

What is an "appearance"

Appearances are defined in a way to capture instances when an address arises in a meaningful way in the chain.

A simple example is the subject of a CALL opcode to an EOA. The recipient "appears" in that transaction (they receive ether). Upon tracing, the meaning of the transaction can be determined.

An "appearance" is defined in the following PR:

address_getAppearances method

The method is defined in the following PR:

This is a chicken an egg situation: The method is stronger if there is an indication of existence/support in a client and vice versa.

Temperature check

Seeking feedback on this concept. Do any of the following resonate with you? Please feel free to add thoughts of any kind

  1. Is this desirable?
  2. Would a PR implementing a flag-based address_ namespace for address_getAppearances be considered?
  3. "In favour but, would like to see more concrete disk footprint prototype/estimate"
  4. "I can see how this would enable some_application_idea"
  5. "Please keep it off the node and in companion applications/datasets"
  6. <other>

Thanks for your time

### Checklist
- [ ] `address_` namespace addition with placeholder impl of `address_getAppearances` (#4146)
- [ ] implement a `get_appearances_from_transaction()` method (#4162)
- [ ] implement `eth_getAppearancesInBlock` JSON-RPC method
- [ ] define new `AppearanceHistory` table (https://github.com/paradigmxyz/reth/pull/4182)
- [ ] Populate the index during a stage
- [ ] implement `address_getAppearances` endpoint (`Address -> Vec<(BlockNumber, TxIndex)>`)
- [ ] Design mechanism to add APIs like this w/o introducing large main codebase maintenance overheads (#4325)
tjayrush commented 1 year ago

Is this desirable?

https://www.google.com/search?q=getting+a+list+of+transactions+per+address&oq=getting+a+list+of+transactions+per+address&aqs=chrome..69i57j33i160l3.7575j0j7&sourceid=chrome&ie=UTF-8#ip=1

Also, without a complete history of appearances, it's impossible to build a reconcilable accounting (in both the financial sense and the 'state change' sense).

This is exactly why all of us are struggling to do automated accounting "off-chain" against an "on-chain" system that reconciles 240,000,000 addresses every 12 seconds.

mattsse commented 1 year ago

definitely supportive, defer to @rakita @joshieDo if this is currently technically possible or if it would require additional indexes (I think it would)

tjayrush commented 1 year ago

I'm not speaking for the original poster here, but, yes, this would require an additional index and associated hard drive space. That's why it should be optional.

It's also why I proposed (and @perama-v wrote) this feature as well: https://github.com/ethereum/execution-apis/pull/452.

eth_getAddressesInBlock does not require any additional indexes or hard drive space (it can be done on-demand), but it does go a long way towards enabling address_getAppearances.

address_getAppearances, in order to build the index you're asking about, would need to repeatedly call eth_getAddressesInBlock (or it could be called during syncing).

If eth_getAddressesInBlock existed even if there was no additional index in the node, at least external indexers could build their own index off-chain.

I'm not suggesting that address_getAppearances doesn't get added. I think it should because it's so necessary, but a good starting point might be eth_getAddressesInBlock.

perama-v commented 1 year ago

Here are some notes after reviewing existing structures

AccountHistory comparison

There is already the AccountHistory table that stores all transaction identifiers where an address (account) state was changed.

This AccountHistory is a subset of the data address_getAppearances method would return. Additional data includes items such as:

So, a new table (such as AppearancesHistory) is required to capture this additional data. Either:

Use global TxNumber

Fortunately, transactions are represented by a global number TxNumber rather than index in block.

Relevant tables:

That means we can have a simple mapping of Address -> List(TxNumber)

IntegerList

So, similar compression can be used for an AppearancesHistory table

New AppearanceHistory table

A new table AppearanceHistory of a similar nature to AccountHistory could be made

///  AppearanceHistory (new) table

/// List of TxNumber (global transaction number, not index in block)
pub type TransactionNumberList = IntegerList;
table!(
/// Stores pointers to transaction appearance-set with appearances for each address key.
/// ( AppearanceHistory ) Address | TransactionNumberList
)

Response algorithm

If reth stores global tx numbers they can be converted to (block_number, transaction_index) for the JSON-RPC method address_getAppearances response.

// Pseudocode for address_getAppearances method response construction.
fn address_get_appearances(address: Address) -> Vec<(BlockNumber, TxIndex)> {
    // AppearanceHistory table (new)
    let tx_list: TransactionNumberList = appearance_history_table.get(address);
    let mut appearances = vec![];
    for global_tx_num in tx_list.items() {
        // TransactionBlock table
        let block_num = transaction_block_table.get(global_tx_num);
        // BlockBodyIndices table
        let tx_index_in_block = block_body_indices_table.get(global_tx_num);
        appearances.push((block_num, tx_index_in_block));
    }
    appearances
}

New IndexAppearanceHistoryState stage

A stage called IndexAppearanceHistoryState could exist, structurally similar to IndexAccountHistoryStage

TODO - when is a good time to do this stage?

joshieDo commented 1 year ago

i'm conservatively supportive. Between this, and otterscan index needs, we probably need better abstractions to accommodate feature/flag based indexes without being too messy

I'm skeptical about the estimate... 80GB seems low, but if it's optional it's whatever

A contract is called as a read-only method.

If i'm not mistaken, this implies having to do "something" during the execution stage as well

perama-v commented 1 year ago

I'm skeptical about the estimate... 80GB seems low, but if it's optional it's whatever

The estimate is based on the UnchainedIndex, which has the same data (address-->appearances), but is represented in a custom format suited for distribution by 12 hour block ranges (total size ~80gb, no compression)

If i'm not mistaken, this implies having to do "something" during the execution stage as well

Yes, for each tx certain locations are checked for addresses. So this is additional work that must be performed. The specifics are in the spec in the linked execution apis PR.

The work roughly amounts to a trace with calltracer, and checking fields within that response. Plus a check of addresses in withdrawals and block rewards.

perama-v commented 1 year ago

Added a task list for the foreseeable steps.

Regarding the proposed implement IndexAppearanceHistoryState stage:

One consideration is that it this re-executes the transactions, separately from the Execution stage. One benefit is that it keeps the Execution stage clean. Another is possible reuse of the CallTracer for a simple implementation. Instead to avoid re-execution of transactions, it could be brought into Execution stage.

tjayrush commented 1 year ago

So, a new table (such as AppearancesHistory) is required to capture this additional data. Either:

  • New table contains all appearances (including state changes covered by AccountHistory table). This is duplication, but with compression (see below) it may be less significant.
  • New table only contains appearances that are not covered by AppearancesHistory table. For address_getAppearances, one then looks in both tables and combines result. One downside is that the AccountHistory table stores block numbers, which may be less suitable (slower, more complex) than storing transaction numbers.

There's a third option here. Enhance the AccountHistory table to include all appearances (both state change and non-state changing). That would serve three purposes:

1) It would work identically the same for the current AccountHistory use case, but need to filter out non-state-changing appearances, 2) It would serve all the OtterScan use cases (the Appearance database is a super-set of the Etherscan-like use case as discussed here: https://tjayrush.medium.com/how-accurate-is-etherscan-83dab12eeedd and here: https://medium.com/coinmonks/trueblocks-covalent-comparison-7b42f3d1e6f7.) The needs of a blockchain explorer are lesser than the needs of an complete appearance indexer, 3) It would cut down on nearly all additional disc space needed to support the feature. Only non-state-changing appearances would have to be added which are significantly less than the total.

tjayrush commented 1 year ago

i'm conservatively supportive.

I would encourage you to think very carefully about this. Adding what I call a "complete" index of appearances allows one to produce a "complete" transactional history for any account. This allows for perfect 18-decimal place accurate re-creation of an accounts activity off-chain. This allows perfect, 18-decimal place accurate automated accounting every 12 seconds.

If you ever wondered why no-one can produce automated tax information from a system that comes to balance for 200,000,000 million accounts every 12 seconds, it's because there is no way to get a complete history.

tjayrush commented 1 year ago

I'm skeptical about the estimate... 80GB seems low, but if it's optional it's whatever

jrush@linux:/m/m/t/v/u/mainnet➤ du /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/
424K    /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/unripe
3.9G    /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/blooms
95G /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/finalized
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/ripe
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/maps
73M /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/staging
99G /mnt/md0/trueblocks/v0.40.0/unchained/mainnet/
jrush@linux:/m/m/t/v/u/mainnet➤ du /mnt/md0/trueblocks/v0.40.0/unchained/sepolia
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/unripe
5.5M    /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/blooms
55M /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/finalized
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/ripe
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/maps
4.0K    /mnt/md0/trueblocks/v0.40.0/unchained/sepolia/staging
74M /mnt/md0/trueblocks/v0.40.0/unchained/sepolia

Also, if it was combined into the AccountHistory table, it would be that much smaller.

sslivkoff commented 1 year ago

I'd love this

thinking aloud here, does putting this in an address_ namespace make sense, as opposed to some other name?

down the road I would love to see additional advanced querying rpc methods (like getBlockAccessList / getTransactionAccessList). not all of these additional methods will be address-related, but they have a similar enough flavor that they might go in the same namespace

so maybe something like query_blockAddresses / query_AddressAppearances would be more future proof?

perama-v commented 1 year ago

I'd love this

:beer:

thinking aloud here, does putting this in an address_ namespace make sense, as opposed to some other name?

Great and important question. I am open to this, let's discuss over where namespace is defined

A2be commented 1 year ago

yes, am supportive. This sort of capability in the EVM node software seems to have been an oversight in the original EVM spec.

Unclear what storage is required, but as long as it is optional, can be a node-specific index. I would gladly trade ~100 GB for such an index for my nodes.

rakita commented 1 year ago

This AccountHistory is a subset of the data address_getAppearances method would return. Additional data includes items such as:

  • A contract is called as a read-only method.
  • An address is mentioned in a log without a state change.

A contract read is a superset of a log.

So, a new table (such as AppearancesHistory) is required to capture this additional data. Either:

  • New table contains all appearances (including state changes covered by AccountHistory table). This is duplication, but with compression (see below) it may be less significant.
  • New table only contains appearances that are not covered by AppearancesHistory table. For address_getAppearances, one then looks in both tables and combines result. One downside is that the AccountHistory table stores block numbers, which may be less suitable (slower, more complex) than storing transaction numbers.

Would definitely go with the first option, as the second path introduces dependency that increases complexity.

The estimate is based on the UnchainedIndex, which has the same data (address-->appearances), but is represented in a custom format suited for distribution by 12 hour block ranges (total size ~80gb, no compression)

Would double that estimation as we need to save this inside db, but db size is not that relevant for discussion.

There's a third option here. Enhance the AccountHistory table to include all appearances (both state change and non-state changing). That would serve three purposes:

AccountHistory has index to the block but it seem for this rpc it is more preferred to go with tx index.

If you ever wondered why no-one can produce automated tax information from a system that comes to balance for 200,000,000 million accounts every 12 seconds, it's because there is no way to get a complete history.

Although this endpoint would include just an Ethereum balance change, it would not have the address/storages of the ERC20 contract storage slots that are, I assume, needed for accounting.

imo appearance is defined very broadly here. And we are talking here just for account appearance (storage is not mentioned). For example to be on the same page, EXTCODESIZE opcode would add this address to the appearance list, access_list defined inside tx would add it, any CALL opcode would add it and any post transaction state change would add it, etc.

The request is to have a list of transaction numbers (not blocks number that we use in AccountHistory) this is not possible with current tx numbers as we have appearances at the end of the block that would fall in the middle of two transactions (last of the block and first of the next block). When we have tx level changesets we had an additional index transition index to accommodate this.

On the naming (not relevant), I would see this name as reth_getAccountApperances as clients with not standardized rpc follow this format. etc https://github.com/paradigmxyz/reth/pull/3768

tjayrush commented 1 year ago

Although this endpoint would include just an Ethereum balance change, it would not have the address/storages of the ERC20 contract storage slots that are, I assume, needed for accounting.

imo appearance is defined very broadly here. And we are talking here just for account appearance (storage is not mentioned). For example to be on the same page, EXTCODESIZE opcode would add this address to the appearance list, access_list defined inside tx would add it, any CALL opcode would add it and any post transaction state change would add it, etc.

The request is to have a list of transaction numbers (not blocks number that we use in AccountHistory) this is not possible with current tx numbers as we have appearances at the end of the block that would fall in the middle of two transactions (last of the block and first of the next block). When we have tx level changesets we had an additional index transition index to accommodate this.

I totally understand everything you're saying here, and yes, the index as original envisioned would index into <block.txid> and I understand why that would be hard. Two things:

1) I think the next version of Erigon is going to try to index per transaction in some places so they can report on per-transaction changes in balance...

2) I haven't thought about this much, but a block-level index may be interesting. It would greatly reduce the size of the index, and would slow down the query, but we use local caching, so that only matters the first time a user retrieves a history. I do understand why this would be much easier for existing nodes.

Concerning the issue of state change, you're exactly right. All the existing indexes that I'm aware of in the node ignore the changes to storage where the thing being stored is an address. This is exactly the problem. This is why it's impossible to re-create state changes (including accurate balance histories) off-chain. I would almost go so far as to say that implementing an index that does not capture these types of appearances (where an address is "used" by a smart contract) wouldn't be worth it. You can get all the other "appearances" already.

This "adding all appearances" may be more complicated that you might thing though because a HUGE number of such appearances happen as left-side zero-padded 32-bytes. You have to use some heuristics to identify things that "appear" to be addresses. I've written extensively about this issue on pages 32-41 of this document: https://trueblocks.io/papers/2023/specification-for-the-unchained-index-v0.51.0-beta.pdf

Concerning naming, I think I probably fall down on the side of reth_ as that is the way nodes add "non-standard" names spaces. It's not perfect, but it avoids a lot of complicated negotiation with others and that problem (what namespace) can be solved later.

perama-v commented 1 year ago

Would definitely go with the first option, as the second path introduces dependency that increases complexity.

Thanks

imo appearance is defined very broadly here. And we are talking here just for account appearance (storage is not mentioned). For example to be on the same page, EXTCODESIZE opcode would add this address to the appearance list, access_list defined inside tx would add it, any CALL opcode would add it and any post transaction state change would add it, etc.

Yes, it is a broad definition to answer the question: "which transactions should I look further into for my address?". While broad in coverage, the Appearances spec (see link at top of issue) aims to have a precise inclusion definition. See also the implementation in #4162. The EXTCODESIZE opcode and storage activity are not part of the definition.

Although this endpoint would include just an Ethereum balance change, it would not have the address/storages of the ERC20 contract storage slots that are, I assume, needed for accounting.

The main idea is to be agnostic to the end use case. The method returns a list of relevant transaction ids that could be used by the user to start ERC20 accounting, ERC721 accounting, ether accounting, or any other purpose. They would fetch and inspect/re-execute transactions according to their needs, which may or may not involve looking at storage changes.

The request is to have a list of transaction numbers (not blocks number that we use in AccountHistory) this is not possible with current tx numbers as we have appearances at the end of the block that would fall in the middle of two transactions (last of the block and first of the next block). When we have tx level changesets we had an additional index transition index to accommodate this.

I have implemented a possible mechanism to achieve this, using a wrapper that adds bitflags to a TxNum. E.g., A "miner" recipient in block n is represented as the TxNum of the first tx in that block, with a specific flag in the upper bits. So all data can be kept in one table:

tjayrush commented 1 year ago

The method returns a list of relevant transaction ids that could be used by the user to start ERC20 accounting, ERC721 accounting, ether accounting, or any other purpose.

I wanted to highlight this for two reasons: (1) the definition of appearance should be as broad as possible so that it includes any possible use case-let the end user decide which particulars they are interested in, and (2) almost every time I explain this to someone, I use the idea of accounting, but the "any other purpose" of this may be the most important. With a complete list of appearances, once could re-create any state change whatsoever. Not just changes in balances. Changes in any state. This also argues for as inclusive an index as possible. That's why I focus on what I call "Every appearance of any address anywhere on the chain."

But, note that there has to be the notion of a badress (that is otherwise valid addresses that are excluded because if one does not exclude them the index ballons in size). Every single 20-byte string of bytes is a valid address, but, as an example, 0x0000000000000000000000000000ffffffffffff probably isn't one. All pre-compiles are also baddresses. Part of the very clear definition of an appearance has to exclude baddresses.

gakonst commented 1 year ago

Love the progress on all this - just to say it explicitly, we're happy to iterate on this in the branch, but ideally we can figure out ways to introduce APIs like this w/o necessarily introducing large main codebase maintenance overheads..

We already have utilities for extending the CLI, see here: https://github.com/paradigmxyz/reth/blob/main/examples/additional-rpc-namespace-in-cli/src/main.rs#L52. So it might be a matter of showing how to also run additional stages.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

A2be commented 1 year ago

Happy to see this reopened, just 'cause I'm really looking forward to seeing Reth implement a solution to, in my view, the early failure to spec this into the EVM in 2015.