Peek Address apperances with Trueblocks?

DistributedDoge commented 3 months ago

Range queries can be hard to do efficiently without hammering a node because asking for "Vitalik transactions from block 0 to now" may require checking every block. This is a problem if someone has to rely on remote public node with rate limits.

We don't strictly need every block to answer that query, but without knowing which blocks contain Vitalik transactions, what choice do we have?

Trueblocks is an indexer whose maintainers provide an IPFS-hosted unchained index which can be used to find all blocks containing an "apperance" of a given Ethereum address.

The Unchained index is updated regularly, but infrequently (I think last update was 30 days ago)

Still, if we pretend to have accurate index, we can try such strategy:

start with address we want to check -> vitalik.eth -> resolves to 0xabc...
download some chunks of unchained index via chifra to find all apperances of that address, we now have list of relevant blocks
query only relevant blocks via RPC-API, cryo or whatever tool EQL uses under the hood

The caveat is this would makes performance better for obscure addresses yet worse for popular ones:

The number of index chunks you download depends entirely on the addresses you query. Of course, if you query a lightly used address, very few chunks will be downloaded. However, if you query are heavily used address, such as UniSwap, nearly every chunk will be downloaded

This could be blind alley, but I think it is worth checking out so would happy to test how many "chunks" are needed by chifra and report the results.

iankressin commented 3 months ago

I spent sometime digging into Trueblocks and here are some thoughts:

Pros:

The project is aligned with EQL's end-game when it comes centralization vectors in the space. It distributes the index via IPFS to make faster to sync data locally, this is really similar with what I described as a possible solution in the release post.
Once synced, queries are really fast.

Cons:

Leverage Trueblocks' index implies that every user/application will need have a local indexer running, since Trueblocks doesn't provide a crate to be compiled along with EQL.
Users would need to wait until their local indexer is synced before running queries, which is somewhat a bad UX.

In regards to heavily used addresses and I think we should be fine given users the option to wait for large queries or to provide a max amount of records being returned.

so would happy to test how many "chunks" are needed by chifra and report the results. That would be amazing!

DistributedDoge commented 3 months ago

Tried chifra init which is supposed to be fast-now, slow-later. This indeed turns out to be true.

For crypto OG address like owocki.eth the lookup with chifra list takes 5 minutes or more.
For lite-address with 30 transactions lookup happens in seconds though.

I am not expecting magic to happen - someone has to trawl the blockchain and download the data, but as you hinted running an indexer that trawls IPFS is only "slightly less bad" UX-wise than running a full node.

Maybe Trueblocks folks can figure something for the skill issue I am having here - but for now it looks like we traded hammering JSON-RPC node for hammering IPFS gateway.

This seems a bit hardcore to implement, but hopefully that fancy BTree structure presented in the paper would prove to be more efficient.

iankressin commented 3 months ago

For crypto OG address like owocki.eth the lookup with chifra list takes 5 minutes or more.

For lite-address with 30 transactions lookup happens in seconds though.

This doesn't sound bad, actually. Better than expected. Tysm for the report!

Maybe Trueblocks folks can figure something for the skill issue I am having here

I believe that's a limitation of Trueblocks design, not a skill issue.

This seems a bit hardcore to implement, but hopefully that fancy BTree structure presented in the paper would prove to be more efficient.

Yepp. But hopefully we'll find a way to safely distribute that tree using something like IPFS. Not trivial to implement though.

One thing we can work on in the meanwhile is block ranges. We can cycle through public RPCs to avoid rate-limits and implementing a rate-limit on the EQL side

tjayrush commented 3 months ago

Hi. I'm Jay. From TrueBlocks. We make the Unchained Index. You don't to hit the IPFS repeatedly. You could do chifra init --all which would download the entire index to your local drive (120GB) and then the searching is super fast. If you do chifra init you get 4GB of bloom filters which is how we know which larger parts of the index you need. Your choice. Fast at first, slow in the middle, fast in the end or slow at first, fast forever.

tjayrush commented 3 months ago

Leverage Trueblocks' index implies that every user/application will need have a local indexer running

In the end, it's our belief that if we actually want censorship resistant, distributed private access to our own data everyone will do this naturally. In fact, we think this should not only be in the node, but that the node should be in the operating system.

iankressin commented 3 months ago

In the end, it's our belief that if we actually want censorship resistant

That's exactly what EQL wants to build. Something along the lines of a distributed version of Dune

iankressin commented 3 months ago

you get 4GB of bloom filters which is how we know which larger parts of the index you need

How frequently the remote bloom filter is updated?

yulesa commented 2 months ago

This touches a bit on what EQL wants to be. Using the example on the original post: “I want block’s information about all the blocks in which Vitalik made transactions.” A data provider that indexes blocks and transactions can easily solve this. You can make a simple query like the one below:

WITH vitalik_tx AS (
    SELECT block_number
    FROM ethereum.transactions
    WHERE "from" = 0xd8dA6BF26964aF9D7eEd9e03E53415D37aA96045 --vitalik.eth
)
SELECT *
FROM ethereum.blocks b
LEFT JOIN vitalik_tx t ON  b.number = t.block_number

All indexers provide APIs to make the same logic as the SQL code above. The only reason you can’t do this with a simple RPC is that although nodes store all transaction data, they don’t index it by sender address, so you can’t filter the transactions by the sender—a trivial indexing functionality.

Now, if you don’t want to use a data provider that indexes those raw data for you, you would need to index the data yourself. That would be fine, but EQL is already targeting users who don’t want to run nodes themselves by using RPC providers. What is the point of requiring a user to index data?

Full decentralization means you need to do the job yourself.

Why don't Ethereum clients (or the protocol) already index transactions? It does this for logs using the logs bloom. I think this comes with a design decision by generic client's devs. They are building clients for everyone, and they reasoned that not everyone would want this functionality. Nothing prevents you from making your own client fork that index and distributing it for anyone to run.

Now, EQL could develop tools to make indexing easier and cheaper because if it’s easier and cheaper to index, more businesses can provide it, which will lead to more competition, cheaper access, and less trust and dependency in one player. If you’re doing this, you would probably want to catch up with the tip of the chain, which requires your own node. Indexing would be just a simple extension to better store the data a node already has.

Maybe I'm explaining something that is obvious to anyone in this conversation; if that's the case, sorry for that.

tjayrush commented 2 months ago

This was obvious to us seven years ago. That's why we've been arguing indexing should be part of the node software. But blind Web 2.0 like indexing will never work. It's monotonically increasingly large. It never gets smaller. That's why TrueBlocks "shards and shares naturally" as we discuss here: https://trueblocks.io/papers/2023/specification-for-the-unchained-index-v2.0.0-release.pdf.

Distrubuted data, in our opinion, cannot be in a monolithic database "provided" by providers. Every single participant must carry their own weight -- and share the portion of the data they carry. And, that carrying must be heavily redundant. And it has to be always on. And always pinned. That's why not only must users run nodes, but they must also run file sharing nodes such as IPFS.

tjayrush commented 2 months ago

Here's an article we wrote recently showing why indexing is so important: https://github.com/TrueBlocks/trueblocks-compare/blob/main/results/with-3-providers-2024-06-26.md. Pay attention, in particular, to The Results secion.

iankressin / eql

Peek Address apperances with Trueblocks? #14