opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

Automatic deployer address crawling #1048

Closed: ravenac95 closed this issue 8 months ago

ravenac95 commented 8 months ago

Describe the feature you'd like to request

We need to automatically validate (#1047) and crawl deployer addresses.

An address with a deployer tag is a project-controlled EOA used for deploying smart contracts. For instance, here is a list of deployers that OP Labs uses to trace deployments on Dune.

The same deployer is typically used across multiple chains, so this is a better way of tracking multi-chain apps.

Describe the solution you'd like

When a deployer is linked to a project, we should trace all contracts that have been deployed by the address and, if they meet certain requirements, those contracts should become artifacts that are linked to the project. Currently, I use the Etherscan API txlist action and run the following validation logic on the result:

[tx['contractAddress'] for tx in result if not tx['to'] and tx['input'] and tx['isError'] == '0']
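For reference, a rough end-to-end sketch of that flow (untested; the function name is just for illustration, and an ETHERSCAN_API_KEY environment variable is assumed):

import os
import requests

ETHERSCAN_URL = "https://api.etherscan.io/api"

def get_deployed_contracts(deployer_address: str) -> list[str]:
    # Fetch the normal transaction list for the deployer EOA via the txlist action.
    resp = requests.get(ETHERSCAN_URL, params={
        "module": "account",
        "action": "txlist",
        "address": deployer_address,
        "sort": "asc",
        "apikey": os.environ["ETHERSCAN_API_KEY"],
    })
    data = resp.json()
    result = data.get("result", []) if data.get("status") == "1" else []
    # A contract creation has no `to` address, carries init code in `input`,
    # and must not have errored.
    return [
        tx["contractAddress"]
        for tx in result
        if not tx["to"] and tx["input"] and tx["isError"] == "0"
    ]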

At some regular interval (e.g., weekly), we should crawl for new deployments and add those contracts.

A user should also be able to trigger a crawl for a project or address manually.

Describe alternatives you've considered

N/A

ravenac95 commented 8 months ago

Decomposition of #1034

ravenac95 commented 8 months ago

So this has actually become more of a cost/strategy question. I have two viable solutions (one easier than the other):

Solution 1

We could run something like the following query and materialize the result daily:

SELECT from_address, to_address
FROM `bigquery-public-data.crypto_ethereum.transactions` 
WHERE to_address IS NULL

That way we would always have the full set of deployers on hand, and we wouldn't need to re-scan the public dataset every time we get new addresses.

Otherwise, a full scan for just this query is ~189GB right now, so after ~5 queries that's ~$7. If we store the values instead, storage is at most $0.04/GB * 189GB, so ~$8/month; that's the maximum because the query processes 189GB, and with filtering the stored result would only be some fraction of that 189GB * $0.04/month. Storing the data seems cheaper in the long term, though the table will grow continuously. Either way, having this data on hand immediately makes the process quite simple.
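Back-of-envelope with the numbers above (just restating this comment's rough figures, not quoted prices):

# Assumed figures from the estimate above, not real quotes.
scan_gb = 189                      # bytes processed by one full scan today
five_scan_cost = 7.00              # ~5 full scans ~= $7
storage_per_gb_month = 0.04        # assumed worst-case storage price

per_scan_cost = five_scan_cost / 5                  # ~$1.40 per full scan
storage_ceiling = scan_gb * storage_per_gb_month    # ~$7.56/month upper bound
print(per_scan_cost, storage_ceiling)

So repeated scanning overtakes the storage ceiling after roughly five or six scans per month, and the stored table should be much smaller than 189GB anyway.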

Solution 2

The other solution is to create a data connector that runs continuously and queries the Etherscan API. It's essentially free (100,000 requests/day). It works at the current size of the data but probably isn't sustainable in the long run.

@ccerv1 @ryscheng thoughts? I'm inclined to just store this and then query against the stored values on oss-directory as we need to run validations.
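To illustrate the stored-values path, a minimal sketch of a lookup against the materialized table (table name, function name, and use of the google-cloud-bigquery client are all assumptions here):

from google.cloud import bigquery

# Placeholder name for the daily materialization of the query above.
DEPLOYERS_TABLE = "oso.eth_deployers"

def is_known_deployer(address: str) -> bool:
    # Validation becomes a cheap lookup against the stored deployers
    # instead of a full scan of the public transactions dataset.
    client = bigquery.Client()
    job = client.query(
        f"SELECT 1 FROM `{DEPLOYERS_TABLE}` WHERE from_address = @addr LIMIT 1",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                # Addresses in the public dataset are lowercase hex strings.
                bigquery.ScalarQueryParameter("addr", "STRING", address.lower()),
            ]
        ),
    )
    return len(list(job.result())) > 0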

ravenac95 commented 8 months ago

Actually, I'm now convinced that using the data warehouse is the only right way to do this, so I'll build towards that unless there are objections.