volsa / etherface

Ethereum Signature Database
https://www.etherface.io
GNU General Public License v3.0

Consider integrating contract addresses found on BigQuery #9

Open volsa opened 1 year ago

volsa commented 1 year ago

The current implementation for finding source code on Etherscan consists of two steps:

  1. Check the https://etherscan.io/contractsVerified page every few minutes for any updates, inserting addresses along with other metadata into the database
  2. Call the Etherscan API to fetch source code and scrape signatures from newly found addresses, inserting them into the database

While this works, the biggest drawback is that we don't have a list of all contract addresses with associated source code currently hosted on Etherscan. Instead, we only have the latest contract addresses, starting from the initial deployment of Etherface.

Google BigQuery hosts the bigquery-public-data.crypto_ethereum.contracts dataset, which offers (all?) contract addresses on the Ethereum network. In theory we can populate these addresses into the Etherface database (more specifically into the etherscan_contract table), which will then simply execute step 2 if and only if the address has source code hosted on Etherscan. This feature could (and probably will) dramatically increase the number of signatures found on Etherscan as well as the number of unique signatures.
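
A minimal sketch of what that seeding could look like on the Postgres side (this assumes the NOT NULL metadata constraints discussed later in this thread have already been relaxed; the address column and conflict key are assumptions, only scraped_at is confirmed by the schema):

-- seed an address from BigQuery into the scraping queue; scraped_at
-- stays NULL so the scraper picks it up on its next pass
INSERT INTO etherscan_contract (address, scraped_at)
VALUES ('0x00000000219ab540356cbb839cbe05303d7705fa', NULL)
ON CONFLICT (address) DO NOTHING;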

allenday commented 1 year ago
SELECT address FROM `bigquery-public-data.crypto_ethereum.contracts` WHERE DATE(block_timestamp) = "2022-11-23"

We can run this once a day using a DATE interval instead of a specific day in the WHERE clause. This will cut IO down to only the most recent (yesterday's) partition.
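
Something like the following (a sketch; DATE_SUB against CURRENT_DATE keeps the scan limited to yesterday's partition):

-- daily variant: always pull only yesterday's partition
SELECT address
FROM `bigquery-public-data.crypto_ethereum.contracts`
WHERE DATE(block_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)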

kalanyuz commented 1 year ago

Want to help with this. Where should I start?

allenday commented 1 year ago

we need to understand from @volsa how new contract addresses / URLs can be added to the scraper queue

allenday commented 1 year ago

it's also possible to reduce the amount of scraping needed by looking at the contract creation bytes, because identical bytes will always produce identical contracts regardless of the address.

SELECT address FROM `bigquery-public-data.crypto_ethereum.contracts` WHERE DATE(block_timestamp) = "2022-11-29" 
AND bytecode NOT IN
(
  SELECT DISTINCT bytecode FROM `bigquery-public-data.crypto_ethereum.contracts` WHERE DATE(block_timestamp) < "2022-11-29"
)

this query isn't optimal though - what's needed is a hash of the bytecode in blockchain-etl, see issue https://github.com/blockchain-etl/ethereum-etl-airflow/issues/480
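
Until that lands, a sketch of the same dedupe with the hash computed inline (FARM_FINGERPRINT is BigQuery's built-in 64-bit fingerprint; note this still scans the full bytecode column, so it shortens the comparison keys but doesn't save IO by itself):

SELECT address FROM `bigquery-public-data.crypto_ethereum.contracts` WHERE DATE(block_timestamp) = "2022-11-29"
AND FARM_FINGERPRINT(bytecode) NOT IN
(
  SELECT DISTINCT FARM_FINGERPRINT(bytecode) FROM `bigquery-public-data.crypto_ethereum.contracts` WHERE DATE(block_timestamp) < "2022-11-29"
)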

kalanyuz commented 1 year ago

Interesting. I'm doing something similar where I use:

SELECT DISTINCT TO_HEX(MD5(bytecode))

instead of FARM_FINGERPRINT.

volsa commented 1 year ago

If you look at the github_repository and etherscan_contract tables[1] you'll see a scraped_at field. This field serves as the scraping queue: every n minutes the scraper/github.rs and scraper/etherscan.rs modules check whether there are any entities in the database where scraped_at is NULL; if so, their source code is fetched and parsed, and the scraped_at field is updated. Initially, all data inserted into the database by the fetcher modules has the scraped_at field set to NULL; take a look at the etherface/src/fetcher/etherscan.rs and etherface/src/scraper/etherscan.rs files to get a better understanding of the fetching and scraping / parsing process.
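
In SQL terms the queue semantics boil down to roughly this (a sketch; scraped_at comes from the schema in [1], the address binding is illustrative):

-- entities still waiting in the queue
SELECT * FROM etherscan_contract WHERE scraped_at IS NULL;

-- after fetching and parsing an entity's source code, dequeue it
UPDATE etherscan_contract SET scraped_at = NOW() WHERE address = $1;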

Anyways, in theory this issue can be fixed by modifying the fetcher/etherscan.rs file to feed contract addresses into the scraping queue not only from https://etherscan.io/contractsVerified but also from BigQuery. There are a few issues I currently see with using BigQuery, however:

  1. The etherscan_contract table has a few metadata fields such as name, compiler and compiler_version. These fields are currently flagged as NOT NULL simply because they always appear on https://etherscan.io/contractsVerified, so that would have to be changed.
  2. Obviously Etherscan does not host source code for all contract addresses on the Ethereum network. Assuming we feed all addresses from BigQuery into Etherface, the etherscan_contract table will have a lot of records where scraped_at is NULL because these addresses do not exist on Etherscan and consequently can't be scraped. This is only a nitpick however and not really an issue.
  3. As a result of 2. we'll probably need to introduce another field into the etherscan_contract table such as has_source_code to flag whether the address has any source code hosted on Etherscan (see the sketch after this list). Alternatively we can ignore this and simply set scraped_at to today's date when receiving a 404 error indicating the source code does not exist on Etherscan. This is needed because the scraper would otherwise re-visit all records where scraped_at is NULL. I'd prefer the former solution though.
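
A sketch of the schema changes from 1. and 3. combined (Postgres syntax; whether has_source_code should be nullable or carry a default is an open design question):

-- 1. relax the metadata constraints
ALTER TABLE etherscan_contract ALTER COLUMN name DROP NOT NULL;
ALTER TABLE etherscan_contract ALTER COLUMN compiler DROP NOT NULL;
ALTER TABLE etherscan_contract ALTER COLUMN compiler_version DROP NOT NULL;

-- 3. flag addresses whose source code is actually hosted on Etherscan,
--    so the scraper can skip known 404s instead of re-visiting them
ALTER TABLE etherscan_contract ADD COLUMN has_source_code BOOLEAN;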

Feel free to work on this if you want, but just as a heads-up: I was planning on re-writing Etherface once I find some time (probably in a few months), at which point I'll probably also address this issue.

[1] https://github.com/volsa/etherface/blob/master/migrations/2022-03-06-133006_etherface_database/up.sql

volsa commented 1 year ago

Also consider changing the fetchers and scrapers variables inside the etherface/src/main.rs file to

let fetchers: Vec<Box<dyn Fetcher + Sync + Send>> = vec![Box::new(EtherscanFetcher)];
let scrapers: Vec<Box<dyn Scraper + Sync + Send>> = vec![Box::new(EtherscanScraper)];

to only start the Etherscan fetcher and scraper; this makes development much easier because you'll get less log clutter.

allenday commented 1 year ago

Regarding the NULL fields referred to in (1), would it be acceptable in the short term to define a NULL token, like <NULL> and insert this instead? It could be used operationally as a placeholder until the rewrite.

allenday commented 1 year ago

Regarding (3), here's where we need to detect the 404 as an alternate block: https://github.com/volsa/etherface/blob/master/etherface/src/scraper/etherscan.rs#L28

allenday commented 1 year ago

Regarding (2): it looks like the Etherscan API only supports a single version of the source [1], but GitHub does support multiple versions, and I'm not sure if we would need to track the file version. I propose adding a new table for storing contract source code.

[1] https://docs.etherscan.io/api-endpoints/contracts#get-contract-source-code-for-verified-contract-source-codes

volsa commented 1 year ago

Regarding the NULL fields referred to in (1), would it be acceptable in the short term to define a NULL token, like <NULL>, and insert this instead? It could be used operationally as a placeholder until the rewrite.

Yes, that shouldn't be a problem; in fact, if you execute the following query you'll get a few records where the metadata is missing because these addresses were taken from https://github.com/aphd/smart-corpus-api (https://aphd.github.io/smart-corpus/):

SELECT * FROM etherscan_contract WHERE compiler LIKE 'n/a';

I totally forgot about that, but I'd prefer these fields to be nullable in the rewrite.

Regarding (2) [...] I propose adding a new table for storing contract source code.

Would this table have any mappings to other tables?

Regarding (3), here's where we need to detect the 404 as an alternate block: https://github.com/volsa/etherface/blob/master/etherface/src/scraper/etherscan.rs#L28

You'll also need to change the following function https://github.com/volsa/etherface/blob/master/etherface-lib/src/api/mod.rs#L196 to detect 404 errors, because for whatever reason the Etherscan devs decided to always return a 200 HTTP response regardless of whether the request was actually successful. Instead, they wrap the actual errors in the JSON response body. @kalanyuz feel free to ping me with any questions if you decide to work on this :)

allenday commented 1 year ago

Would this table have any mappings to other tables?

In the spirit of Etherface as an operational DB, I'd link the source code to etherscan_contract.address as a FK.

Otherwise, we could introduce a new mapping table of contract <> hash(bytecode) and store the source code using hash(bytecode) as the FK.
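
A sketch of that mapping-table variant (Postgres; every name here is illustrative, none of them exist in the current schema):

-- source code deduplicated by bytecode hash
CREATE TABLE contract_source_code (
    bytecode_hash BYTEA PRIMARY KEY,
    source_code   TEXT NOT NULL
);

-- map each contract to its (hashed) bytecode
ALTER TABLE etherscan_contract
    ADD COLUMN bytecode_hash BYTEA REFERENCES contract_source_code (bytecode_hash);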

Definitely the hashed bytecode needs to be done at some point because of the long-tail nature of contracts, and a design decision needs to be made: how much of this work do we do in etherface, and how much do we push to BigQuery?

volsa commented 1 year ago

I'd be down to also store contract code; for Etherscan this should be quite simple because of its immutable nature, though for GitHub it will be much trickier.

Definitely the hashed bytecode needs to be done at some point because of the long-tail nature of contracts, and a design decision needs to be made: how much of this work do we do in etherface, and how much do we push to BigQuery?

Since we would store the source code in the database anyway, there is nothing stopping us from also storing its hashed bytecode; out of curiosity, how would we off-load this to BigQuery though? My understanding was that we restore database dumps into BigQuery; would we then just calculate these hashes during the restoration process?

Anyways I'll create a tracking issue for the Etherface rewrite in the coming days where I'll summarize all features mentioned in the last few days among other ones.

allenday commented 1 year ago

Since we would store the source code in the database anyway, there is nothing stopping us from also storing its hashed bytecode; out of curiosity, how would we off-load this to BigQuery though? My understanding was that we restore database dumps into BigQuery; would we then just calculate these hashes during the restoration process?

You're right, there's nothing preventing etherface from calculating and storing hashed bytecode as well as sources, whether linked directly to addresses or via the hashed bytecode as an M2M key.

I guess where our thinking may differ is that I'm mainly aiming to make more analytical data available in BQ to supplement the crypto_ethereum, crypto_polygon, etc. datasets.

Having the data in BQ is for the purpose of building dashboards, calculating metrics (e.g. plotting the frequency of signatures in traces vs. the frequency of signatures in contracts, or identifying which signatures are most common but have no known text / source), etc.

Some of what would be useful to import to BQ from Etherface you've already written - this is why I started to engage with the project, thank you! I imagine there are other things you envision for Etherface that aren't necessarily useful for the BigQuery supplementation use case. I don't have a strong opinion about these, but I can contribute some mental effort to help keep the project healthy.

Now, regarding source code: my guess is that we'll want to integrate an Airflow or Dataflow job, using either BQ or Pg as the DW, that does post-processing of the sources to extract more useful info.

One specific case I have in mind is auto-generating CREATE VIEW statements from named method parameters.
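
For instance, for the ERC-20 method transfer(address to, uint256 value) the generated view might look roughly like this (a hand-written sketch of the intended output; mydataset is a placeholder, while a9059cbb is the real 4-byte selector of transfer(address,uint256)):

CREATE VIEW mydataset.transfer_calls AS
SELECT
  to_address AS contract_address,
  CONCAT('0x', SUBSTR(input, 11, 64)) AS param_to,    -- `address to`, left-padded word
  CONCAT('0x', SUBSTR(input, 75, 64)) AS param_value  -- `uint256 value`
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE STARTS_WITH(input, '0xa9059cbb');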

It's probably easier to use BQ as the DW for this, as there is already strong support for integrating with other GCP services. One point in favor of Pg would be its lower latency as an operational DW, but the rate of contract creation is relatively low, and while the latency to get into BQ can be high, a daily job is good enough. So low latency isn't really a major requirement, at least from this supplementation perspective.