Use BigQuery Github dataset instead of search API

volsa / etherface

Ethereum Signature Database

https://www.etherface.io

GNU General Public License v3.0

171 stars 22 forks source link

Use BigQuery Github dataset instead of search API #8

Closed allenday closed 1 year ago

allenday commented 1 year ago

Note that all of Github is already present and kept up-to-date for public read in BigQuery - see:

https://console.cloud.google.com/marketplace/product/github/github-repos

Using this instead of the search API would simplify the codebase (no token rotation) and let new databases populate more quickly with a single scan of the github dataset.

volsa commented 1 year ago

I can't quite remember the reason why but in the initial development phase I explicitly didn't use the BigQuery GitHub dataset although I knew it existed. Retrospectively though I agree in that using BigQuery would simplify the codebase by a lot. Could you perhaps provide me a SQL query which returns all GitHub repositories containing the Solidity language (regardless of whether it's used as a primary or seconday language)? Just so I can see how I would replace the current implementation with BigQuery.

allenday commented 1 year ago

I can support your development with the github dataset, yes. Please email me allenday [at] google [dot] com to discuss.

allenday commented 1 year ago

Here's a query to get started:

SELECT 
  f.repo_name, f.path, c.content, 
  REGEXP_EXTRACT_ALL(c.content, r'(event\s+?\S+?\s*?\(.*?\))') AS events,
  REGEXP_EXTRACT_ALL(c.content, r'(function\s+?.+?\(.*?\))') AS functions
FROM
  `bigquery-public-data.github_repos.contents` AS c,
  `bigquery-public-data.github_repos.files` AS f
WHERE TRUE
  AND f.id = c.id 
  AND LOWER(f.path) LIKE '%.sol'

This finds 31114 records

volsa commented 1 year ago

That number seems quite low, could you report the results of the following two queries

Number of total repositories in the GitHub dataset
Number of total repositories in the GitHub dataset which contain Solidity code / files

Just making sure if replacing the current crawler with BigQuery has any benefits. These are the results of the current Etherface database:

-- 5.876.363 records
SELECT COUNT(*) FROM github_repository WHERE fork IS FALSE; 

-- 197.225 records
SELECT COUNT(*) FROM github_repository WHERE fork IS FALSE AND solidity_ratio > 0.0;

allenday commented 1 year ago

The model is different in BQ and I'm not very familiar with it, but here are some queries with comparable intent:

-- 3,323,531
SELECT COUNT(DISTINCT(repo_name)) FROM `bigquery-public-data.github_repos.files`

-- 83
SELECT COUNT(repo_name) FROM `bigquery-public-data.github_repos.languages` AS x
WHERE EXISTS (SELECT 1 FROM x.language WHERE name = 'Solidity')

volsa commented 1 year ago

I don't see any benefits of replacing the current crawler with BigQuery; closing this for now.