Closed allenday closed 1 year ago
I can't quite remember the reason why but in the initial development phase I explicitly didn't use the BigQuery GitHub dataset although I knew it existed. Retrospectively though I agree in that using BigQuery would simplify the codebase by a lot. Could you perhaps provide me a SQL query which returns all GitHub repositories containing the Solidity language (regardless of whether it's used as a primary or seconday language)? Just so I can see how I would replace the current implementation with BigQuery.
I can support your development with the github dataset, yes. Please email me allenday [at] google [dot] com to discuss.
Here's a query to get started:
SELECT
f.repo_name, f.path, c.content,
REGEXP_EXTRACT_ALL(c.content, r'(event\s+?\S+?\s*?\(.*?\))') AS events,
REGEXP_EXTRACT_ALL(c.content, r'(function\s+?.+?\(.*?\))') AS functions
FROM
`bigquery-public-data.github_repos.contents` AS c,
`bigquery-public-data.github_repos.files` AS f
WHERE TRUE
AND f.id = c.id
AND LOWER(f.path) LIKE '%.sol'
This finds 31114 records
That number seems quite low, could you report the results of the following two queries
Just making sure if replacing the current crawler with BigQuery has any benefits. These are the results of the current Etherface database:
-- 5.876.363 records
SELECT COUNT(*) FROM github_repository WHERE fork IS FALSE;
-- 197.225 records
SELECT COUNT(*) FROM github_repository WHERE fork IS FALSE AND solidity_ratio > 0.0;
The model is different in BQ and I'm not very familiar with it, but here are some queries with comparable intent:
-- 3,323,531
SELECT COUNT(DISTINCT(repo_name)) FROM `bigquery-public-data.github_repos.files`
-- 83
SELECT COUNT(repo_name) FROM `bigquery-public-data.github_repos.languages` AS x
WHERE EXISTS (SELECT 1 FROM x.language WHERE name = 'Solidity')
I don't see any benefits of replacing the current crawler with BigQuery; closing this for now.
Note that all of Github is already present and kept up-to-date for public read in BigQuery - see:
https://console.cloud.google.com/marketplace/product/github/github-repos
Using this instead of the search API would simplify the codebase (no token rotation) and let new databases populate more quickly with a single scan of the github dataset.