nexB / vulnerablecode

A free and open vulnerabilities database and the packages they impact. And the tools to aggregate and correlate these vulnerabilities. Sponsored by NLnet https://nlnet.nl/project/vulnerabilitydatabase/ for https://www.aboutcode.org/ Chat at https://gitter.im/aboutcode-org/vulnerablecode Docs at https://vulnerablecode.readthedocs.org/
https://public.vulnerablecode.io
Apache License 2.0

Scrape Github security advisories using HTML scraping #297

Open pombredanne opened 3 years ago

pombredanne commented 3 years ago

The GitHub advisories are somewhat weird:

  1. the GraphQL API data requires auth and is incomplete (it does not contain external references)
  2. the HTML data at https://github.com/advisories contains more data, BUT browsing is limited to 40 pages of 25 advisories, meaning only 1000 can be scraped from the browse page when there are about 3019 advisories. The references are there, quite often including the fixing commit.
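To make the browse-page limit concrete, here is the arithmetic as a tiny sketch. The three numbers come from the comment above; the variable names are just for illustration.

```python
import math

PAGES_CAP = 40           # browse pages available, per the comment above
PER_PAGE = 25            # advisories listed per page
TOTAL_ADVISORIES = 3019  # approximate total quoted above

# Advisories reachable from one browse listing.
reachable = PAGES_CAP * PER_PAGE

# Advisories that a single listing can never reach.
unreachable = TOTAL_ADVISORIES - reachable

# Lower bound on the number of separate listings needed to cover everything,
# assuming each listing could be filled right up to the cap.
min_listings = math.ceil(TOTAL_ADVISORIES / reachable)

print(reachable, unreachable, min_listings)
```

In practice more listings than this lower bound would likely be needed, since search subsets rarely split evenly.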

Therefore I think we should use either:

  1. a hybrid model where we get the list of advisories from GraphQL API calls and then scrape the individual pages
  2. a pure HTML model where we issue several searches to browse subsets of the data that are less than 40 pages each and hope to home in on the full 3000+ advisories.
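A rough sketch of how the hybrid model (option 1) could be wired up, not an actual implementation. It assumes a hypothetical `fetch_page` callable that wraps the API call and returns one page of advisory IDs plus a cursor for the next page. The per-advisory URL pattern is the one used by https://github.com/advisories/ pages. The first ID in the demo is a made-up placeholder.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical fetcher: takes a cursor (None for the first page) and returns
# (advisory_ids, next_cursor). next_cursor is None when there are no more pages.
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]

ADVISORY_URL = "https://github.com/advisories/{ghsa_id}"


def iter_advisory_ids(fetch_page: PageFetcher) -> Iterator[str]:
    """Walk every page of the listing and yield each advisory ID."""
    cursor = None
    while True:
        ids, cursor = fetch_page(cursor)
        yield from ids
        if cursor is None:
            break


def advisory_page_urls(fetch_page: PageFetcher) -> Iterator[str]:
    """Yield the HTML page URL to scrape for each advisory in the listing."""
    for ghsa_id in iter_advisory_ids(fetch_page):
        yield ADVISORY_URL.format(ghsa_id=ghsa_id)


def fake_fetch(cursor: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Stand-in for the real API call: two pages of one ID each."""
    pages = {
        None: (["GHSA-aaaa-bbbb-cccc"], "cursor-1"),  # placeholder ID
        "cursor-1": (["GHSA-c653-6hhg-9x92"], None),
    }
    return pages[cursor]


urls = list(advisory_page_urls(fake_fetch))
print(urls)
```

Keeping the fetcher injectable means the listing source (GraphQL, a git checkout, or a test stub) can change without touching the scraping side.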

An existing scraper is at https://github.com/yusufsn/local-repo/blob/87054815200d3add63f201d9feb1e2bedd18d0d6/code/urls_crawlers.ipynb#L177

sbs2001 commented 3 years ago

Btw, I just found out that the GraphQL API also provides the references, and those match up with the data in the web UI.
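For illustration, a sketch of what asking for references could look like. The field names (`securityAdvisories`, `ghsaId`, `permalink`, `references { url }`) are assumed from GitHub's public GraphQL schema and should be checked against the current docs. The sample response is hand-written, not real API output, and the request itself (which needs an auth token) is left out.

```python
import json

GRAPHQL_ENDPOINT = "https://api.github.com/graphql"

# Assumed query shape; verify field names against the GraphQL schema docs.
QUERY = """
query($first: Int!, $after: String) {
  securityAdvisories(first: $first, after: $after) {
    pageInfo { hasNextPage endCursor }
    nodes {
      ghsaId
      summary
      permalink
      references { url }
    }
  }
}
"""


def build_payload(first=100, after=None):
    """Serialize the JSON body to POST to GRAPHQL_ENDPOINT."""
    return json.dumps({"query": QUERY, "variables": {"first": first, "after": after}})


def extract_references(response):
    """Map each advisory ID in a response to its list of reference URLs."""
    nodes = response["data"]["securityAdvisories"]["nodes"]
    return {node["ghsaId"]: [ref["url"] for ref in node["references"]] for node in nodes}


# Hand-written sample in the assumed response shape (placeholder URL).
sample_response = {
    "data": {
        "securityAdvisories": {
            "pageInfo": {"hasNextPage": False, "endCursor": None},
            "nodes": [
                {
                    "ghsaId": "GHSA-c653-6hhg-9x92",
                    "summary": "placeholder summary",
                    "permalink": "https://github.com/advisories/GHSA-c653-6hhg-9x92",
                    "references": [{"url": "https://example.com/fix-commit"}],
                }
            ],
        }
    }
}

refs = extract_references(sample_response)
payload = json.loads(build_payload(first=25))
print(refs)
```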

pombredanne commented 3 years ago

> Btw, I just found out that the GraphQL API also provides the references, and those match up with the data in the web UI.

Much better then!

pombredanne commented 1 year ago

The combo of the GraphQL API and the OSV-formatted git repo could make this moot... or not. The JSON at https://github.com/github/advisory-database/blob/5b6aa08e4edaca41f91dbe18cf8c6fd65cefe528/advisories/github-reviewed/2023/01/GHSA-c653-6hhg-9x92/GHSA-c653-6hhg-9x92.json does not contain the "credit" information shown at https://github.com/advisories/GHSA-c653-6hhg-9x92, and the data structure is different, in a likely lossy way.
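One way to check for this kind of gap across the repo would be a small helper like the sketch below. It assumes credit data, when present, would live under a top-level `credits` key (the name the OSV schema uses, as far as I know). The sample record is a hand-made stub, not the content of the real file.

```python
import json


def missing_fields(osv_record, expected=("credits",)):
    """Return the expected top-level fields that are absent or empty in a record."""
    return [name for name in expected if not osv_record.get(name)]


# Hand-made stub standing in for a parsed advisory JSON file.
sample = json.loads('{"id": "GHSA-c653-6hhg-9x92"}')

missing = missing_fields(sample)
print(missing)
```

Running this over every file in a checkout of the advisory repo would give a rough count of how many advisories lack the field.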