ossf / wg-securing-critical-projects

Helping allocate resources to secure the critical open source projects we all depend on.
Apache License 2.0

Some possible data sources to identify package managers, build systems and compilers (build toolchain) #41

Open bureado opened 2 years ago

bureado commented 2 years ago

The current spreadsheet shows package managers as candidate projects, and has build toolchains (generally comprising build systems, compilers and associated tooling) in the considered list. While the list is not overwhelmingly big, I suggest using existing taxonomies to seed this list. Here are a few examples:

WikiData

https://en.wikipedia.org/wiki/List_of_software_package_management_systems

Since that list is prose rather than structured data, see https://www.wikidata.org/wiki/Q6639720 and then something like https://www.wikidata.org/wiki/Q98400269, leading to this in-wikidata index or this WikiData query. Either can help compute the transitive closure (e.g., the package managers, compilers and tools involved in delivering another critical component).
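As a rough illustration of the kind of query meant here, below is a minimal Python sketch that builds a SPARQL query for the public Wikidata endpoint and flattens its JSON results. It assumes that the item passed in is a class other items link to via "instance of" (P31) and "subclass of" (P279); whether Q98400269 is the right class for that pattern is an assumption to verify. The function names and the User-Agent string are made up for this example.

```python
import json
import re
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(class_qid, limit=500):
    """Build a SPARQL query for items that are instances of class_qid
    or of any of its subclasses (wdt:P31/wdt:P279*)."""
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not a Wikidata QID: {class_qid!r}")
    return (
        "SELECT ?item ?itemLabel ?website WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "  OPTIONAL { ?item wdt:P856 ?website . }\n"  # P856 = official website
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        entity_uri = binding["item"]["value"]
        rows.append({
            "qid": entity_uri.rsplit("/", 1)[-1],
            "label": binding.get("itemLabel", {}).get("value"),
            "website": binding.get("website", {}).get("value"),
        })
    return rows


def fetch_instances(class_qid, limit=500):
    """Run the query against the live endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(class_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            # Hypothetical User-Agent; use one that identifies your tool.
            "User-Agent": "critical-projects-seed-list-sketch/0.1",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_instances("Q98400269")` would return one dict per matching item; the official website, where present, gives a first hint for mapping an item back to an upstream project.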

GitHub topics

Note that a significant number of components in this category predate GitHub (and git) and might not have a mirror or otherwise have a clear footprint in the following list.

https://github.com/topics/package-manager
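The same topic can also be pulled programmatically. Here is a minimal sketch using GitHub's repository search REST API with a `topic:` qualifier; the helper names are invented for this example, and unauthenticated requests are rate-limited, so treat it as a starting point rather than a complete crawler.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def topic_search_url(topic, per_page=50, page=1):
    """Build a repository search URL for one topic, most-starred first."""
    params = {
        "q": f"topic:{topic}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def summarize(payload):
    """Reduce a search response to (full_name, html_url, stars) tuples."""
    return [
        (repo["full_name"], repo["html_url"], repo["stargazers_count"])
        for repo in payload.get("items", [])
    ]


def fetch_topic(topic, per_page=50, page=1):
    """Query the live API (needs network access)."""
    request = urllib.request.Request(
        topic_search_url(topic, per_page, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize(json.load(response))
```

For example, `fetch_topic("package-manager")` would return the first page of repositories carrying that topic. As noted above, projects that predate GitHub will be missing or underrepresented in these results.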

debtags

Several Debian packages are tagged via debtags; relevant tags include:

https://debtags.debian.org/reports/taginfo/devel::compiler (n=150)
https://debtags.debian.org/reports/taginfo/devel::buildtools (n=150)
https://debtags.debian.org/reports/taginfo/devel::runtime (n=50)
https://debtags.debian.org/reports/taginfo/admin::package-management (n=150)
https://debtags.debian.org/reports/taginfo/works-with::software:package (n=160)
https://debtags.debian.org/reports/taginfo/works-with::software:source (n=400)

Such tags can be queried via a local debtags utility, a facility like apt-xapian-index, or a point-in-time snapshot of the UDD (Ultimate Debian Database), e.g., in a Postgres instance. A benefit of using UDD is that the tagged package names can be joined with other data, such as the upstream project URL, which can help resolve the Debian package name to something more universal. Alternatively, approaches such as https://github.com/repology/repology-rules can be used.
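To make the "select by tag, then join with upstream URLs" idea concrete, here is a small self-contained Python sketch. It assumes the tag data is available as text lines of the form `package: tag1, tag2, ...` (an assumption about the export format, not something confirmed here) and that a package-to-homepage mapping has been obtained separately, for instance from a UDD snapshot. All function names and the sample data are hypothetical.

```python
# Tags from the list above, used to select build-toolchain candidates.
TOOLCHAIN_TAGS = {
    "devel::compiler",
    "devel::buildtools",
    "devel::runtime",
    "admin::package-management",
    "works-with::software:package",
    "works-with::software:source",
}


def parse_tag_lines(lines):
    """Parse 'package: tag1, tag2' lines into {package: set_of_tags}.

    The line format is an assumption; adapt it to the real export.
    """
    tags_by_package = {}
    for raw in lines:
        line = raw.strip()
        if ": " not in line:
            continue  # skip blank lines and packages without tags
        package, _, rest = line.partition(": ")
        tags = {tag.strip() for tag in rest.split(",") if tag.strip()}
        tags_by_package.setdefault(package, set()).update(tags)
    return tags_by_package


def select_packages(tags_by_package, wanted=TOOLCHAIN_TAGS):
    """Return sorted package names carrying at least one wanted tag."""
    return sorted(
        package
        for package, tags in tags_by_package.items()
        if tags & set(wanted)
    )


def join_homepages(packages, homepages):
    """Pair each package with its upstream URL (None when unknown)."""
    return [(package, homepages.get(package)) for package in packages]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the inputs.
    sample_lines = [
        "gcc: devel::compiler, devel::lang:c",
        "cmake: devel::buildtools, interface::commandline",
        "vim: use::editing",
    ]
    sample_homepages = {"gcc": "https://gcc.gnu.org/"}
    selected = select_packages(parse_tag_lines(sample_lines))
    for package, url in join_homepages(selected, sample_homepages):
        print(package, url)
```

Packages that come back without a URL are the ones where a name-normalization approach such as repology-rules would be needed.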

A few tags can also help approximate a "critical to trust" definition. I'll open another issue on that topic in particular.

In general, the package counts (n=) for the tags above are small enough that the work could also be done manually or crowdsourced. One example of a more direct list of package managers/ecosystems of interest is https://libraries.io/languages.