ossf / criticality_score

Gives criticality score for an open source project
Apache License 2.0

Ideas to identify direct dependencies #31

Open mboehme opened 3 years ago

mboehme commented 3 years ago

To understand how critical a project P is, it would be worthwhile to track which projects directly or indirectly depend on P. The larger this set of dependent projects, the more critical P is.

This issue looks at the first step: finding ways to programmatically establish the direct dependencies of a project. Let's find the out-degree first.
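
To make the two quantities concrete, here is a minimal sketch over a toy graph. The graph `DEPS` and the function names are made up for illustration; it assumes the direct dependencies of every project are already known.

```python
from collections import deque


def out_degree(deps, project):
    """Number of direct dependencies of `project`."""
    return len(deps.get(project, ()))


def transitive_dependents(deps, project):
    """All projects that directly or indirectly depend on `project`."""
    # Invert the edges: reverse[d] holds the projects that directly depend on d.
    reverse = {}
    for src, targets in deps.items():
        for dst in targets:
            reverse.setdefault(dst, set()).add(src)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([project])
    while queue:
        current = queue.popleft()
        for parent in reverse.get(current, ()):
            if parent not in seen and parent != project:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical toy graph: project -> set of direct dependencies.
DEPS = {
    "app": {"web", "cli"},
    "web": {"http"},
    "cli": {"http"},
    "http": {"tls"},
    "tls": set(),
}

print(out_degree(DEPS, "app"))
print(sorted(transitive_dependents(DEPS, "tls")))
```

The size of the set returned by `transitive_dependents` is the number the issue is ultimately after; the out-degree only needs a project's own manifest.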

mboehme commented 3 years ago

You can find the direct dependencies for a project via the GitHub GraphQL DependencyGraphManifestConnection.
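
A rough sketch of what such a query could look like from Python, using only the standard library. The field names follow my reading of the GraphQL schema, and the `hawkgirl-preview` Accept header is an assumption (it was required for the dependency graph fields when I last checked). The helper names and the sample response are made up for illustration, and only the first 10 manifests and 100 dependencies per manifest are requested.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Assumed query shape: manifests of a repository and the packages each one lists.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    dependencyGraphManifests(first: 10) {
      nodes {
        filename
        dependencies(first: 100) {
          nodes { packageName requirements }
        }
      }
    }
  }
}
"""


def build_request(owner, name, token):
    """Build (but do not send) the POST request for the query above."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"owner": owner, "name": name}}
    ).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        headers={
            "Authorization": f"bearer {token}",
            # Assumption: preview media type for the dependency graph fields.
            "Accept": "application/vnd.github.hawkgirl-preview+json",
            "Content-Type": "application/json",
        },
    )


def package_names(response):
    """Collect the distinct package names from a decoded response."""
    manifests = response["data"]["repository"]["dependencyGraphManifests"]["nodes"]
    names = set()
    for manifest in manifests:
        for dep in manifest["dependencies"]["nodes"]:
            names.add(dep["packageName"])
    return sorted(names)


def direct_dependencies(owner, name, token):
    """Send the query and return the direct dependencies (needs network access)."""
    with urllib.request.urlopen(build_request(owner, name, token)) as resp:
        return package_names(json.load(resp))
```

Pagination is left out; a repository with many manifests would need cursor-based paging on both connections.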

Ed Summers wrote a small utility, called xkcd2347, that walks the dependencies of a project up to a given depth.

Install & Use

pip install xkcd2347
xkcd2347 --depth 2 kubernetes/kubernetes # It will ask for your GitHub token.

Use as library

import xkcd2347

gh = xkcd2347.GitHub(key="yourkeyhere")

for dep in gh.get_dependencies('kubernetes', 'kubernetes'):
    print(dep['packageName'])

Where this information comes from

This is what GitHub parses to construct the Dependency Graph:

[Screenshot: Screen Shot 2020-12-13 at 11 04 02 pm]

Can we go backwards and find all dependents, i.e. the projects that depend on a given project?

Doesn't seem so.

Can GitHub folks help analyze the Dependency Graph and get the number of projects that directly or indirectly depend on a given project?

mboehme commented 3 years ago

apt (Advanced Package Tool)

Counting the number of packages that directly or indirectly depend on curl.

$ apt-cache rdepends --no-recommends --no-suggests --no-enhances --recurse curl | grep -v "Reverse Depends:" | wc -l
329966

Counting the number of packages upon which curl directly or indirectly depends.

$ apt-cache depends --no-recommends --no-suggests --no-enhances --recurse curl | wc -l
54366
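
One caveat: `wc -l` counts output lines, and with `--recurse` the same package can show up both as a heading and under every package that lists it, so the figures above are probably upper bounds rather than counts of distinct packages. A small sketch for deduplicating the output follows. The parsing rules reflect my understanding of the `apt-cache` text format, and the sample strings and package names are invented, not real apt output.

```python
def unique_packages(apt_cache_output, root):
    """Distinct package names in `apt-cache depends/rdepends --recurse` output."""
    names = set()
    for line in apt_cache_output.splitlines():
        # Entries are indented; alternatives are assumed to be prefixed with "|".
        text = line.strip().lstrip("|").strip()
        if not text or text == "Reverse Depends:":
            continue
        # `apt-cache depends` prefixes entries with a relation, e.g. "Depends: foo".
        text = text.split(": ", 1)[-1]
        # Assumption: virtual packages are printed as <name>; skip them here.
        if text.startswith("<") and text.endswith(">"):
            continue
        names.add(text)
    names.discard(root)  # do not count the package we started from
    return names


# Invented sample in the shape of `apt-cache rdepends --recurse` output.
SAMPLE_RDEPENDS = """\
root-pkg
Reverse Depends:
  pkg-a
  pkg-b
pkg-a
Reverse Depends:
  pkg-c
pkg-b
Reverse Depends:
  pkg-c
 |pkg-d
pkg-c
Reverse Depends:
pkg-d
Reverse Depends:
"""

# Invented sample in the shape of `apt-cache depends --recurse` output.
SAMPLE_DEPENDS = """\
root-pkg
  Depends: pkg-a
 |Depends: pkg-b
  Depends: <virtual-pkg>
    pkg-c
pkg-a
  Depends: pkg-c
"""

print(len(unique_packages(SAMPLE_RDEPENDS, "root-pkg")))
print(len(unique_packages(SAMPLE_DEPENDS, "root-pkg")))
```

In practice the text would come from `subprocess.run([...], capture_output=True, text=True).stdout` with the same `apt-cache` arguments as above.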

mboehme commented 3 years ago

npm (Node Package Manager)

Counting the number of packages upon which react depends.

$ npm install -g npm-remote-ls
$ echo $(( $(npm-remote-ls react | wc -l) - 1))
4

dlorenc commented 3 years ago

@mboehme this is awesome! I really like the graph idea. I think "centrality" could be an awesome signal into the final criticality score (maybe even the dominant one), but I don't see how we could use it to compare across package-manager ecosystems. I'm not aware of any way today to get the full, global graph of dependencies, which is what we would really need here.

For example, the PyPI graph could be amazing for comparing within PyPI, but it would never show you that CPython itself is a dependency of (almost) every package on PyPI.

dlorenc commented 3 years ago

Also, thanks for the pointer to the GraphQL API! I missed this when I was playing around at first, because it's not available for Go yet, which is where I started looking.

tgamblin commented 3 years ago

@andrew’s https://libraries.io tracks many package manager ecosystems and has APIs for many things, including dependents (https://libraries.io/api).

It is extensible; you can add support for new package managers: https://github.com/librariesio/libraries.io/blob/master/docs/add-a-package-manager.md

Extending it to cover Spack packages is still on my list.
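
For anyone who wants to try the dependents endpoint mentioned above, a minimal sketch. The URL pattern (`/api/:platform/:name/dependents`) and the `api_key`, `page` and `per_page` parameters are taken from my reading of the API docs, so treat them as assumptions; the function names are made up.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://libraries.io/api"


def dependents_url(platform, name, api_key, page=1, per_page=100):
    """Build the URL for one page of a package's dependents."""
    # Percent-encode each path segment so scoped names like "@babel/core" survive.
    path = "/".join(
        urllib.parse.quote(part, safe="") for part in (platform, name, "dependents")
    )
    query = urllib.parse.urlencode(
        {"api_key": api_key, "page": page, "per_page": per_page}
    )
    return f"{API_ROOT}/{path}?{query}"


def fetch_dependents(platform, name, api_key, page=1):
    """Fetch one page of dependents as decoded JSON (needs network and a key)."""
    with urllib.request.urlopen(dependents_url(platform, name, api_key, page)) as resp:
        return json.load(resp)
```

Counting all dependents would mean paging until an empty list comes back, and the API is rate limited, so this is better suited to spot checks than to crawling a whole ecosystem.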

mboehme commented 3 years ago

> I think "centrality" could be an awesome signal into the final criticality score (maybe even the dominant one), but I don't see how we could use it to compare across package-manager ecosystems. [..]
>
> For example, the PyPI graph could be amazing for comparing within PyPI, but it would never show you that CPython itself is a dependency of (almost) every package on PyPI.

I agree. There are certain dependencies that cannot be tracked. For instance, dependence on the kernel or the compiler / interpreter won't be that explicit. The importance of those projects is more visible in the other signals of the criticality score.

dlorenc commented 3 years ago

> I agree. There are certain dependencies that cannot be tracked. For instance, dependence on the kernel or the compiler / interpreter won't be that explicit. The importance of those projects is more visible in the other signals of the criticality score.

If that theoretical graph did exist somehow, all of this would be much simpler!

dlorenc commented 3 years ago

> @andrew’s https://libraries.io tracks many package manager ecosystems and has APIs for many things, including dependents (https://libraries.io/api).

So many cool things to look at! I had no idea libraries.io had an API. Adding this to the list of things to play around with.

inferno-chromium commented 3 years ago

Adding another pointer from Georgios at Facebook.

https://github.com/fasten-project/fasten

dlorenc commented 3 years ago

I started a quick doc here with notes from playing around with libraries.io: https://docs.google.com/document/d/1Du2rDDd_nueH6BVZmVrrVSSGECnhjde_F3inNT9QzL8/edit#heading=h.yg897byn3xrw

Feel free to add others and join in the fun!

jli commented 3 years ago

I was going to point to Libraries.io, glad you've already come across it 👍

Just a note, I've played with the Libraries.io data a bit and noticed some staleness issues in some cases. Also found some circular dependencies, for example: https://libraries.io/pypi/aniso8601/dependents https://libraries.io/pypi/relativetimebuilder/dependents (I looked into this one, and found an older version of aniso8601 used to depend on relativetimebuilder, perhaps it's a staleness issue?)

Just a heads up!
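
If it helps, cycles like this are easy to flag before computing any graph metric. A minimal depth-first sketch; the toy graph only mirrors the pair above for illustration and is not pulled from Libraries.io data.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in sorted(deps.get(node, ())):
            colour = state.get(nxt, WHITE)
            if colour == GREY:
                # Back edge: nxt is already on the current path, so we found a cycle.
                return path[path.index(nxt):] + [nxt]
            if colour == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for start in sorted(deps):
        if state.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Illustrative graph: package -> set of direct dependencies.
DEPS = {
    "aniso8601": {"relativetimebuilder"},
    "relativetimebuilder": {"aniso8601"},
}

print(find_cycle(DEPS))
```

The recursion is fine for small graphs; a full registry would need an iterative version to stay under Python's recursion limit.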

inferno-chromium commented 3 years ago

Thanks @jli. I think from the libraries.io SourceRank, we just need to take out the dependent_projects and dependent_repositories calculation: https://github.com/librariesio/libraries.io/blob/ad830db5f08c11a82c569c847c04451c57f0a624/app/models/concerns/source_rank.rb#L34

Zac-HD commented 3 years ago

> For example, the PyPI graph could be amazing for comparing within PyPI, but it would never show you that CPython itself is a dependency of (almost) every package on PyPI.

This illustrates that the dependency graph isn't just binary "edge or no edge" - for many Python packages you need CPython or PyPy (or Jython, or...). How do we model a dependency on one-of-N packages?

Many packages also have optional dependencies: for example my own Hypothesis project has minimal mandatory dependencies, but a variety of optional extensions for numeric code, Django, automated refactoring of downstream code, etc. Do those relationships show up?

Runtime vs dev-time dependencies have a similar character, but the latter might be security-critical: you might not worry about a linter, but a compromised compiler could cause a lot of trouble.
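
One possible way to model these three distinctions, purely as a sketch: each requirement carries a set of alternatives (any one satisfies it), a scope, and an optional flag. All names here (`Requirement`, `Scope`, `dependents_of`, the sample packages) are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    RUNTIME = "runtime"
    DEV = "dev"


@dataclass(frozen=True)
class Requirement:
    alternatives: frozenset  # any ONE of these packages satisfies the requirement
    scope: Scope = Scope.RUNTIME
    optional: bool = False


def dependents_of(requirements, target, include_optional=False, include_dev=False):
    """Projects that have a requirement which `target` could satisfy."""
    result = set()
    for project, reqs in requirements.items():
        for req in reqs:
            if target not in req.alternatives:
                continue
            if req.optional and not include_optional:
                continue
            if req.scope is Scope.DEV and not include_dev:
                continue
            result.add(project)
    return result


# Hypothetical data: project -> list of requirements.
REQS = {
    "some-lib": [
        Requirement(frozenset({"cpython", "pypy"})),             # one-of-N
        Requirement(frozenset({"numpy"}), optional=True),        # optional extra
        Requirement(frozenset({"flake8"}), scope=Scope.DEV),     # dev-time only
    ],
    "other-lib": [Requirement(frozenset({"cpython"}))],
}

print(sorted(dependents_of(REQS, "cpython")))
print(sorted(dependents_of(REQS, "flake8", include_dev=True)))
```

Whether a one-of-N edge should count fully toward each alternative or be split between them is left open here; the sketch counts it fully.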