librariesio / libraries.io

:books: The Open Source Discovery Service
https://libraries.io
GNU Affero General Public License v3.0

Replace queries against RepositoryDependency with Dependency #3362

Closed tiegz closed 2 months ago

tiegz commented 2 months ago

This PR adds a Repository#projects_dependencies method and uses it in the endpoints/codepaths where we want to fetch a repo's deps.

Flowchart: how fetching deps for a repository is changed by this PR

flowchart TB
subgraph "AFTER"
    direction TB
    Repository1["Repository"] --> Project
    Project --> Version
    Version --> Dependency1["Dependency 1\n e.g. 'rack 1.0.0'"]
    Version --> Dependency2["Dependency 2\n e.g. 'dalli ~> 2.0.0'"]
    Version --> Dependency3["Dependency 3\n e.g. 'bibliothecary ~ 1.7'"]
end
subgraph "BEFORE"
    direction TB
    Repository2["Repository"] --> RepositoryDependency4["RepositoryDependency 1\n e.g. 'rack 1.0.0'"]
    Repository2["Repository"] --> RepositoryDependency5["RepositoryDependency 2\n e.g. 'dalli ~> 2.0.0'"]
    Repository2["Repository"] --> RepositoryDependency6["RepositoryDependency 3\n e.g. 'bibliothecary ~ 1.7'"]
end
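
The "AFTER" path can be sketched in plain Ruby, without Rails. This is only an illustration of the traversal in the flowchart, not the PR's actual implementation, which presumably is an ActiveRecord query. The Struct fields, the `latest_version` helper, and the choice to read deps from each project's latest version are all assumptions made for the example.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models; field names are assumptions.
Dependency = Struct.new(:name, :requirements)
Version    = Struct.new(:number, :dependencies)
Project    = Struct.new(:name, :versions) do
  # Assumption: the last entry in `versions` is the latest release.
  def latest_version
    versions.last
  end
end

Repository = Struct.new(:full_name, :projects) do
  # Hypothetical sketch of Repository#projects_dependencies:
  # walk Repository -> Project -> Version -> Dependency and flatten the result.
  def projects_dependencies
    projects.flat_map do |project|
      version = project.latest_version
      version ? version.dependencies : []
    end
  end
end

repo = Repository.new(
  "librariesio/libraries.io",
  [
    Project.new("example-gem", [
      Version.new("1.0.0", [Dependency.new("rack", "1.0.0")]),
      Version.new("1.1.0", [
        Dependency.new("rack", "1.0.0"),
        Dependency.new("dalli", "~> 2.0.0"),
      ]),
    ]),
    # A project with no published versions contributes no deps.
    Project.new("unreleased-gem", []),
  ]
)

repo.projects_dependencies.each { |dep| puts "#{dep.name} #{dep.requirements}" }
```

In this sketch a repository with no published packages returns an empty list, which is the main behavioral difference from scanning manifests.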

RepositoryDependency vs Dependency: where is the data from?

RepositoryDependency is populated by scanning all the manifest files for deps in the repository, e.g. https://github.com/librariesio/libraries.io.

Dependency is populated by scanning the actual package's dependencies from its package repository API.

Why?

Scanning all the manifests in a repository for RepositoryDependency is inaccurate for many repos

e.g. for an NPM project, Libraries currently records the deps from its package-lock.json manifest plus every dep from all the manifests nested inside node_modules/.

Libraries is also pulling deps from test folders, template folders, build output folders, educational folders, etc.

Removed repositories take up space

Of the top 100 repositories with the most manifests, 12 are Removed.

Package deps > repository deps

The deps for a specific package are more useful than every dep found anywhere in its repository.

Alternatives exist

GitHub (which accounts for 98% of the repos on Libraries) now has an Insights > Dependency graph page that lists the deps found in the entire repo, if people still need that data.

Which Libraries endpoints will this affect?

Breaking changes?

Other than the source of the dependency data changing, the API will start returning filepath: nil for all repository dependencies, because in most cases we don't have a filepath relative to the repo.
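
As a rough illustration of what API clients would see, here is a minimal sketch of a serializer that always emits a nil filepath. The method name and every field other than `filepath` are hypothetical and are not taken from the Libraries API.

```ruby
require "json"

# Hypothetical serializer; field names other than `filepath` are assumptions.
def serialize_repository_dependency(dependency)
  {
    project_name: dependency[:project_name],
    requirements: dependency[:requirements],
    # Package-level data carries no repo-relative manifest path, so this is always nil.
    filepath: nil,
  }
end

payload = serialize_repository_dependency({ project_name: "rack", requirements: "1.0.0" })
puts JSON.generate(payload)
# prints {"project_name":"rack","requirements":"1.0.0","filepath":null}
```

Clients that group or filter dependencies by filepath would need to handle the null value.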

Next step

After this is deployed, and if there are no issues with it, we can follow up with a PR to stop ingesting RepositoryDependencies and drop the table.