Dewey

Index of Clojure libraries available on github.

Analysis:

Web frontends:

Rationale

The goal of this project is to make the clojure libraries available on github easier to programmatically list and inspect.

With deps.edn, dependencies can be procured directly from github. However, finding clojure libraries that are available via github can be more difficult than finding them on clojars. Clojars provides several data endpoints to list available libraries and metadata. Even though similar info is available from github, it's not quite as easy to obtain.

Getting the data

Pre-retrieved data can be found at releases.

What's included?

Each release includes the following files in .gz or tar.gz format:

All the .edn or .edn.gz files can be read using com.phronemophobic.dewey.util/read-edn. For example:

(require 'com.phronemophobic.dewey.util)
(def data (com.phronemophobic.dewey.util/read-edn fname))

Analysis Data

clj-kondo analyses for each project can be found in the releases under analysis.edn.gz. This file can be quite chonky. For an example of how to process the data, see the stats example.

The file contains a vector of maps, with each map containing the following keys:

The file is specially formatted edn so that it can be processed without reading the full contents into memory. The first line is [, the last line is ], and every line in between is a single map.
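Because of this line-per-map layout, the file can be reduced over one line at a time. The following is a minimal sketch, not Dewey's own code: `reduce-analysis` is a hypothetical helper name, and it assumes the file is gzip-compressed and that each map parses with `clojure.edn` (unknown tagged literals are kept as-is rather than rejected).

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io]
         '[clojure.string :as str])
(import 'java.util.zip.GZIPInputStream)

(defn reduce-analysis
  "Hypothetical helper: reduces over the maps in a line-delimited,
  gzipped analysis file without reading the whole file into memory.
  `f` and `init` work as in `reduce`."
  [f init fname]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream fname)))]
    (reduce (fn [acc line]
              (let [line (str/trim line)]
                ;; Skip the opening "[" and closing "]" lines (and blanks).
                (if (or (str/blank? line) (= "[" line) (= "]" line))
                  acc
                  ;; Assumption: every other line is exactly one edn map.
                  (f acc (edn/read-string {:default tagged-literal} line)))))
            init
            (line-seq rdr))))
```

For example, `(reduce-analysis (fn [n _] (inc n)) 0 "analysis.edn.gz")` would count the maps in the file while holding only one of them in memory at a time.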

Generating the dataset via the github API

To retrieve the data yourself, follow step 0 and then run:

# creates releases/yyyy-MM-dd/all-repos.edn
clojure -X:update-clojure-repo-index

# downloads all deps files to releases/yyyy-MM-dd/<user>/<project>/deps.edn
# due to rate limits, takes around 3 hours (mostly sleeping).
clojure -X:download-deps

# downloads tags for each deps.edn clojure library to releases/yyyy-MM-dd/deps-tags.edn
clojure -X:update-tag-index

# creates an index of library name to library metadata in releases/yyyy-MM-dd/deps-libs.edn
clojure -X:update-available-git-libs-index

These commands must be run in order.

Finding Clojure Libraries Methodology

Github search is quirky and has certain limitations imposed by rate-limiting. Below is a short synopsis of how Dewey attempts to locate clojure projects on github within the limitations imposed by github's API.

Current Method

  0. Authentication
  1. Find all clojure libraries
  2. Download all deps.edn files

0. Authentication

Dewey uses personal access tokens to make github API requests. You can obtain a personal access token by following these docs.

Once you have obtained your personal access token, save it to an edn file called "secrets.edn" in the project root directory using the following format:

{:github {:user "my-username"
          :token "my-token"}}
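As an illustration of how a file in this format might be consumed, here is a minimal sketch. It is not taken from Dewey's source: `load-secrets` and `basic-auth-header` are hypothetical names, and the use of HTTP Basic authentication (username plus token) is an assumption about how the credentials could be sent to the github API.

```clojure
(require '[clojure.edn :as edn])
(import 'java.util.Base64)

(defn load-secrets
  "Hypothetical helper: reads the secrets map from `path`
  (defaults to secrets.edn in the current directory)."
  ([] (load-secrets "secrets.edn"))
  ([path] (edn/read-string (slurp path))))

(defn basic-auth-header
  "Hypothetical helper: builds an HTTP Basic Authorization header value
  from the :github entry of the secrets map."
  [secrets]
  (let [{:keys [user token]} (:github secrets)]
    (assert (and user token)
            "secrets.edn must contain {:github {:user ... :token ...}}")
    (str "Basic "
         (.encodeToString (Base64/getEncoder)
                          (.getBytes (str user ":" token) "UTF-8")))))
```

Calling `(basic-auth-header (load-secrets))` from the project root would then yield a value suitable for an `Authorization` request header.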

1. Find all clojure libraries

Currently, the first step is to paginate through the results of the github repository search language:clojure, sorted by stars in descending order. There's a 1,000 result limit for any specific search, so after exhausting the results from language:clojure, we search for repositories with specific numbers of stars, starting at the star count of the last result and working downward. The search queries for these requests look like language:clojure stars:123, language:clojure stars:122, etc.
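The query sequence described above can be sketched as follows. This is an illustration under assumptions rather than Dewey's implementation: `star-queries` and `search-url` are hypothetical names, the countdown is assumed to continue to zero stars, and the URL targets github's REST repository search endpoint with 100 results per page (so 10 pages reach the 1,000 result limit).

```clojure
(import 'java.net.URLEncoder)

(defn star-queries
  "Hypothetical helper: returns one query per star count, from
  `start-stars` down to 0 (the lower bound is an assumption)."
  [start-stars]
  (map #(str "language:clojure stars:" %)
       (range start-stars -1 -1)))

(defn search-url
  "Hypothetical helper: builds a repository search URL for `query`
  and a 1-based `page`, sorted by stars in descending order."
  [^String query page]
  (str "https://api.github.com/search/repositories"
       "?q=" (URLEncoder/encode query "UTF-8")
       "&sort=stars&order=desc&per_page=100&page=" page))
```

For example, `(take 2 (star-queries 123))` returns the two queries quoted above, and each can be passed to `search-url` for pages 1 through 10.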

2. Download all deps.edn files

Once we have a list of clojure github repositories, we can then check each repository for its deps.edn file. Given a repository, the url for the deps.edn file looks like (str "https://raw.githubusercontent.com/" full-name "/" default-branch "/" fname).
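A minimal sketch of this step is shown below. The names `deps-url` and `fetch-deps` are hypothetical, and the repository map is assumed to carry the `full_name` and `default_branch` fields of a github API repository result as keywords. No rate limiting or retrying is attempted here.

```clojure
(defn deps-url
  "Hypothetical helper: builds the raw deps.edn url for a repository map.
  Assumes :full_name (\"user/project\") and :default_branch keys."
  [{:keys [full_name default_branch]}]
  (str "https://raw.githubusercontent.com/"
       full_name "/" default_branch "/" "deps.edn"))

(defn fetch-deps
  "Hypothetical helper: returns the deps.edn contents as a string,
  or nil when the repository has no deps.edn (a 404 response)."
  [repo]
  (try
    (slurp (deps-url repo))
    (catch java.io.FileNotFoundException _
      nil)))
```

For a made-up repository map such as `{:full_name "some-user/some-lib" :default_branch "main"}`, `deps-url` returns `https://raw.githubusercontent.com/some-user/some-lib/main/deps.edn`.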

Current known limitations

Failed Strategies

Searching github code with "filename:deps.edn"

I thought just asking github for all the files named deps.edn might work. The roadblocks I ran into were:

  1. Hitting secondary rate limits after 1-2 requests.
  2. Receiving only 0-3 results even on successful requests.

Alternative Strategies

These are strategies that I didn't try, but might be good alternatives if the main strategy fails.

Scanning clojure repos by created or updated

As suggested by this stackoverflow answer, you can narrow a search by a date field, such as when a repository was created or last updated. The search API currently limits results to a max of 1000, but if you search a small enough window of time, you can scan through all the libraries.
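Since this strategy was not tried, the following is only a sketch of how the time windows might be generated. `created-window-queries` is a hypothetical name, the window size is a free parameter, and the queries use github's `created:FROM..TO` date range qualifier. A real scan would also need to shrink any window that still returns more than 1000 results.

```clojure
(import 'java.time.LocalDate)

(defn created-window-queries
  "Hypothetical helper: splits the inclusive range from `start` to `end`
  (ISO dates as strings) into windows of `days` days and returns one
  search query per window."
  [start end days]
  (let [start (LocalDate/parse start)
        end   (LocalDate/parse end)]
    (->> (iterate #(.plusDays ^LocalDate % days) start)
         (take-while #(not (.isAfter ^LocalDate % end)))
         (map (fn [^LocalDate from]
                (let [to (.plusDays from (dec days))
                      ;; Clamp the last window to the requested end date.
                      to (if (.isAfter to end) end to)]
                  (str "language:clojure created:" from ".." to)))))))
```

For example, `(created-window-queries "2020-01-01" "2020-01-10" 7)` produces one query for 2020-01-01..2020-01-07 and one for 2020-01-08..2020-01-10.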

Relevant Github docs

Use github's GraphQL API

It's possible that github's GraphQL API might provide opportunities for improvement. However, it doesn't appear to have a way to filter by language or any other means of identifying repositories that are clojure related.

Future Work

Now that we've bothered to catalog all of the clojure repos on github, there are several interesting projects we can do that use the data:

License

Copyright © 2022 Adrian

Distributed under the Eclipse Public License version 1.0.