purescript / registry-dev

Development work related to the PureScript Registry
https://github.com/purescript/registry

Introduce structured effects #574

Closed thomashoneyman closed 1 year ago

thomashoneyman commented 1 year ago

IT'S A CHRISTMAS MIRACLE!

Yes, at long last, I've finished the app-effects pull request. This introduces structured effects to the registry. Along the way I found and fixed a lot of little bugs, added lots of logging and (much better) caching, and set the foundation for new features like the HTTP API.

For this PR I suggest you try the code before you review it. With a valid cache, the full legacy import (including mock publishing) runs in about 4 minutes on my machine. Without the cache it takes a while to build missing manifests, but subsequent runs are fast. The legacy importer is broken on master right now, so a dry run will "publish" a few packages.

Notably, you only need a valid GITHUB_TOKEN environment variable for a dry run. None of the other variables are even read from the environment.

git clone git@github.com:purescript/registry-dev && cd registry-dev
git checkout trh/app-effects
nix develop
registry-importer dry-run

One of the first things you'll notice is that logs are written to the console (with colored output). By default non-debug logs are written, but you can choose the verbosity that suits your task. We also print a link to a logfile containing the full debug logs. Logfiles have names like legacy-importer-2022-12-25T21:25:27.log and are created by the Log effect if you use the file system handler. They contain timestamped output like this:

[2022-12-25T21:26:11.212Z DEBUG] Reading metadata for deku
[2022-12-25T21:26:11.213Z DEBUG] Registry repo up to date, reading from cache...
[2022-12-25T21:26:11.213Z DEBUG] Read cache entry for AllMetadata in memory.
[2022-12-25T21:26:11.213Z DEBUG] Metadata validated. Fetching package source code...
[2022-12-25T21:26:11.214Z DEBUG] Using legacy Git clone to fetch package source at tag: { owner: "mikesol", ref: "v0.9.9", repo: "purescript-deku" }
[2022-12-25T21:26:12.913Z DEBUG] Cloned package source to /tmp/nix-shell.8Ct0ZI/nix-shell.xjJVGX/tmp-6792-bwwsk6lyhsGP/purescript-deku
[2022-12-25T21:26:12.914Z DEBUG] Getting published time...
[2022-12-25T21:26:12.922Z DEBUG] Package downloaded to /tmp/nix-shell.8Ct0ZI/nix-shell.xjJVGX/tmp-6792-bwwsk6lyhsGP/purescript-deku, verifying it contains a src directory...
[2022-12-25T21:26:12.924Z DEBUG] Package contains .purs files in its src directory.
[2022-12-25T21:26:12.924Z INFO] [NOTIFY] Package source does not have a purs.json file. Creating one from your bower.json and/or spago.dhall files...
[2022-12-25T21:26:12.925Z DEBUG] Listing tags for purescript/package-sets
[2022-12-25T21:26:12.925Z DEBUG] Read cache entry for GET /repos/purescript/package-sets/tags in memory.
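
The log lines above share one shape: a timestamp, a level, and a message, with the console filtering by verbosity while the logfile keeps everything. Here is a rough sketch of that idea, written in TypeScript rather than PureScript so it stands alone. Every name below is invented for illustration and is not the Log effect's actual API, and the WARN and ERROR levels are assumptions, since the sample only shows DEBUG and INFO.

```typescript
// Hypothetical sketch: these names are illustrative, not the registry's Log effect.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

const severity: Record<LogLevel, number> = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// Format one line the way the sample logfile does: [<ISO timestamp> <LEVEL>] <message>
function formatLine(at: Date, level: LogLevel, message: string): string {
  return `[${at.toISOString()} ${level}] ${message}`;
}

// A handler is a function that receives every log entry.
type LogHandler = (at: Date, level: LogLevel, message: string) => void;

// Console-style handler: drops entries below the chosen verbosity.
function terminalHandler(minimum: LogLevel, write: (line: string) => void): LogHandler {
  return (at, level, message) => {
    if (severity[level] >= severity[minimum]) write(formatLine(at, level, message));
  };
}

// File-style handler: keeps everything, including debug output.
function collectingHandler(sink: string[]): LogHandler {
  return (at, level, message) => {
    sink.push(formatLine(at, level, message));
  };
}

// Send each entry to several handlers at once.
function combine(...handlers: LogHandler[]): LogHandler {
  return (at, level, message) => handlers.forEach((handler) => handler(at, level, message));
}
```

Combining a `terminalHandler("INFO", console.log)` with a file-style handler gives the behaviour described above: a quiet console and a complete debug log.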

We can use this log as a little tour through the new effects (you can see all of the effects in the Registry.App.Effects directory).

Registry: At the beginning of the log we are "Reading metadata for deku". This comes from the Registry effect, which handles interactions with registry resources: reading, writing, and deleting manifests; reading and writing metadata; reading and writing package sets; mirroring legacy content; and so on. In user code you generally just write readMetadata "deku" when you need metadata, and the mechanics of retrieving that data are abstracted away.
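
To illustrate what "abstracted away" means here, the sketch below separates the operations user code asks for from the handler that carries them out. It uses plain dependency injection in TypeScript, which is a swapped-in technique, not the extensible-effects machinery this PR uses in PureScript. The `Metadata` fields, `inMemoryRegistry`, and `latestVersion` are all hypothetical.

```typescript
// Hypothetical shapes for illustration; the real effect lives in Registry.App.Effects.
interface Metadata {
  location: string;
  published: string[];
}

// The "effect": operations user code may request, with no mention of how they run.
interface RegistryEffect {
  readMetadata(name: string): Metadata | undefined;
  writeMetadata(name: string, metadata: Metadata): void;
}

// One possible handler: an in-memory store, as a test or dry run might use.
// A different handler could read from a local Git checkout instead.
function inMemoryRegistry(initial: Record<string, Metadata>): RegistryEffect {
  const store = new Map<string, Metadata>(Object.entries(initial));
  return {
    readMetadata: (name) => store.get(name),
    writeMetadata: (name, metadata) => {
      store.set(name, metadata);
    },
  };
}

// User code only asks for metadata; it never learns where the data came from.
function latestVersion(registry: RegistryEffect, name: string): string | undefined {
  const metadata = registry.readMetadata(name);
  return metadata?.published[metadata.published.length - 1];
}
```

Swapping the handler changes where the data comes from without touching `latestVersion`, which is the property the effect system provides.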

Git: Next, the log reports "Registry repo up to date". This comes from the Git effect, which handles pulling, committing, and so on. I've taken great care to make our git operations reliable, including regularly verifying that we are in sync with the origin (or ahead if we're pushing, or behind if we're pulling, and so on). The main handler for the Registry effect relies on the Git effect to make sure we have registry files available in a local Git checkout, and the Git effect makes sure that we haven't fallen out of sync.
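
One way to picture the sync check is as a classification of how far the local checkout is ahead of or behind the origin, followed by a per-operation rule. The TypeScript sketch below assumes the counts come from `git rev-list --left-right --count HEAD...origin/<branch>` after a fetch, which prints the ahead and behind counts separated by whitespace. The function names and the exact policy are assumptions based on the description above, not the Git effect's implementation.

```typescript
// Hypothetical sketch; the names below are illustrative, not the Git effect's API.
type SyncStatus = "in-sync" | "ahead" | "behind" | "diverged";
type GitOperation = "read" | "push" | "pull";

// Parse the "<ahead> <behind>" output of:
//   git rev-list --left-right --count HEAD...origin/<branch>
function parseAheadBehind(output: string): { ahead: number; behind: number } {
  const [ahead = NaN, behind = NaN] = output.trim().split(/\s+/).map(Number);
  if (Number.isNaN(ahead) || Number.isNaN(behind)) {
    throw new Error(`Unexpected rev-list output: ${JSON.stringify(output)}`);
  }
  return { ahead, behind };
}

function classify(ahead: number, behind: number): SyncStatus {
  if (ahead > 0 && behind > 0) return "diverged";
  if (ahead > 0) return "ahead";
  if (behind > 0) return "behind";
  return "in-sync";
}

// Assumed policy: being in sync is always fine, being ahead is fine
// only when pushing, and being behind is fine only when pulling.
function mayProceed(operation: GitOperation, status: SyncStatus): boolean {
  if (status === "in-sync") return true;
  if (operation === "push") return status === "ahead";
  if (operation === "pull") return status === "behind";
  return false;
}
```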

Repositories are now configurable! You can choose to point at a fork of the registry, for example. Everything else will work the same.

TypedCache: Next, we "Read cache entry for AllMetadata in memory". This comes from the TypedCache effect. This is the most intimidating of the new effects, because I've made the cache well-typed: when you get or put a value, you receive the actual type you wanted, not JSON that you must decode.

A typed cache is super important for a few reasons. First, our use of cache keys was totally uncontrolled and we could easily run into collisions; this is way harder to do when the values are typed. Second, you can't cache a ManifestIndex or other large structure in memory when the cache is untyped, because you have to serialize it on write and deserialize it on read, and that's too slow. (Or you can coerce it, which is unsafe.) Third, committing to a single serialization format makes the cache inflexible: we want to store most things as JSON, but other things as, say, a Buffer (e.g. a tarball) on the file system.

The TypedCache effect supports many independent caches, which means individual caches can be handled separately and code outside the registry proper (like the scripts directory) can still use the typed cache. For example, the Storage effect uses the cache to store tarballs on the file system. The Registry effect, on the other hand, uses the cache to store manifests and metadata in memory. It never stores any cache data on the file system because we rely on the Git repositories as the source of truth instead. Other caches like the Importer cache use an in-memory cache that falls back to a filesystem cache on cache miss.
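
The core idea of a typed cache, and of an in-memory cache that falls back to a slower one, can be sketched as follows. This is a TypeScript approximation in which each key carries the type of its value as a phantom type parameter. It is not the PureScript encoding used in this PR, and `CacheKey`, `cacheKey`, `memoryCache`, and `layered` are hypothetical names. The one unchecked cast is confined to the handler, so callers never decode or coerce anything themselves.

```typescript
// Hypothetical sketch of a typed cache; not the TypedCache effect's real encoding.

// A key remembers the type of value stored under it via a phantom field.
interface CacheKey<A> {
  readonly id: string;
  readonly _value?: A; // never set at runtime; only ties the key to its value type
}

function cacheKey<A>(id: string): CacheKey<A> {
  return { id };
}

interface TypedCache {
  get<A>(key: CacheKey<A>): A | undefined;
  put<A>(key: CacheKey<A>, value: A): void;
}

// In-memory handler: values are stored as-is, with no serialization step.
function memoryCache(): TypedCache {
  const store = new Map<string, unknown>();
  return {
    get<A>(key: CacheKey<A>): A | undefined {
      // The only cast lives here, justified by put() accepting just A for this key.
      return store.get(key.id) as A | undefined;
    },
    put<A>(key: CacheKey<A>, value: A): void {
      store.set(key.id, value);
    },
  };
}

// Try the fast cache first; on a miss, consult the slow one and promote the hit.
function layered(fast: TypedCache, slow: TypedCache): TypedCache {
  return {
    get<A>(key: CacheKey<A>): A | undefined {
      const hit = fast.get(key);
      if (hit !== undefined) return hit;
      const fallback = slow.get(key);
      if (fallback !== undefined) fast.put(key, fallback);
      return fallback;
    },
    put<A>(key: CacheKey<A>, value: A): void {
      fast.put(key, value);
      slow.put(key, value);
    },
  };
}
```

With a key declared as `cacheKey<Map<string, string>>("ManifestIndex")`, putting a string under that key is a compile-time error, and `get` returns the map itself with no decoding step.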

Notify: That [NOTIFY] message indicates a human-readable message we want to report to users. It's handled by the Notify effect, and when running scripts like the legacy importer it just writes to the console. An API could push a message over a websocket, and on GitHub we comment this message on the relevant issue.

GitHub: Towards the end we have "Listing tags for purescript/package-sets". This comes from the GitHub effect, which is for interacting with the GitHub API. We really need to start making use of conditional requests with this effect. I did more digging, and the If-Modified-Since header doesn't work, but the If-None-Match header given an etag does work. One of us ought to modify the Octokit.js bindings so that we can read the headers from responses.
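
The conditional-request flow mentioned here is: remember the ETag from a previous response, send it back in an If-None-Match header, and reuse the cached body when the server answers 304 Not Modified. Below is a minimal TypeScript sketch of just that decision logic, kept free of any HTTP client so it stands alone. The response model is deliberately simplified and hypothetical; it is not Octokit's type, and wiring it into the GitHub effect is left open.

```typescript
// Simplified, hypothetical model of a response; not Octokit's types.
type HttpResponse<A> =
  | { status: 200; etag: string; body: A }
  | { status: 304 };

interface CachedResponse<A> {
  etag: string;
  body: A;
}

// Headers to attach to the request: only send If-None-Match when we have an ETag.
function conditionalHeaders<A>(cached: CachedResponse<A> | undefined): Record<string, string> {
  return cached === undefined ? {} : { "If-None-Match": cached.etag };
}

// Decide what to keep after the response arrives.
function resolveResponse<A>(
  cached: CachedResponse<A> | undefined,
  response: HttpResponse<A>
): CachedResponse<A> {
  // Fresh content: replace the cache entry with the new ETag and body.
  if (response.status === 200) return { etag: response.etag, body: response.body };
  // 304 Not Modified only makes sense if we sent an ETag, so a cache entry must exist.
  if (cached === undefined) throw new Error("Received 304 Not Modified without a cached response");
  return cached;
}
```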

This small snippet of the log doesn't cover all the effects. Here are the others:

Storage: The Storage effect manages interactions with the storage backend, such as uploading, downloading, and deleting package tarballs. It has handlers for read-only situations (like the legacy importer dry-run) and for the S3 backend.

Pursuit: The Pursuit effect manages interactions with Pursuit, like publishing packages. It's a tiny effect for now, but it will grow when we replace purs publish with our own implementation, and I think there are some wonderful opportunities to rewrite Pursuit in PureScript and integrate it more closely with the registry.

PackageSets: The PackageSets effect is for upgrading the package sets. It provides upgradeAtomic and upgradeSequential.
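
The description only names these two operations, so the following TypeScript sketch is an interpretation of what the names suggest, not a statement of what the effect does: an atomic upgrade applies every change at once and succeeds or fails as a whole, while a sequential upgrade tries changes one at a time and keeps the ones that verify. The `Verify` callback stands in for whatever check the registry performs on a candidate set, and all types here are hypothetical.

```typescript
// Hypothetical sketch: semantics are inferred from the function names only.
type PackageSet = Readonly<Record<string, string>>; // package name -> version

// Stand-in for the real check on a candidate package set.
type Verify = (candidate: PackageSet) => boolean;

// All-or-nothing: apply every change, then verify the result once.
function upgradeAtomic(current: PackageSet, changes: PackageSet, verify: Verify): PackageSet | undefined {
  const candidate = { ...current, ...changes };
  return verify(candidate) ? candidate : undefined;
}

interface SequentialResult {
  set: PackageSet;
  accepted: string[];
  rejected: string[];
}

// One at a time: keep each change that verifies, and record the ones that do not.
function upgradeSequential(current: PackageSet, changes: PackageSet, verify: Verify): SequentialResult {
  let set = current;
  const accepted: string[] = [];
  const rejected: string[] = [];
  for (const [name, version] of Object.entries(changes)) {
    const candidate = { ...set, [name]: version };
    if (verify(candidate)) {
      set = candidate;
      accepted.push(name);
    } else {
      rejected.push(name);
    }
  }
  return { set, accepted, rejected };
}
```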

Env: The Env effect is for providing Reader environments. This is only lightly used, mainly to provide access to things like the pacchettibotti keys or the GitHub event data.


OK, so we've got a set of effects, and handlers for those effects. Why have so many files changed? First, when you replace the foundation of the app it's going to affect everything. Second, now that we can be precise with effects, a lot of massive functions in the app (like the API) are much better off refactored into smaller, more modular pieces. Finally, we have historically assumed that the whole registry is oriented around GitHub events, and that is no longer true. We can't just assume in the middle of our API pipeline that we can look up whether a username is on the trustees team to re-sign a payload, for example.

That means we have a big change: the API module no longer anticipates that it's being called as part of a GitHub event. Instead, the main operations (publish, etc.) know they can be called from a script (like the legacy importer) or from a payload coming over the wire in the upcoming HTTP API, or from a GitHub event. Other operations (packageSetUpdate) know they must come from a GitHub event because we rely on GitHub for authentication.

Since the API module no longer presupposes GitHub events, we now have a Main module. This module handles all the GitHub stuff: it decodes the operation from the GITHUB_EVENT_DATA, re-signs the payload if the GitHub event is an authenticated event sent by a trustee, and does other GitHub-specific initialization before passing things off to the API module.
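
The split described above can be pictured as an API that accepts an operation plus the kind of caller, and refuses operations that need GitHub authentication when they arrive any other way. The TypeScript sketch below is illustrative only: the `Caller` and `Operation` types, their payload fields, and `runOperation` are hypothetical and do not mirror the actual API or Main modules.

```typescript
// Hypothetical sketch of the caller/operation split; not the real API module.
type Caller = "script" | "http-api" | "github-event";

type Operation =
  | { kind: "publish"; name: string; ref: string }
  | { kind: "packageSetUpdate"; packages: Record<string, string> };

// publish may come from anywhere; packageSetUpdate relies on GitHub for
// authentication, so it is only accepted from a GitHub event.
function isPermitted(operation: Operation, caller: Caller): boolean {
  if (operation.kind === "packageSetUpdate") return caller === "github-event";
  return true;
}

// The API entry point knows nothing about how the operation was decoded.
function runOperation(operation: Operation, caller: Caller): string {
  if (!isPermitted(operation, caller)) {
    throw new Error(`${operation.kind} is not permitted from ${caller}`);
  }
  return `${operation.kind} accepted from ${caller}`;
}
```

In this picture the Main module's job is to turn GitHub event data into an `Operation` and call the API with the `"github-event"` caller, while a script or an HTTP server would construct the same `Operation` values directly.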

This lays the groundwork for the HTTP API, which no longer has to work around GitHub assumptions.

Next Steps

Once we merge this pull request I believe we only need to do one thing. The .cache dir is now part of the scratch directory, where the registry processes put all their other files, so we need to update the GitHub workflows on the registry repo to make sure that directory is being cached.

Otherwise, when this merges we just need to run the legacy import and make sure everything works as expected. I've tested the code thoroughly and I feel confident, but you never know when you actually deploy whether some small configuration option is off and needs to be tweaked.

I'll leave some comments in the code to aid review.

thomashoneyman commented 1 year ago

If you haven't reviewed this in a while, I'd recommend starting by running registry-importer dry-run and looking at the log file.

thomashoneyman commented 1 year ago

I've now verified that this branch fully works in the registry import / package set updater workflows. See: