Closed thomashoneyman closed 1 year ago
If you haven't reviewed this in a while, I'd recommend starting by running registry-importer dry-run
and looking at the log file.
I've now verified that this branch fully works in the registry import / package set updater workflows. See:
IT'S A CHRISTMAS MIRACLE!
Yes, at long last, I've finished the
app-effects
pull request. This introduces structured effects to the registry. Along the way I found and fixed a lot of little bugs, added lots of logging and (much better) caching, and laid the foundation for new features like the HTTP API.
For this PR I suggest you try the code before you review it. If you have a valid cache, you can run the full legacy import (including mock publishing) in about 4 minutes on my machine. Without the cache it takes a while to build missing manifests, but subsequent runs are fast. The legacy importer is broken on
master
right now, so a dry run will "publish" a few packages.
Notably, you only need to have a valid
GITHUB_TOKEN
environment variable for a dry run. None of the rest are even read from the environment.
One of the first things you'll notice is that logs are written to the console (with colored output). By default non-debug logs are written, but you can choose the verbosity that suits your task. We also print a link to a logfile containing the full debug logs. Logfiles have names like
legacy-importer-2022-12-25T21:25:27.log
and are created by the Log
effect if you use the file system handler. They contain timestamped output like this:
We can use this log as a little tour through the new effects (you can see all of the effects in the
Registry.App.Effects
directory).
Registry
At the beginning of the log we are
Reading metadata for deku
. This comes from the Registry
effect, which handles interacting with registry resources like reading, writing, and deleting manifests, reading and writing metadata, reading and writing package sets, mirroring legacy content, and so on. In user code you generally just have to write readMetadata "deku"
when you need metadata, and the mechanics of retrieving that data are abstracted away.
Git
Next, the log reports that the
Registry repo up to date
. This comes from the Git
effect for pulling, committing, and so on. I've taken great care to make our git operations reliable, including verifying regularly that we are in sync with the origin (or are ahead if we're pushing, or are behind if we're pulling, and so on). The main handler for the Registry
effect relies on the Git
effect to make sure we have registry files available in a local Git checkout, and the Git
effect makes sure that we haven't fallen out of sync.
Repositories are now configurable! You can choose to point at a fork of the registry, for example. Everything else will work the same.
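As a rough illustration of the split between an effect and its handler, here is a minimal TypeScript sketch (the registry itself is PureScript, and none of this is its real API). RegistryEffect, GitEffect, fakeGit, gitBackedRegistry, the "Pulled ..." log line, and the in-memory Map standing in for a checkout are all assumptions made for the example; only readMetadata and the "Reading metadata for deku" message come from the description above.

```typescript
// Hypothetical names throughout: this is not the registry's actual API.
type PackageMetadata = { location: string; published: string[] };

// The "effect": the operations user code is allowed to request.
interface RegistryEffect {
  readMetadata(name: string): PackageMetadata | undefined;
  writeMetadata(name: string, metadata: PackageMetadata): void;
}

// A stand-in for the Git effect; it only tracks whether the checkout is in sync.
interface GitEffect {
  pull(): void;
  isUpToDate(): boolean;
}

// A fake Git handler for a configurable repository (for example, a fork).
function fakeGit(repo: string, log: string[]): GitEffect {
  let synced = false;
  return {
    pull() {
      synced = true;
      log.push(`Pulled ${repo}`);
    },
    isUpToDate: () => synced,
  };
}

// One possible Registry handler: it syncs via Git before touching local files,
// modelled here as an in-memory Map rather than a real checkout.
function gitBackedRegistry(
  git: GitEffect,
  files: Map<string, PackageMetadata>,
  log: string[]
): RegistryEffect {
  const ensureSynced = (): void => {
    if (!git.isUpToDate()) git.pull();
  };
  return {
    readMetadata(name) {
      ensureSynced();
      log.push(`Reading metadata for ${name}`);
      return files.get(name);
    },
    writeMetadata(name, metadata) {
      ensureSynced();
      files.set(name, metadata);
    },
  };
}

// User code depends only on the effect interface, not on how it is handled.
function publishedVersions(registry: RegistryEffect, name: string): string[] {
  return registry.readMetadata(name)?.published ?? [];
}

const demoLog: string[] = [];
const demoFiles = new Map<string, PackageMetadata>([
  ["deku", { location: "github:example/deku", published: ["0.1.0"] }],
]);
const demoRegistry = gitBackedRegistry(fakeGit("my-fork/registry", demoLog), demoFiles, demoLog);
const demoVersions = publishedVersions(demoRegistry, "deku");
```

Swapping gitBackedRegistry for a read-only or purely in-memory handler would leave publishedVersions unchanged, which is the point of the abstraction.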
TypedCache
Next, we
Read cache entry for AllMetadata in memory
. This comes from the TypedCache
effect. This is the most intimidating of the new effects, because I've made the cache well-typed: when you get
or put
a value, you receive the actual type you wanted, not JSON that you must decode.
A typed cache is super important for a few reasons. First, our use of cache keys was totally uncontrolled and we could easily run into collisions; this is way harder to do when the values are typed. Second, you can't cache a
ManifestIndex
or other large structure in memory when the cache is untyped, because you have to serialize it on write and deserialize it on read, and that's too slow. (Or you can coerce it, which is unsafe.) Third, committing to a single serialization format means the cache isn't flexible enough to store most things as JSON but other things as, say, a Buffer
(e.g. a tarball) on the file system.
The
TypedCache
effect supports many independent caches, which means individual caches can be handled separately and code outside the registry proper (like the scripts
directory) can still use the typed cache. For example, the Storage
effect uses the cache to store tarballs on the file system. The Registry
effect, on the other hand, uses the cache to store manifests and metadata in memory. It never stores any cache data on the file system because we rely on the Git repositories as the source of truth instead. Other caches like the Importer
cache use an in-memory cache that falls back to a filesystem cache on cache miss.
Notify
That
[NOTIFY]
message indicates a human-readable message we want to report to users. It's handled by the Notify
effect, and when running scripts like the legacy importer it just writes to the console. An API could push a message over a websocket, and on GitHub we comment this message on the relevant issue.
GitHub
Towards the end we have
Listing tags for purescript/package-sets
. This comes from the GitHub
effect, which is for interacting with the GitHub API. We really need to start making use of conditional requests with this effect. I did more digging, and the If-Last-Modified
header doesn't work, but the If-None-Match
header given an etag
does work. One of us ought to modify the Octokit.js
bindings so that we can read the headers from responses.
The small snippet of the logs doesn't cover all the effects. Here are the others:
Storage
The
Storage
effect manages interactions with the storage backend, such as uploading, downloading, and deleting package tarballs. It has handlers for read-only situations (like the legacy importer dry-run) and for the S3 backend.
Pursuit
The
Pursuit
effect manages interactions with Pursuit, like publishing packages. It's a tiny effect for now, but it will grow when we replace purs publish
with our own implementation, and I think there are some wonderful opportunities to rewrite Pursuit in PureScript and integrate it more closely with the registry.
PackageSets
The
PackageSets
effect is for upgrading the package sets. It provides upgradeAtomic
and upgradeSequential
.
Env
The
Env
effect is for providing Reader
environments. This is only lightly used, mainly to provide access to things like the pacchettibotti keys or the GitHub event data.
OK, so we've got a set of effects, and handlers for those effects. Why have so many files changed? First, when you replace the foundation of the app it's going to affect everything. Second, now that we can be precise with effects, a lot of massive functions in the app (like the API) are much better off refactored to be more modular. Finally, we have historically assumed that the whole registry is oriented around GitHub events, and that is no longer true. We can't just assume in the middle of our API pipeline that we can look up if a
username
is on the trustees team to re-sign a payload, for example.
That means we have a big change: the
API
module no longer anticipates that it's being called as part of a GitHub event. Instead, the main operations (publish
, etc.) know they can be called from a script (like the legacy importer), from a payload coming over the wire in the upcoming HTTP API, or from a GitHub event. Other operations (packageSetUpdate
) know they must come from a GitHub event because we rely on GitHub for authentication.
Since the API module no longer presupposes GitHub events, we now have a
Main
module. This module handles all the GitHub stuff: it decodes the operation from the GITHUB_EVENT_DATA
, re-signs the payload if the GitHub event is an authenticated event sent by a trustee, and does other GitHub-specific initialization before passing things off to the API module.
This lays the groundwork for the HTTP API, which no longer has to work around GitHub assumptions.
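The division of labor between the Main and API modules might be sketched as follows, in TypeScript rather than the registry's PureScript. Only the operation names publish and packageSetUpdate are taken from the description; RegistryOperation, OperationSource, runOperation, mainFromGitHub, and the shape of the event JSON are hypothetical, and trustee re-signing is left out.

```typescript
// Hypothetical types: only `publish` and `packageSetUpdate` come from the text.
type RegistryOperation =
  | { tag: "publish"; name: string; version: string }
  | { tag: "packageSetUpdate"; packages: Record<string, string> };

// Where an operation came from. Only GitHub events carry an authenticated user.
type OperationSource =
  | { tag: "script" }
  | { tag: "http" }
  | { tag: "github"; username: string };

// The API layer: `publish` accepts any source, while `packageSetUpdate`
// insists on a GitHub source because that is where authentication comes from.
function runOperation(op: RegistryOperation, source: OperationSource): string {
  if (op.tag === "publish") {
    return `published ${op.name}@${op.version} via ${source.tag}`;
  }
  if (source.tag !== "github") {
    throw new Error("packageSetUpdate must come from a GitHub event");
  }
  return `package set updated by ${source.username}`;
}

// The Main layer: decode the GitHub event, then hand off to the API.
// The event shape used here is invented for illustration.
function mainFromGitHub(eventData: string): string {
  const decoded = JSON.parse(eventData) as {
    sender: string;
    operation: RegistryOperation;
  };
  return runOperation(decoded.operation, { tag: "github", username: decoded.sender });
}
```

A script such as the legacy importer would call runOperation directly with a script source, while only the GitHub entry point needs to know how events are encoded.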
Next Steps
Once we merge this pull request I believe we only need to do one thing: The
.cache
dir is now part of the scratch
directory where all other files get put as part of the registry processes. We need to update the GitHub workflows on the registry
repo to make sure that directory is being cached.
Otherwise, when this merges we just need to run the legacy import and make sure everything works as expected. I've tested the code thoroughly and I feel confident, but you never know when you actually deploy whether some small configuration option is off and needs to be tweaked.
I'll leave some comments in the code to aid review.