purescript / registry-dev

Development work related to the PureScript Registry
https://github.com/purescript/registry

Proposal: authenticated publish endpoint #672

Closed thomashoneyman closed 9 months ago

thomashoneyman commented 9 months ago

The registry currently offers a single endpoint for publishing packages, /publish. Anyone can send a package name, location, and ref to publish the package. The registry will fetch the source, validate it, publish it, and push the docs to Pursuit.

When the registry fetches the source it will also pick up the manifest file for the package. The registry supports a purs.json file, and it also has first-class support for a spago.yaml file so that Spago users don't have to generate and commit a purs.json file.

It's important that users don't have to commit an extra file because, as we've seen with the old spago.dhall / bower.json pairs, the file that is generated for publishing almost inevitably falls out of sync with the primary package manager file. In the publishing context this means either publishing with an unexpected, out-of-date manifest, or failing altogether and having to tag a new version just to fix the out-of-sync manifest.

We don't want the registry to have to special case a bunch of different potential package manager file formats. It was a stretch to even special-case the spago.yaml format. But I think it's a shame that package managers other than Spago will be stuck having to generate and commit a purs.json file as part of publishing. Instead, I'd like to propose an alternative: an authenticated /publish endpoint which allows you to send a manifest in the publish payload, rather than requiring it exist in the source code.

Authenticated Publish

The authenticated publish endpoint would be the same as the normal endpoint, with a new field: manifest.

type AuthPublishData =
  { name :: PackageName
  , location :: Maybe Location
  , ref :: String
  , compiler :: Maybe Version
  , resolutions :: Maybe (Map PackageName Version)
  -- NEW
  , manifest :: Manifest
  }

The purpose of this endpoint is for package managers to publish packages without having to commit a purs.json file and without the registry having to special-case their particular manifest format. Instead, so long as they can generate a proper Manifest, the registry can accept their code.
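To make the payload concrete, here is a sketch of what a client might send. The JSON encoding mirrors the AuthPublishData record above, but the field encodings, version bounds, and endpoint URL shown are illustrative assumptions, not the registry's actual wire format.

```shell
# Hypothetical JSON encoding of AuthPublishData; field names mirror the
# record above, but the exact wire format is an assumption.
cat > publish-payload.json <<'EOF'
{
  "name": "my-package",
  "location": { "githubOwner": "example", "githubRepo": "purescript-my-package" },
  "ref": "v1.2.3",
  "compiler": "0.15.10",
  "resolutions": null,
  "manifest": {
    "name": "my-package",
    "version": "1.2.3",
    "license": "MIT",
    "dependencies": { "prelude": ">=6.0.0 <7.0.0" }
  }
}
EOF

# A client would then POST this (signed) payload to the registry, e.g.:
#   curl -X POST https://registry.example.com/api/v1/publish \
#     --header 'Content-Type: application/json' \
#     --data @publish-payload.json
```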

The endpoint has to be authenticated because we consider the non-authenticated API untrusted input, and we want users to have full control over the contents of their manifests. An arbitrary user shouldn't be able to just send a manifest to the registry and publish it.

Internal Changes

The new endpoint requires almost no changes to the registry. We would simply authenticate, then go through the normal publish workflow, except that if we already have a Manifest we don't read one from disk (or we throw an exception if you sent a manifest, one is also committed in the source, and the two don't match), similar to how we branch for spago.yaml files.

Considerations

The major wrinkle to this proposal is that we only support a single authentication mechanism — SSH signatures, with public keys recorded in the owners field of the manifest and then copied to metadata. This means a package can't be registered for the first time via the authenticated publish endpoint, since we can't authenticate the publisher until we have a manifest published the ordinary way.
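As a sketch of what that SSH-signature mechanism looks like in practice, the flow below signs and verifies a payload with `ssh-keygen -Y sign` / `-Y verify` using a throwaway key. The namespace string and signer identity here are illustrative assumptions, not the registry's actual values.

```shell
set -e
workdir=$(mktemp -d) && cd "$workdir"

# Throwaway key pair; in practice the public key would be the one listed
# in the manifest's `owners` field and copied to the package metadata.
ssh-keygen -t ed25519 -N "" -f owner_key -q

# The payload the package manager wants to publish.
printf '{"name":"my-package","ref":"v1.0.0"}' > payload.json

# Sign: produces payload.json.sig (the namespace is an assumption).
ssh-keygen -Y sign -f owner_key -n purescript-registry payload.json

# Verify, as the registry would, against an allowed_signers file built
# from the recorded owner keys ("principal key-type key-data" per line).
awk '{ print "owner@example.com " $1 " " $2 }' owner_key.pub > allowed_signers
ssh-keygen -Y verify -f allowed_signers -I owner@example.com \
  -n purescript-registry -s payload.json.sig < payload.json
```

Verification fails (non-zero exit) if the payload was altered after signing or if the key is not among the recorded owners, which is exactly the property the authenticated endpoint needs.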

Still, this reduces the purs.json burden to just your first-time publish, after which the file can be deleted. Alternatively, we can consider relaxing purs.json parsing so that, when the authenticated publish endpoint is used, we read only the owners key for verification and defer to the provided manifest otherwise. This means a purs.json file still has to be maintained, but only for the sake of listing owners, which is a rare change. (I'm in favor of this.)

For GitHub users specifically we can consider even more automatic methods; for example, all GitHub users have their public SSH keys at their username + .keys, such as mine: https://github.com/thomashoneyman.keys; for repositories owned by a particular user we could look at their public SSH keys to verify the signature without needing anything to be committed. This would strictly be in addition to other methods, not in place of them, since we are platform-agnostic.
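A sketch of how those published keys could be consumed: the `.keys` endpoint serves one public key per line, which maps almost directly onto the allowed_signers format that ssh-keygen's verification expects. The fetch is shown only as a comment so the sketch runs offline, and the principal name is an illustrative assumption.

```shell
# In a real client the key list would come from GitHub, e.g.:
#   curl -s https://github.com/thomashoneyman.keys > user.keys
# Here we fake one key so the sketch runs offline.
cat > user.keys <<'EOF'
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPlaceholderKeyDataPlaceholderKeyData
EOF

# allowed_signers format is "principal key-type key-data" per line.
awk '{ print "thomashoneyman " $1 " " $2 }' user.keys > allowed_signers
cat allowed_signers
```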

artemisSystem commented 9 months ago

Ooh! As someone who is maybe interested in making her own package manager someday, this seems like a pretty good solution! And it can also be improved later if there are ideas on how to make it smoother, for example for first publishes.

f-f commented 9 months ago

We don't want the registry to have to special case a bunch of different potential package manager file formats. It was a stretch to even special-case the spago.yaml format. But I think it's a shame that package managers other than Spago will be stuck having to generate and commit a purs.json file as part of publishing. Instead, I'd like to propose an alternative: an authenticated /publish endpoint which allows you to send a manifest in the publish payload, rather than requiring it exist in the source code.

While this may at first seem like a good idea, it's exactly the kind of situation that we're running away from with this whole registry thing. Before this, the only way to "publish" a package in PureScript so that you'd get names resolved and docs hosted was to go through purs publish, which would craft a specific, out-of-band (i.e. not versioned, and hard to come up with) payload. We're still trying to disentangle ourselves from that, and yet this proposal wants to spec the ability to publish things with more out-of-band payloads.

The main problem with out-of-band payloads is that if you ever need to process all the packages in a standard way (as we are doing here for example, and we might need to do it again), then you don't have that information in the source - as is the case so far with the Bowerfiles, old spago files, and so on - so it's easy to lose these payloads.

I don't think it was a stretch to standardise the spago.yaml, and I instead think that's the direction we should go if we want to add support for more config files: let's make the registry this standardised pipeline where we stick all the decoders from config format X to purs.json (this is in line with your proposal to make the spago.yaml parsing independent from the Spago codebase, which is good in this direction). So as long as the registry can read it, and it's versioned in the source, then it's good to go.

The purs.json is by necessity a compromise format - since it's meant to be the standard way for the registry to get info for publishing, it has to be minimal, so it's unusable for package managers (we figured this one out about two years ago, when the first draft of the registry had a much richer format). The reason we required it to be included in the source is so that one would have a clear format to publish things in, which we didn't have before. We can extend this to be one step removed, so instead of "you have to include a purs.json in the source" we can do "the registry has a decoder to figure out a purs.json from your source". We arguably already have that - with the whole legacy import - but it's not clearly specced yet.

artemisSystem commented 9 months ago

What about generating a purs.json from the manifest on publish, and adding it to the tarball? This could be done with spago.yaml packages also. That way, every package in the registry would have a purs.json in its published source. Or is there a problem with that approach?

The reason I suggest this is that, as a package manager author, it would be an annoying barrier to have to be "blessed" by the registry in order to publish packages without maintaining a purs.json.

f-f commented 9 months ago

What about generating a purs.json from the manifest on publish, and adding it to the tarball? This could be done with spago.yaml packages also. That way, every package in the registry would have a purs.json in its published source. Or is there a problem with that approach?

This is where things stand at the moment, and the package manager is responsible for making sure the purs.json is in version control.

Though, as Thomas noted in the first message, it's easy to have the two formats (the custom one for the package manager and the purs.json) get out of sync, which can be inconvenient. So we ended up special-casing the spago.yaml (as we already special-case the bower.json and the spago.dhall).

For these special-cased packages we always include a purs.json in the tarball, because we always generate one. I am proposing that we just open up the special casing to whatever format we can get a decoder for.
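What "always include a purs.json in the tarball" amounts to can be sketched as below; the paths and manifest contents are illustrative, and the registry's actual packaging step differs in detail.

```shell
# Sketch: package sources plus a generated purs.json, bundled together.
set -e
dir=$(mktemp -d) && cd "$dir"
mkdir -p my-package/src
echo 'module Main where' > my-package/src/Main.purs

# The registry generates the manifest from whatever config it decoded
# (here we just write it directly)...
printf '{"name":"my-package","version":"1.0.0","license":"MIT","dependencies":{}}' \
  > my-package/purs.json

# ...and the published tarball always carries it alongside the sources.
tar czf my-package-1.0.0.tar.gz my-package
tar tzf my-package-1.0.0.tar.gz
```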

f-f commented 9 months ago

Re this:

it would be an annoying barrier to have to be "blessed" by the registry

The registry has a whole set of very annoying checks for getting packages in - having to merge a decoder for your config file into the registry codebase is a pretty mild one, and it's not even compulsory (since you could always generate a purs.json instead).

f-f commented 9 months ago

An additional note on the proposal:

having to tag a new version just to fix the out-of-sync manifest

We are dealing with tags only temporarily until we fix/bypass purs publish - the registry is able to deal with single commits, which definitely lowers the annoyance factor of "I forgot to push the purs.json". The package manager could in fact automate this. If you already tagged a git version for other reasons then you don't have to redo the tag just to include the purs.json, since the registry won't need a tag at all.

artemisSystem commented 9 months ago

What about generating a purs.json from the manifest on publish, and adding it to the tarball? This could be done with spago.yaml packages also. That way, every package in the registry would have a purs.json in its published source. Or is there a problem with that approach?

This is where things stand at the moment, and the package manager is responsible for making sure the purs.json is in version control.

Oh, sorry, I wasn't clear. My suggestion was that any package manager can send a manifest with the payload, and the registry would generate a purs.json from that and put it in the tarball. The source on GitHub or wherever would not have a purs.json. That way, it could never get out of sync with the package manager's manifest, as long as a correct manifest for the registry is generated by the package manager on publish.

f-f commented 9 months ago

the registry would generate a purs.json from that, and put it in the tarball.

This also implies that the decoders for the config format are running in the Registry, which is exactly what I am proposing, with the additional requirement that the manifest is included in the source instead of the API payload.

I don't quite understand why passing things out-of-band seems so appealing - if we want to pass things out-of-band we might as well just allow people to upload tarballs themselves. But avoiding exactly that has been one of the main principles when designing the Registry - if we want to change this principle we are going to need a better reason than "it's mildly inconvenient to generate the purs.json file and commit it"

thomashoneyman commented 9 months ago

the registry would generate a purs.json from that, and put it in the tarball.

This also implies that the decoders for the config format are running in the Registry, which is exactly what I am proposing, with the additional requirement that the manifest is included in the source instead of the API payload.

It's the opposite: passing a Manifest via the API means the registry doesn't need to have decoders for any config formats. We take the manifest and write it as a purs.json file in the tarball without any special knowledge required.

I don't quite understand why passing things out-of-band seems so appealing - if we want to pass things out-of-band we might as well just allow people to upload tarballs themselves.

I don't think this is quite fair; passing a manifest is much different than accepting an arbitrary tarball for upload. We still run exhaustive checks on the source code, prune files, detect licenses, and so on.

The appeal for passing a manifest in particular is to avoid special-casing package managers. I can envision scenarios where it's not convenient to write a decoder — even spago.yaml requires the library to bring in a JS yaml library we wouldn't otherwise use, and other package managers can use formats that are even harder for the registry to deal with, whether it's a TOML file or Dhall or something Nix-based. Or where a registry decoder is insufficient to produce an accurate manifest, like the many problems we've had working with spago.dhall files. We could remove the special-casing for spago.yaml, even.

The purs.json is by necessity a compromise format - since it's meant to be the standard way for the registry to get info for publishing, it has to be minimal, so it's unusable for package managers (we figured this one out about two years ago, where the first draft of the registry had a much richer format). The reason why we required it to be included in the source is so that one would have a clear format to publish things in, which we didn't have before. We can extend this to be one step removed, so instead of "you have to include a purs.json in the source" we can do "the registry has a decoder to figure out a purs.json from your source". We arguably already have that - with the whole legacy import - but it's not clearly specced yet.

I don't think of a Manifest (purs.json) as a compromise format so much as a universal format — it's the file the registry guarantees will be in every tarball to describe the package and its dependencies. It's simple enough for all package managers to be able to read. Accordingly, we already essentially require all package managers to be able to work with the Manifest type (otherwise they can't use the registry).

Instead of the registry having to know how to decode everyone's custom formats, I think we should have package managers have to work with the registry format. Instead of "you have to include a purs.json in the source or the registry has implemented a decoder for your format" we say "you have to include a purs.json in the source or send it to the registry in a signed payload." Then the registry is only committed to supporting the Manifest type, and a package manager's job when integrating with the registry is to support it as well.

I've spent more time on the legacy importer than anything else in the registry and honestly it's terrible dealing with the quirks of the few formats we're parsing into Manifests. In some cases it's not even possible to know we've made a correct Manifest and we just do a best guess. I would much rather not have to deal with custom formats.

The main problem with out-of-band payloads is that if you ever need to process all the packages in a standard way (as we are doing here for example, we might need to do it again), then you don't have that information in the source - as it's the case so far with the Bowerfiles, old spago files, and so on - so it's easy to lose these payloads.

I think this is a good criticism of this proposal — if we can't derive a manifest from the package source, then things like the legacy importer don't work anymore. You would have to publish that package yourself. However, I don't think it's a deal breaker for a few reasons:

  1. We do still have the resulting purs.json we get from any published package because it's in the tarball (and registry index). In a mass import, if we want to preserve packages that used the authenticated publish endpoint then we would need to look up their purs.json file in the registry index or tarball instead of source code.
  2. The legacy importer is supposed to be turned off when the registry launches (whenever that is — presumably whenever Spago leaves alpha and can be used to publish packages). We'll need to support custom formats in the importer for as long as it lasts, and I'd be fine implementing decoders there, but I don't think that should guide our design choices for the registry long-term.

f-f commented 9 months ago

The appeal for passing a manifest in particular is to avoid special-casing package managers. I can envision scenarios where it's not convenient to write a decoder — even spago.yaml requires the library to bring in a JS yaml library we wouldn't otherwise use, and other package managers can use formats that are even harder for the registry to deal with, whether it's a TOML file or Dhall or something Nix-based. Or where a registry decoder is insufficient to produce an accurate manifest, like the many problems we've had working with spago.dhall files.

I've spent more time on the legacy importer than anything else in the registry and honestly it's terrible dealing with the quirks of the few formats we're parsing into Manifests. In some cases it's not even possible to know we've made a correct Manifest and we just do a best guess. I would much rather not have to deal with custom formats.

I understand that this effort was hard and painful and a lot of work, but this sounds like we want to make it easier for the registry codebase at the expense of everyone else, which I do not support. If we do not enforce a purs.json in every repo, or do not ship decoders in the registry, then there is no common format in the ecosystem, and package manager writers are left on their own. If you want to download and use a PureScript codebase but do not want to rely on the registry tarballs (this approach is pretty common in nixpkgs and other ecosystems, where everything is built from scratch, or there is a serious attempt to do so), having the assurance that there is a purs.json you can decode to get dependency information is useful, just as it's useful to have a decoder for whatever custom format the hypothetical NewPackageManagerX for PureScript might use.

The perspective that I have is that the Erlang ecosystem is quite fragmented when it comes to build systems, and trying to put together a custom build for your dependencies is a truly horrendous task (you can tell I've been there 😄), since packages could have a rebar3 config, or an erlang.mk config, or a bazel build file, or just a makefile, or whatever else really. None of these are clearly specced so any attempt to decode them is painful at best.

I would like the PureScript ecosystem to look less like the Erlang ecosystem and a bit more like the JS ecosystem, where there's a common config format - the package.json - and multiple package managers that are able to read it, because there's a clear spec and a common base.

thomashoneyman commented 9 months ago

If we do not enforce a purs.json in every repo, or do not ship decoders in the registry, then there is no common format in the ecosystem, and package manager writers are left on their own. If you want to download and use a PureScript codebase but do not want to rely on the registry tarballs (this approach is pretty common in nixpkgs and other ecosystems, where everything is built from scratch, or there is a serious attempt to do so), having the assurance that there is a purs.json you can decode to get dependency information is useful, just as it's useful to have a decoder for whatever custom format the hypothetical NewPackageManagerX for PureScript might use.

OK, I think I was misunderstanding your objections above. It sounds like you're specifically concerned that package source code won't have a standard manifest file, so if you're not getting packages from the registry then you aren't guaranteed you'll be able to read a manifest from them. You want the registry to require a standard manifest in the source code as an encouragement to the ecosystem to use one in general.

I'm sympathetic to this, but I don't think it's possible. Before I describe why, here are some considerations I have in mind:

  1. We already support two formats explicitly — spago.yaml or purs.json.
  2. No one will use only a purs.json file because it's insufficient for package management on its own, so if we require it then every package will have two manifests, with the purs.json one generated.
  3. It's a headache to generate and commit a separate file when publishing your package, so it will inevitably get out of sync and become incorrect (as we've seen with spago/bower). Requiring a purs.json file means inflicting this on every package.
  4. If we don't require a purs.json file then you have no guarantee that packages in the wider ecosystem use a standard manifest file in their source code. (We've already relaxed this to a purs.json file or a spago.yaml file; and, of course, if a package version wasn't published to the registry then there is no guarantee whatsoever of what manifests it could have in its source code. But I don't think we need to be concerned about non-registry packages.)
  5. If a package version was published to the registry you have a guarantee that a purs.json file is in the tarball and registry-index. It also has either a spago.yaml, purs.json, or both in its code wherever it was fetched from.

Two other points:

  1. I don't see what decoders in the registry buy you unless you're saying that the registry library should support decoders for arbitrary package managers and their file formats, and I certainly don't want to take on that commitment. We don't even do that for spago.yaml.
  2. I don't think ecosystems like nixpkgs are a problem here at all. The tarballs contain the package source code, so you're still building from source if you use registry tarballs. And if you're fetching source repositories via Nix then you either a) already know your resolved tree and don't need to care about the manifest files or b) are doing something impure and un-Nix-like.

The perspective that I have is that the Erlang ecosystem is quite fragmented when it comes to build systems, and trying to put together a custom build for your dependencies is a truly horrendous task (you can tell I've been there 😄), since packages could have a rebar3 config, or an erlang.mk config, or a bazel build file, or just a makefile, or whatever else really. None of these are clearly specced so any attempt to decode them is painful at best.

I would like the PureScript ecosystem to look less like the Erlang ecosystem and a bit more like the JS ecosystem, where there's a common config format - the package.json - and multiple package managers that are able to read these, because there's a clear spec and a common base.

The only way the registry can unify the ecosystem behind a single format is to standardize on a single format. That format cannot be purs.json — it has to be, like package.json, sufficient for any package manager to use. We long ago decided not to do this and to introduce purs.json as a minimal shared format for specifying at least your dependencies, and to leave others free to design richer formats for their package managers (like spago.yaml), and to leave the design space open to introduce package managers that do things like install native dependencies along with PureScript ones.

The fight to have a standard manifest format feels like it has already been lost. We already have no guarantee that even a purs.json file will be present in package source code. But we have always guaranteed that registry tarballs have a purs.json file in them that all package managers can use, and I think this is a sufficient guarantee for the registry to make.

This authenticated endpoint proposal just lifts the source code restriction for spago.yaml / purs.json for publishing to the registry. You are always guaranteed a purs.json file in the resulting tarball.

f-f commented 9 months ago

  1. We already support two formats explicitly — spago.yaml or purs.json

I was against explicitly making an exception for spago.yaml files - if that's not good anymore I'm happy to only accept purs.json files in the registry, or explicitly welcome integrations with other formats inside the registry

  2. No one will use only a purs.json file because it's insufficient for package management on its own, so if we require it then every package will have two manifests, with the purs.json one generated.

I am happy to standardize on a purs.yaml instead that is sufficient for package management, à la package.json.

  3. It's a headache to generate and commit a separate file when publishing your package, so it will inevitably get out of sync and become incorrect (as we've seen with spago/bower). Requiring a purs.json file means inflicting this on every package.

I disagree - if the package manager automates this it's not a headache. Stack does this with cabal files.

  4. If we don't require a purs.json file then you have no guarantee that packages in the wider ecosystem use a standard manifest file in their source code.

I am proposing that we either make the purs.json mandatory, or provide decoders (in the form of merging them into the registry) for all these formats. Either way anyone can decode whatever configs we accept.

  5. If a package version was published to the registry you have a guarantee that a purs.json file is in the tarball and registry-index. It also has either a spago.yaml, purs.json, or both in its code wherever it was fetched from.

This does not solve the problem I mentioned above of package managers that bypass our tarball storage.

  1. I don't see what decoders in the registry buy you unless you're saying that the registry library should support decoders for arbitrary package managers and their file formats, and I certainly don't want to take on that commitment. We don't even do that for spago.yaml.

It provides a spec (in the form of code that you can run) of all of the formats that we accept in the registry. We don't need any commitment, any additional spec is the responsibility of the package manager authors (in the same way that I provided the decoder for the spago.yaml files)

  2. I don't think ecosystems like nixpkgs are a problem here at all. The tarballs contain the package source code, so you're still building from source if you use registry tarballs. And if you're fetching source repositories via Nix then you either a) already know your resolved tree and don't need to care about the manifest files or b) are doing something impure and un-Nix-like

My point here is that the tarballs are not the upstream. They are an artifact that we produce as a product of the publishing process. Some ecosystems - again, nixpkgs does this, but also e.g. Gentoo - like to fetch things from their upstream. In the same way that a Java package manager might fetch the sources from the upstream rather than from Maven Central, and so on.

thomashoneyman commented 9 months ago

OK. It sounds like we have some different priorities.

Does my paraphrase of your comments sound right to you?

My point here is that the tarballs are not the upstream.

Point taken, that's true.

It provides a spec (in the form of code that you can run) of all of the formats that we accept in the registry.

I'll have to think more about this, but perhaps it's a nice middle ground.

I am happy to standardize on a purs.yaml instead that is sufficient for package management, à la package.json.

I would be happy with this too, except that I don't feel confident specifying it 😆

JordanMartinez commented 9 months ago

  3. It's a headache to generate and commit a separate file when publishing your package, so it will inevitably get out of sync and become incorrect (as we've seen with spago/bower). Requiring a purs.json file means inflicting this on every package.

It sounds like this is the crux of Thomas' concern, and it sounds like there are two issues. First, that it's a "headache" to generate/commit a separate file. Second, that the resulting file goes out-of-sync with the main package manager's config file (e.g. spago.yaml).

So, first, why is this a headache?

AFAICT, the only headache I've found is git tagging prematurely, and only realizing that after the tag has already been pushed to the remote. I think we just need better checks here to prevent that from happening. For example, in https://github.com/purescript/spago/pull/1121, we do report the need to have a tag checked out. However, the new error says to do that once all other errors are resolved. It's not a hard guarantee, but it's better than nothing. That being said, if someone implemented a different package manager, there's no guarantee that they would implement such checks. Moreover, there could always be a check we forgot to do or implemented incorrectly. But even so, if this is the headache, I think it's a solvable problem.

Second, there's an alternative way to store purs.json file's content in source code without storing it as a file in source code. Correct me if I'm wrong, but a purs.json file indicates all the publish info needed for a repo at a particular commit. So, what if we stored this information in a git tag's annotation? In this way, there's no "out-of-sync" issue because that information only exists at a specific point where one wanted to publish the state of the repo at that commit. If the package manager was in control of the git tag, the package manager could do the following:

When the Registry runs, it just looks for the purs.json file content in the tag's annotation and we're done. Moreover, the only people who can add such a tag and annotation are those who have write-access to the repo.
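A minimal sketch of this idea (repo layout, tag names, and manifest contents are all illustrative): an annotated tag can carry the manifest in its annotation, and the annotation can be read back without the file ever being committed to the working tree.

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
git -c user.email=dev@example.com -c user.name=dev \
  commit -q --allow-empty -m "release prep"

# Write the manifest content into the tag annotation, not the work tree.
printf '{"name":"my-package","version":"1.0.0","license":"MIT","dependencies":{}}' \
  > ../manifest.json
git -c user.email=dev@example.com -c user.name=dev \
  tag -a v1.0.0 -F ../manifest.json

# The registry (or anyone with the git layer) can read it back:
git tag -l --format='%(contents)' v1.0.0
```

Note that the annotation lives only in git metadata: a plain source archive of the tagged commit (such as a tarball download) does not include it.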

f-f commented 9 months ago

Storing the purs.json in git's metadata will not work if one just fetches a compressed archive of the repo at a certain commit, so it's not really storing things together with the source. As the name implies, it's metadata, not data, and disappears the moment one removes the git layer, which we in fact do ourselves here when fetching from github. And we are going to support more locations other than GitHub, potentially other VCSs, etc.

thomashoneyman commented 9 months ago

Looping back around to our earlier conversation — I am OK with supporting multiple package manager config formats by implementing decoders for them in the registry, but only if we reserve the right to consider some formats too much work and not provide an integration for them. (Those package managers can still generate and commit a purs.json file.)

I still think it would be easier for the registry to just accept a manifest in source or over the API but I understand the objection about manifests potentially not existing in source code.

So, first, why is this a headache?

I've reflected more on why generating and committing a specific file for publishing is so annoying, and your comments here, and I agree it's not as big a deal as I've been making it out to be. The spago/bower combo is terrible because nothing checks the bower file, but of course when generating and committing a purs.json file the registry will reject it if it's invalid — there's some validation in place that the file isn't bogus. And spago provides plenty of pre-publish checks as well that Bower never did. Perhaps it's not such a headache after all.

That said — I still don't like having to generate and commit a second essentially duplicated manifest, so I would rather implement integrations in the registry for package manager config formats rather than require everyone to commit a purs.json file.

thomashoneyman commented 9 months ago

We can continue discussing this, but since the authenticated publish endpoint won't be happening I'll go ahead and close it.