Split up source metadata into multiple files

bgamari commented 11 months ago

Currently ghcup metadata maintenance is one of the more manual (and consequently error-prone) aspects of cutting a GHC release. Specifically, it involves manually adding a snippet to the 1000+ LoC metadata file and then carefully editing the YAML anchors of said snippet to ensure that they are globally unique.

It seems to me that this process could be streamlined by splitting the metadata into individual files which can be combined into the final monolithic metadata file by CI (e.g. when merging to master). For instance, one might imagine this repository consisting of a directory structure like:

metadata
  + ghc
  |    + 9.4.5.yaml
  |    + 9.4.5.yaml.asc
  |    + 9.6.3.yaml
  |    + 9.6.3.yaml.asc
  + cabal
  |    + 3.2.0.0.yaml
  |    + 3.2.0.0.yaml.asc
...

This, of course, poses the problem of ensuring that the final metadata is signed. I can see at least three approaches that might be used here:

Approach A: Teach CI to verify each of the signatures of the individual per-version metadata files and apply its own signature (using its own key) to the final metadata.
Approach B: Teach ghcup itself to distribute the individual metadata files (e.g. via a cabal-style tar archive) and validate each individually.
Approach C: Rework the signature scheme to instead sign a canonicalized representation of the per-version metadata (e.g. teach the verification scheme to render the metadata as Canonical JSON and verify the signature with respect to that representation)
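
To make Approach C more concrete, the following is a minimal sketch of signing "with respect to a canonical representation": two renderings of the same per-version entry produce the same bytes, so a signature over those bytes would survive reformatting or merging. It uses sorted keys and compact separators as a simplified stand-in for a real Canonical JSON encoder, and the entry fields (`version`, `tags`, `uri`) are hypothetical, not the actual ghcup schema. The actual signing step (e.g. with GPG) is left out.

```python
import hashlib
import json


def canonical_bytes(metadata: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace.

    Simplified stand-in for a real Canonical JSON encoder (assumption).
    """
    return json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def canonical_digest(metadata: dict) -> str:
    """Digest that a signature would be computed over."""
    return hashlib.sha256(canonical_bytes(metadata)).hexdigest()


# Hypothetical per-version entries: same content, different key order.
a = {"version": "9.6.3", "tags": ["Latest"],
     "uri": "https://example.invalid/ghc-9.6.3.tar.xz"}
b = {"uri": "https://example.invalid/ghc-9.6.3.tar.xz",
     "tags": ["Latest"], "version": "9.6.3"}

same = canonical_digest(a) == canonical_digest(b)  # True: key order is irrelevant
```

Under this scheme the verifier would re-render each entry it reads and check the signature against the re-rendered bytes, rather than against the file as it appears on disk.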

hasufell commented 11 months ago

Approach A: Teach CI to verify each of the signatures of the individual per-version metadata files and apply its own signature (using its own key) to the final metadata.

I don't believe in machine keys, especially on Github. They are not secure.

Approach B: Teach ghcup itself to distribute the individual metadata files (e.g. via a cabal-style tar archive) and validate each individually.

That's possible, but will make it more complicated from the code perspective. It may also regress the speed, which I have been monitoring extensively (timing of running ghcup list).

Approach C: Rework the signature scheme to instead sign a canonicalized representation of the per-version metadata (e.g. teach the verification scheme to render the metadata as Canonical JSON and verify the signature with respect to that representation)

I'm not sure I understand. Will GHCup download multiple yaml files with this? CI will combine them?

Generally, I'm not a fan of complicating the format to make the workflow easier.

The metadata is a shared format. HLS and GHC are connected, stack and GHC are connected.

I also don't see how using separate files will improve the workflow exactly.

chreekat commented 11 months ago

Currently ghcup metadata maintenance is one of the more manual (and consequently error-prone) aspects of cutting a GHC release. Specifically, it involves manually adding a snippet to the 1000+ LoC metadata file and then carefully editing the YAML anchors of said snippet to ensure that they are globally unique.

This problem sounds like it would be straightforward to automate.

chreekat commented 11 months ago

But maybe I'm being naive? Is there something that makes this different from other kinds of structured-data editing?

bgamari commented 11 months ago

@chreekat in principle it is no different. However, in practice editing YAML while preserving anchors, which the Metadata makes heavy use of, is not easy. I do not know of a non-event-style YAML parser/printer library which exposes (or even preserves) anchor information. Splitting the metadata is an easy way to side-step this problem.

hasufell commented 11 months ago

However, in practice editing YAML while preserving anchors, which the Metadata makes heavy use of, is not easy.

I don't particularly follow. What exactly is problematic? Do you have an example?

CI checks for broken or duplicate YAML anchors.
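
For readers unfamiliar with what such a check involves, here is a rough sketch of detecting duplicate anchors and aliases that point at no anchor. It is not the check that the repository's CI actually runs; it is a regex-based approximation that does not parse YAML, so anchor-like text inside comments or quoted strings would be misreported, and it does not verify that an anchor is defined before its first use. The sample keys are illustrative only.

```python
import re
from collections import Counter

# An anchor (&name) or alias (*name) token preceded by whitespace.
ANCHOR = re.compile(r"(?:^|\s)&([\w.-]+)")
ALIAS = re.compile(r"(?:^|\s)\*([\w.-]+)")


def check_anchors(text: str) -> tuple[list[str], list[str]]:
    """Return (duplicate anchor names, alias names with no anchor)."""
    anchors = Counter(ANCHOR.findall(text))
    aliases = set(ALIAS.findall(text))
    duplicates = sorted(name for name, n in anchors.items() if n > 1)
    dangling = sorted(aliases - set(anchors))
    return duplicates, dangling


# Illustrative snippet with one copy-paste mistake of each kind.
sample = """\
ghc-946: &ghc-946-64-deb10
  dlUri: https://example.invalid/ghc-9.4.6.tar.xz
ghc-947: &ghc-946-64-deb10
  dlUri: https://example.invalid/ghc-9.4.7.tar.xz
ghc-947-deb11: *ghc-947-64-deb10
"""

duplicates, dangling = check_anchors(sample)
```

Here `duplicates` names the anchor that was copied without being renamed, and `dangling` names the alias whose anchor does not exist.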

chreekat commented 11 months ago

Since it's machine-generated yaml, couldn't we stop producing anchors anyway?

hasufell commented 11 months ago

Since it's machine-generated yaml, couldn't we stop producing anchors anyway?

Since when is it machine generated?

Yaml anchors serve no purpose other than making editing easier. When you change a bindist, you do it in only one place. So I don't even understand the argument here.

The yaml has been maintained by a human (me) for years. If anyone is using scripts or machines to generate yaml snippets, then they're free to do so.

And yet, bindist URLs may be changed/fixed manually in certain cases. The anchors aid with that.

This format was specifically chosen to make editing by hand easier. I don't believe splitting into multiple files, introducing Dhall or various CI generation steps are really improving the workflow.

The metadata has to be reviewed carefully and allow for end users to read it and maintainers to easily modify (by hand).

bgamari commented 11 months ago

I don't particularly follow. What exactly is problematic?

It is yet another manual step which requires care. In general we are trying to eliminate these.

To be clear, I don't argue that we should eliminate the use of anchors. They make the structure of the file clearer. However, even orienting oneself in what will soon be a multi-thousand-line YAML file is becoming difficult; this problem will only continue to grow over time. Manually editing a file that is unbounded in length seems like it will inevitably become burdensome, even if we don't find it so today.

hasufell commented 11 months ago

Manually editing a file that is unbounded in length seems like it will inevitably become burdensome, even if we didn't find it so today.

I believe stack maintainers have been doing that since the beginning with no issues: https://github.com/commercialhaskell/stackage-content/blob/master/stack/stack-setup-2.yaml

@mpilgrem opinions?

mpilgrem commented 11 months ago

I manually edit stack-setup-2.yaml, but that may be an easier job as I only have final releases of GHC to worry about and on a smaller set of platform variants (about 11). A lot of it is 'copy-paste': e.g. adding '9.4.7' starts with a copy-paste of '9.4.6' (EDIT: relying on VS Code search to list all the 9.4.6's), changing '9.4.6' to '9.4.7', deleting size/hashes and then re-populating sizes and hashes. EDIT: I then check and re-check my typing.
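
The copy-paste procedure described above (copy the previous version's block, change the version, clear sizes and hashes) can be sketched as a small text transformation. This is only an illustration under stated assumptions: the key names `content-length`, `sha1` and `sha256` and the `TODO` placeholder are assumptions made for the example, and the function works on plain text rather than parsed YAML.

```python
import re


def bump_snippet(snippet: str, old: str, new: str,
                 stale_keys=("content-length", "sha1", "sha256")) -> str:
    """Copy a release block for a new version and blank out stale values."""
    # Match `old` only as a whole version: not inside 19.4.6, 9.4.61 or 9.4.6.1.
    version = re.compile(rf"(?<!\d)(?<!\d\.){re.escape(old)}(?!\.?\d)")
    bumped = version.sub(lambda _m: new, snippet)
    # Replace the value of every size/hash key so it must be re-populated.
    keys = "|".join(re.escape(k) for k in stale_keys)
    stale = re.compile(rf"^([ \t]*(?:{keys}):).*$", re.MULTILINE)
    return stale.sub(r"\1 TODO", bumped)


# Illustrative block for the previous release.
old_block = (
    "9.4.6:\n"
    "  url: https://example.invalid/ghc-9.4.6-x86_64.tar.xz\n"
    "  content-length: 123\n"
    "  sha256: abc\n"
)
new_block = bump_snippet(old_block, "9.4.6", "9.4.7")
```

The placeholders make any value that was not re-populated easy to spot in review, which is the step that is otherwise checked by eye.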

phadej commented 11 months ago

Silly question: Why not just cat $(ls *.part.yaml | sort) > ghcup-0.0.7.yaml, where the cat spell could be a bit more involved (e.g. auto-indent parts, order them in semantic order etc.)?

EDIT: then @mpilgrem's manual process would start with cp ghc-9.4.6.part.yaml ghc-9.4.7.part.yaml; $EDITOR ghc-9.4.7.part.yaml. IMHO that is a lot more find&replace friendly, as you know you won't edit anything else.
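
A slightly "more involved" version of the cat spell might look like the sketch below. It assumes part files named `<tool>-<version>.part.yaml` in a single directory (the naming used in the comment above; everything else is hypothetical). The point of the version key is that a plain `sort` is lexicographic and would place 9.10.1 before 9.4.7, whereas comparing version components as integers gives the semantic order.

```python
import re
import textwrap
from pathlib import Path


def version_key(path: Path) -> tuple[str, tuple[int, ...]]:
    """Sort key: "ghc-9.10.1.part.yaml" -> ("ghc", (9, 10, 1))."""
    m = re.fullmatch(r"(.+)-(\d+(?:\.\d+)*)\.part\.yaml", path.name)
    if m is None:
        raise ValueError(f"unexpected part file name: {path.name}")
    tool, version = m.groups()
    return tool, tuple(int(p) for p in version.split("."))


def combine(parts_dir: Path, indent: int = 0) -> str:
    """Concatenate all parts in semantic order, optionally indenting them."""
    chunks = []
    for part in sorted(parts_dir.glob("*.part.yaml"), key=version_key):
        text = part.read_text(encoding="utf-8")
        if not text.endswith("\n"):
            text += "\n"  # keep parts from running into each other
        chunks.append(textwrap.indent(text, " " * indent))
    return "".join(chunks)
```

A CI job could then write `combine(Path("parts"))` to the final file. Note that plain concatenation keeps all anchors in one YAML document, so anchor names still need to be unique across parts.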

hasufell commented 11 months ago

My workflow is essentially the same. Since releases happen every few months, I also haven't seen the need to automate it. It doesn't happen daily or weekly.

The main use case for me is manual editing.

phadej commented 11 months ago

Also, if the final artifact is always auto-generated, the anchors are only strictly necessary in the inputs, so AFAICS the output could even be .json.

Or is there a use case for anchors spanning different GHC releases (or different tools?), i.e. not self-contained in a single YAML file?

If not, then a YAML-file-including mechanism could also be used instead of cat, embracing YAML to its maximum potential.

chreekat commented 11 months ago

Sort of a naive question here, since I haven't looked into it yet, but I wanted to bring up one use case that might benefit from separate files.

I think a more robust design for GHC Nightlies would independently update the channel for separate platforms. If such independent pipelines only had to worry about their own output file, it might be simpler to implement. If it doesn't happen in GHCup, then it will have to happen in GHC CI. I can do the work either way, but doing it in GHCup seems like it would have additional benefits.

hasufell commented 11 months ago

I think a more robust design for GHC Nightlies would independently update the channel for separate platforms. If such independent pipelines only had to worry about their own output file, it might be simpler to implement.

This is also simple to implement if a python script outputs the entire thing and only adds the platforms for which the jobs succeeded.

I can do the work either way, but doing it in GHCup seems like it would have additional benefits.

At the moment I have very low appetite to be involved in nightlies issues again, unless I see significant improvement from GHC's side. So I'm happy it isn't directly a GHCup issue.

bgamari commented 11 months ago

This is also simple to implement if a python script outputs the entire thing and only adds the platforms for which the jobs succeeded.

Unfortunately I don't believe that is true since, again, PyYAML does not expose control over anchor names.
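
One way around a serializer that picks its own anchor names is to skip the serializer and emit the YAML text directly, as sketched below. This is not something proposed in the thread, only an illustration: the anchor naming scheme, the input shape, and the `dlUri`/`dlHash` field layout are assumptions and have not been checked against the actual metadata schema. Platforms whose job failed are passed as `None` and left out of the output.

```python
from typing import Optional


def anchor_name(version: str, platform: str) -> str:
    """Deterministic name, unique as long as (version, platform) is."""
    return f"ghc-{version}-{platform}"


def render_snippet(version: str,
                   bindists: dict[str, Optional[dict[str, str]]]) -> str:
    """Emit a YAML block with explicit anchors for successful platforms only."""
    lines = [f'"{version}":']
    for platform, bindist in sorted(bindists.items()):
        if bindist is None:  # job failed: omit this platform
            continue
        lines.append(f"  {platform}: &{anchor_name(version, platform)}")
        lines.append(f"    dlUri: {bindist['uri']}")
        lines.append(f"    dlHash: {bindist['sha256']}")
    return "\n".join(lines) + "\n"


# Hypothetical nightly results: one success, one failure.
out = render_snippet("9.9.20231001", {
    "x86_64-linux": {"uri": "https://example.invalid/a.tar.xz", "sha256": "abc"},
    "aarch64-darwin": None,
})
```

The trade-off is that hand-written emission gives up the escaping and quoting a YAML library would provide, so it is only reasonable while the emitted values are simple scalars such as URLs and hex digests.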

For what it's worth, the need to maintain both ghcup-x.y.z.yaml and ghcup-vanilla-x.y.z.yaml would also be eliminated if these files were generated.

haskell / ghcup-metadata

Split up source metadata into multiple files #134