[Bundler] RFC Proposal: Machine-readable output for `update` command

High-level proposal

Add a new flag to the update command that produces machine-readable output, so that Bundler integrates better with tooling. For example:

$ bundle update --machine-readable
[
  {
    "name": "nokogiri",
    "old": "1.13.6",
    "new": "1.13.8"
  },
  {
    "name": "rails",
    "old": "6.1.6",
    "new": "7.0.3.1"
  },
  {
    "name": "tzinfo",
    "old": "2.0.4",
    "new": "2.0.5"
  }
]

Motivation

It would be really useful to have machine-readable output from the update command.

This would make it easier for tooling (CI/CD, security inspections, automated updaters, posting summary messages) to consume and understand the changes made by dependency updates. Currently the only options are parsing the human-readable output of bundle update (which is not guaranteed to stay the same) or diffing the lockfile (whose format, as far as I know, is not guaranteed to stay stable either).

For example, I have a GitHub action that runs daily and checks for gem updates on my projects. If there are updates, it creates a PR and writes a summary of the change in the PR text. I could make that summary easier to read if the output from bundle update were more structured. I could also potentially add a feature to auto-merge any update that only has minor version updates and passes tests. I could add an action to post to Slack when major updates are pending, so I know to check changelogs and see if I need to make any changes before updating.
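To make the auto-merge idea concrete, here is a minimal sketch of how a consumer might classify each change. It assumes the JSON shape shown in the high-level proposal (an array of objects with "name", "old" and "new"); the helper names `major_minor`, `bump_kind` and `safe_to_auto_merge?` are hypothetical, and the policy of treating major bumps and downgrades as unsafe is an assumption for illustration only.

```ruby
require "json"

# Return the [major, minor] segments of a version string as integers.
# Missing or non-numeric segments are treated as 0.
def major_minor(version)
  segments = Gem::Version.new(version).segments
  [segments[0].to_i, segments[1].to_i]
end

# Classify a single change as :downgrade, :major, :minor or :patch.
def bump_kind(old_version, new_version)
  return :downgrade if Gem::Version.new(new_version) < Gem::Version.new(old_version)

  old_major, old_minor = major_minor(old_version)
  new_major, new_minor = major_minor(new_version)

  if new_major != old_major
    :major
  elsif new_minor != old_minor
    :minor
  else
    :patch
  end
end

# Assumed policy: an update is safe to auto-merge when it contains
# no major bumps and no downgrades.
def safe_to_auto_merge?(json_text)
  JSON.parse(json_text).none? do |change|
    %i[major downgrade].include?(bump_kind(change["old"], change["new"]))
  end
end
```

With the example output from the proposal, the rails change (6.1.6 to 7.0.3.1) would be classified as a major bump, so that update would not be auto-merged under this policy.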

This is a sister issue to https://github.com/rubygems/rubygems/issues/5913. That issue would make it easier for humans running bundle update directly to understand what has changed, whereas machine-readable output would make it easier for automated systems that invoke bundle update to produce similar, more useful summaries.

A machine-readable option would also help users who run bundle update directly, but whose requirements or opinions differ from the solution in https://github.com/rubygems/rubygems/issues/5913, to produce output summaries in a format or slicing of their choosing.

Key challenges

Figuring out the changeset would already be required for https://github.com/rubygems/rubygems/issues/5913, and I believe that part is relatively easy: we have sufficient metadata hanging around after doing an update to understand which gems changed and what the old and new versions are.

Producing machine-readable output from that data is also relatively easy. JSON seems like a logical choice. Apparently we cannot use the json library? But producing simple JSON by hand is not hard, so that is not a blocker. YAML is another option, but more tooling generally uses JSON (e.g. jq), and JSON has fewer footguns for both writing and parsing. Another perfectly decent option would be to produce column output, CSV, or some other simpler format, but I think a more structured format such as JSON or YAML will be easier to extend in the future without breaking consumers.
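As a rough illustration of producing simple JSON by hand, the sketch below serializes a list of changes without requiring the json library. The helper names `json_string` and `changes_to_json` and the symbol-keyed input hashes are assumptions for this example, not existing Bundler code.

```ruby
# Short escapes for characters that must not appear raw in a JSON string.
JSON_ESCAPES = {
  '"'  => '\\"',
  "\\" => "\\\\",
  "\n" => "\\n",
  "\r" => "\\r",
  "\t" => "\\t"
}.freeze

# Render a value as a JSON string literal. Quotes, backslashes and
# control characters are escaped; everything else is passed through.
def json_string(value)
  escaped = value.to_s.gsub(/["\\\x00-\x1f]/) do |char|
    JSON_ESCAPES.fetch(char) { format("\\u%04x", char.ord) }
  end
  %("#{escaped}")
end

# Render an array of { name:, old:, new: } hashes as a JSON array,
# indented like the example in the proposal.
def changes_to_json(changes)
  entries = changes.map do |change|
    fields = %i[name old new].map do |key|
      %(    "#{key}": #{json_string(change.fetch(key))})
    end
    "  {\n#{fields.join(",\n")}\n  }"
  end
  "[\n#{entries.join(",\n")}\n]"
end

puts changes_to_json([{ name: "tzinfo", old: "2.0.4", new: "2.0.5" }])
```

Because the structure is a flat list of string fields, string escaping is the only part that needs care here.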

The hardest part would be producing only the machine-readable output, with no human-readable messages mixed in. That's why I am opening an issue to discuss the RFC rather than going straight to creating one.

A lot of different parts of code can and do print to stdout in Bundler, and it would not be trivial to add a flag to bundle update that makes all the output machine readable.

Potential approaches to producing only machine-readable output

There is potentially an easy, albeit quite horrible, way to get machine-readable output that won't have other human-readable messages dropped into the middle of it. We could define some special message, e.g. "MACHINE-READABLE-STARTS-HERE", that is output at the very end, after all human-readable output, followed by the machine-readable summary. This adds an extra hurdle for anyone wanting to consume it: rather than being able to write e.g. `bundle update --machine-readable | jq .`, they would have to do something like `bundle update --machine-readable | sed -e '1,/MACHINE-READABLE-STARTS-HERE/ d' | jq .` instead.
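For consumers written in Ruby rather than shell, the same sentinel handling could look roughly like the sketch below. The sentinel string is the one suggested above; the helper name `extract_machine_readable` is hypothetical, and the sketch assumes that everything after the sentinel line is a single JSON document.

```ruby
require "json"

SENTINEL = "MACHINE-READABLE-STARTS-HERE"

# Discard everything up to and including the sentinel line, then parse
# the remainder of the captured output as JSON.
def extract_machine_readable(output)
  lines = output.lines
  index = lines.index { |line| line.strip == SENTINEL }
  raise ArgumentError, "sentinel line not found" if index.nil?

  JSON.parse(lines.drop(index + 1).join)
end
```

Every consumer would need an equivalent of this step, which is the extra hurdle described above.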

The next easiest, and far nicer method for consumers, would be to carefully pass the option to suppress human-readable messages down through every code path that an update calls that might print out a message. This goes a little deep in places but would be somewhat feasible, although not massively tidy — not all of the code passes down options or context, so that will need to be added in places. This, however, might be fragile: new output messages might be added to other functions in the critical path of update which would then break the machine-readable output.

Conceptually the nicest way to handle this would be to upgrade the message output framework itself in Bundler to have two modes. As all output is already mediated through a single output framework within Bundler, this would then mean that a single switch for human/machine could reliably capture all messages, even those added in the future. This would require substantial changes to that framework, though. But with this method we probably have the best chance to not just capture the summary of version changes, but any other messages relevant to the update. It would also be the most robust and least likely to break.
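A very rough sketch of what such a two-mode design could look like is below. The class and method names (`HumanOutput`, `MachineOutput`, `version_change`, `finish`, `simulated_update`) are hypothetical and are not Bundler's actual output classes. The json library is used only to keep the example short; inside Bundler itself a hand-written serializer might be needed instead.

```ruby
require "json"
require "stringio"

# Hypothetical human mode: prints each message as soon as it arrives.
class HumanOutput
  def initialize(io = $stdout)
    @io = io
  end

  def info(message)
    @io.puts(message)
  end

  def warn(message)
    @io.puts("WARNING: #{message}")
  end

  def version_change(name, old_version, new_version)
    @io.puts("#{name} #{old_version} -> #{new_version}")
  end

  def finish; end
end

# Hypothetical machine mode: buffers everything and emits one JSON
# document at the end, so nothing else is mixed into the output.
class MachineOutput
  def initialize(io = $stdout)
    @io = io
    @messages = []
    @changes = []
  end

  def info(message)
    @messages << { "level" => "info", "text" => message }
  end

  def warn(message)
    @messages << { "level" => "warn", "text" => message }
  end

  def version_change(name, old_version, new_version)
    @changes << { "name" => name, "old" => old_version, "new" => new_version }
  end

  def finish
    @io.puts(JSON.generate("changes" => @changes, "messages" => @messages))
  end
end

# Stand-in for an update run: calling code does not know which mode is active.
def simulated_update(ui)
  ui.info("Fetching gem metadata")
  ui.version_change("tzinfo", "2.0.4", "2.0.5")
  ui.warn("example warning")
  ui.finish
end
```

Because callers only talk to the shared interface, a message added to any code path in the future would be captured in machine mode without further changes, which is the robustness argument made above.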

Key questions

In my mind the two key questions here are:

1) Would we want to make all output machine readable (including warnings about gems that failed to update, warnings about gems that went down in a version, post-install messages, messages about not updating particular groups, messages about platform mismatches, etc, etc), or would we settle for just the info on version changes?

If we do want to capture e.g. warnings and all, that pushes me more towards the larger change of updating how messages are output, so these can all be captured — otherwise it will require a lot of changes on an ad-hoc basis that I think will be more work in the long run, and fragile.

If there is less value in anything except version updates, maybe the amount of changes needed for producing only those without reworking how messages are printed isn't too high.

2) Would there be any interest in making other Bundler commands have machine-readable output?

I haven't yet spent any time thinking about whether there might be value for tooling and other applications in other Bundler commands also having a stable, machine-readable interface. If there is, that would be another reason to change the output handling, so this could be done more easily and reliably on other commands as well.

If we think that only the update command would benefit from machine-readable output, then there's less value in reworking how messages are printed in general throughout Bundler.

Answers to these questions will help inform the big decision of how to handle the outputting problem.

This is a very detailed and thoughtful proposal, so what I suggest may come off as a bit trite.

What I'd propose as an alternative is (1) to add support for CycloneDX SBOMs, so that (2) folks can use cyclonedx diff to generate diffs between two different versions of a gem.

My concern here is that the proposed approach sails perilously close to producing a new, gem-specific SBOM format in disguise, which would hamper adoption by generalised tooling (SCA tools, etc) that is developed outside of the Ruby ecosystem. By using CycloneDX and its diffing capability, I think your requirement to be able to find changes between gem versions would be served without needing a new format to be defined.

> so what I suggest may come off as a bit trite.

Not at all! If there's a way to achieve these goals with less over-engineering, so much the better!

> to generate diffs between two different versions of a gem

I want to double-check: would this still work for projects, as well as gems? For example, I have a Ruby on Rails application. It isn't built as a gem and doesn't ship as one, but I still want to be able to diff my Bundler updates on that project. Would your proposed approach still cover that use-case? (I think the answer is yes, but I'm not familiar with it.)

> My concern here is that the proposed approach sails perilously close to producing a new, gem-specific SBOM format in disguise, which would hamper adoption by generalised tooling (SCA tools, etc) that is developed outside of the Ruby ecosystem. By using CycloneDX and its diffing capability, I think your requirement to be able to find changes between gem versions would be served without needing a new format to be defined.

I think this is really reasonable!

I'm not familiar with CycloneDX. Is there tooling available to easily produce summaries of the diffs sliced by major/minor, or top-level vs implicit dependencies? Would it be easy to answer questions like "does this update bump any major versions?"

My feeling right now is that this is maybe a complementary thing, or maybe a different thing that removes the need for my suggestion, and I'm not sure which. I think it would be a lot easier on the Bundler side to produce a BOM XML file from a lockfile as a new command than it would be to re-jig an existing command to have machine-readable output. I think from the tooling side, the standardised BOM option is probably better for larger shops with multiple languages, but the JSON or YAML on stdout approach is probably easier for smaller tooling jobs?

I think the SBOM approach could be its own proposal potentially. And maybe we'd still want machine-readable output for Bundler commands in general for other reasons, but maybe this is the only Bundler command we want it for, in which case maybe the SBOM fits the need.

> For example, I have a Ruby on Rails application. It isn't built as a gem and doesn't ship as one, but I still want to be able to diff my Bundler updates on that project. Would your proposed approach still cover that use-case?

Yes, that's the general idea of producing an SBOM. You have some body of code X which relies on dependencies Y. X might itself be the dependency of Z, or it might be a standalone codebase. Either way you can generate an SBOM taking X as the starting point.

> I'm not familiar with CycloneDX. Is there tooling available to easily produce summaries of the diffs sliced by major/minor, or top-level vs implicit dependencies? Would it be easy to answer questions like "does this update bump any major versions?"

I'm not sure, but it should be doable with some regular data slicing tools (R, Python, JSON operators in SQL, etc). CycloneDX can be expressed as either XML or JSON; I figure we'd use JSON since it's a little less pokey on the eyes.
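As one illustration of that kind of slicing, here is a minimal Ruby sketch that compares two CycloneDX JSON documents and answers the "does this update bump any major versions?" question. It relies only on the top-level "components" array and each component's "name" and "version" fields; the helper names are hypothetical, and it assumes the two BOMs were generated before and after the update by some other tool.

```ruby
require "json"

# Map component name => version for one CycloneDX JSON document.
def component_versions(bom_json)
  JSON.parse(bom_json).fetch("components", []).each_with_object({}) do |component, versions|
    versions[component["name"]] = component["version"]
  end
end

# List components whose version differs between the two BOMs.
# A nil "old" means the component was added; a nil "new" means removed.
def changed_components(old_bom_json, new_bom_json)
  old_versions = component_versions(old_bom_json)
  new_versions = component_versions(new_bom_json)
  names = (old_versions.keys | new_versions.keys).sort

  names.map { |name| { "name" => name, "old" => old_versions[name], "new" => new_versions[name] } }
       .reject { |change| change["old"] == change["new"] }
end

# True when any component present in both BOMs moved to a higher major version.
def bumps_major?(changes)
  changes.any? do |change|
    next false if change["old"].nil? || change["new"].nil?

    Gem::Version.new(change["new"]).segments.first >
      Gem::Version.new(change["old"]).segments.first
  end
end
```

The result of `changed_components` has the same name/old/new shape as the output proposed at the top of this issue, so either approach could feed the same downstream summaries.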

> My feeling right now is that this is maybe a complementary thing, or maybe a different thing that removes the need for my suggestion, and I'm not sure which ... I think from the tooling side, the standardised BOM option is probably better for larger shops with multiple languages, but the JSON or YAML on stdout approach is probably easier for smaller tooling jobs?

It's true that using one of the two major SBOM formats (the other is SPDX) makes it easier on large shops. But the problem of managing dependencies is one everyone faces. Right now the state of the art is Software Composition Analysis tools: essentially they try a mix of parsing tool output, parsing dependency manifests and lockfiles, and looking up hashes in databases. But the ideal world will be for the tools with the most direct context to generate an SBOM.

I don't mind adding machine-readable output for other commands, but for anything that talks about gem names and versions, I strongly feel we should embrace an existing format.

Let me know how the OWASP CycloneDX community can help.
