Closed: evansd closed this issue 1 week ago
Having investigated the options for fast YAML parsing I think we currently have two, with different trade-offs:
pyyaml is the default package most people reach for (for instance it's already installed in job-server). It's tolerably actively maintained in the sense that people seem to keep the build working for new versions of Python. I'm not sure much in the way of bug fixes or feature development happens though. Most importantly for our purposes, it provides binary wheels for a wide variety of platforms and Python versions.
However it has a significant downside (the reason we initially switched away from it) in that it silently ignores duplicate keys. This is a reasonably likely kind of error for a user to make while copy/pasting actions, and it would mean that the project.yaml would appear valid locally and on Job Server but fail when actually submitted.
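To illustrate the failure mode (assuming pyyaml is installed): `safe_load` silently lets the last duplicate key win, whereas a loader tweaked to check for duplicates, sketched below, surfaces the error. The loader and its behaviour here are illustrative, not what any of our packages actually do:

```python
import yaml  # pyyaml


class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate mapping keys."""


def _construct_mapping(loader, node):
    mapping = {}
    for key_node, value_node in node.value:
        key = loader.construct_object(key_node)
        if key in mapping:
            # pyyaml's default behaviour is to let the last value win silently
            raise yaml.YAMLError(
                f"duplicate key {key!r} at {key_node.start_mark}"
            )
        mapping[key] = loader.construct_object(value_node)
    return mapping


StrictLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _construct_mapping
)

# A copy/paste error of the kind described above (action names illustrative)
doc = """
my_action:
  run: python script_v1.py
my_action:
  run: python script_v2.py
"""

print(yaml.safe_load(doc))  # pyyaml: the second my_action silently wins
try:
    yaml.load(doc, Loader=StrictLoader)
except yaml.YAMLError as exc:
    print("rejected:", exc)
```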
ruamel.yaml.clib is the compiled backend for ruamel.yaml, which is the parser we used before we switched to ruyaml because that seemed better maintained. This correctly rejects duplicate keys. It also currently has all the binary wheels we need, but these were last published a year ago and I don't see any activity since, so I don't know what its long-term future is.
Both packages parse at approximately the same speed (well over an order of magnitude faster than the pure Python parser).
Overall, given that this is fundamentally a short-term hack to buy us some time, I think we should go with ruamel.yaml.clib and accept that we might have to do something different in future. This means adding the following as dependencies to job-server and our Codespaces setup:

- ruamel.yaml
- ruamel.yaml.clib
It also means helping any users who are struggling with large YAML files locally to install these packages.
Have you checked parsing compatibility on the files of interest for this?
One of the reasons yaml is a bit of a nightmare is that basically no two yaml parsers are actually compatible. They pretend to be, and manage compatibility for simple examples, but the yaml semantics are complicated enough that you inevitably run into discrepancies, and if you're talking about really large YAML files I'd assume by default that some of these are lurking in there somewhere.
We could add it as a dependency to the opensafely package, perhaps? Maybe as an extra, and tell users to do pip install opensafely[fastyaml] if they're having problems?
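For reference, an extra like that is a small packaging change. A sketch of what the opensafely package might declare, assuming PEP 621-style metadata in pyproject.toml (the extra's name is illustrative):

```toml
[project.optional-dependencies]
# Optional compiled YAML parser for users with very large project.yaml files
fastyaml = [
    "ruamel.yaml",
    "ruamel.yaml.clib",
]
```

Users would then opt in with `pip install "opensafely[fastyaml]"`, while a plain `pip install opensafely` stays dependency-free.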
Have you checked parsing compatibility on the files of interest for this?
So. Yes. This is one of the many terrible things about YAML which make me deeply regret its use in OpenSAFELY, and determined that The Glorious Future Pipeline Format will use something else. In this particular case though, the YAML is all pretty vanilla and stays well clear of the ambiguous corners of the spec. The reason they get so large is that we have users who want to run a small number of actions many times with different parameters and so they write scripts which generate these files. No one (I dearly hope) is writing 3MB of YAML by hand.
We could add it as a dependency to the opensafely package, perhaps?
I think that until we switch our whole installation story to using uv we need to stick with our current zero-dependency, vendor-everything model by default. Using an extra is a neat idea which hadn't occurred to me. But I wonder how much it really buys us, given that I don't anticipate this being very widely used, and certainly not part of our official installation story. I'm hoping we can personally guide the few users who need this right now through the installation, and then switch to uv before it becomes a really widespread problem. May be over-optimistic of me though!
No one (I dearly hope) is writing 3MB of YAML by hand.
No, for sure, I wasn't imagining this was manually written, but I've definitely run into cases where YAML emitted by one implementation is not interpreted in the same way by another, and it would make me worried about implementing something like this without an explicit sanity check on compatibility.
By "sanity check" do you just mean checking that existing files parse identically? If so, I have done that and it looks OK, but obviously that's no proof it will always be so. Is there something cleverer we could do here?
I think it helps that the two parsers here share a common lineage. Ultimately the risk of divergence is always going to be present. But it feels like the least worst of our currently available options in the short term.
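For the record, the check described above can be as simple as parsing every file of interest with both parsers and diffing the resulting data structures. A sketch, with the parser callables as stand-ins (demonstrated here with json.loads rather than the real YAML parsers):

```python
import json


def find_parse_mismatches(documents, parse_a, parse_b):
    """Return the names of documents that the two parsers disagree on.

    `documents` maps a name to its raw text; `parse_a`/`parse_b` are any
    callables returning plain data structures (e.g. the fast compiled and
    pure-Python YAML parsers under comparison).
    """
    mismatches = []
    for name, text in documents.items():
        if parse_a(text) != parse_b(text):
            mismatches.append(name)
    return mismatches


# Demonstrated with JSON standing in for YAML:
docs = {"project_a": '{"x": 1}', "project_b": '{"x": [1, 2]}'}
print(find_parse_mismatches(docs, json.loads, json.loads))
```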
@bloodearnest On reflection, I think your suggestion about using an opensafely[fastyaml] extra install is a good idea and we should do it now. I was previously just thinking about local user installs, but of course we'll want the dependencies in job-server and in Codespaces as well. So we really want the opensafely-cli package to be in control of what gets installed.
By "sanity check" do you just mean checking that existing files parse identically? If so, I have done that and it looks OK, but obviously that's no proof it will always be so. Is there something cleverer we could do here?
No, I think that's mostly sufficient to requirements. If there are discrepancies later those are just bugs that can be fixed.
https://github.com/opensafely-core/pipeline/pull/346/files#r1796580792 is the other big sanity check that I think is probably necessary to make this safe to run, but I've added that detail there.
We're using pure Python parsing for two reasons: the pure Python parser gives much more helpful error messages, and it fits our zero-dependency, vendor-everything installation model.
But for very large projects involving megabytes of YAML it's intolerably slow. This makes local opensafely run very difficult to use, and makes it impossible to dispatch jobs in production because the page times out.

A short-term workaround for this would be for the pipeline library to first attempt to parse the YAML using a fast, compiled parser (if one is importable) and, only if that produces errors, to re-parse it using the pure Python parser to get the helpful error messages.
In pseudo-code, what I'm proposing is something like:
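Sketched as runnable Python, with json.loads standing in for both parsers (the actual implementation would use the compiled and pure-Python YAML parsers discussed above; all names here are illustrative):

```python
import json


def parse_with_fallback(text, fast_parse, slow_parse):
    """Try the fast compiled parser first; if it errors, re-parse with the
    slow pure-Python parser so the user gets its friendlier error messages."""
    try:
        return fast_parse(text)
    except Exception:
        # Unhappy path only: the slow parser either succeeds (unlikely, but
        # then we trust its result) or raises its more helpful error.
        return slow_parse(text)


# Stand-ins: json.loads plays the "fast" parser, and a wrapper that adds a
# friendlier message plays the pure-Python one.
def slow_parse(text):
    try:
        return json.loads(text)
    except ValueError as exc:
        raise ValueError(f"while parsing your file: {exc}") from exc


print(parse_with_fallback('{"a": 1}', json.loads, slow_parse))  # happy path
```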
This does make the unhappy path slower, but not by much. And it would massively speed up the happy path, assuming that a fast parser is importable. There are three different contexts we need to think about.
1. Job Server

Here it looks like pyyaml is already available, so there'd be nothing more to do than upgrading the pipeline library. https://github.com/opensafely-core/job-server/blob/4814a7a17d42c55a508fb527c2b3c5a9121027c3/requirements.prod.txt#L794

2. Codespaces
I'm not exactly sure of the mechanics here, but presumably we can use whatever mechanism we do for ensuring that opensafely-cli is installed to also install pyyaml.

3. Running locally
This is obviously the hardest part. I think in the first instance we'd just need to talk the affected users through installing pyyaml (or whatever we choose). That's obviously not sustainable, but it makes it practical right now for these users to interact with their projects locally, which I think is really important.

Longer term, if we move to using uv for local installation then the need to keep all our dependencies as pure Python goes away.