Closed: evansd closed this issue 1 week ago
Having investigated the options for fast YAML parsing I think we currently have two, with different trade-offs:
pyyaml is the default package most people reach for (for instance it's already installed in job-server). It's tolerably actively maintained in the sense that people seem to keep the build working for new versions of Python. I'm not sure much in the way of bug fixes or feature development happens though. Most importantly for our purposes, it provides binary wheels for a wide variety of platforms and Python versions.
However it has a significant downside (the reason we initially switched away from it) in that it silently ignores duplicate keys. This is a reasonably likely kind of error for a user to make while copy/pasting actions, and it would mean that the project.yaml would appear valid locally and on Job Server but fail when actually submitted.
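To illustrate the failure mode (assuming pyyaml is installed): `safe_load` silently lets the last duplicate key win, whereas a loader tweaked to check for duplicates, sketched below, surfaces the error. The loader and its behaviour here are illustrative, not what any of our packages actually do:

```python
import yaml  # pyyaml


class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate mapping keys."""


def _construct_mapping(loader, node):
    mapping = {}
    for key_node, value_node in node.value:
        key = loader.construct_object(key_node)
        if key in mapping:
            # pyyaml's default behaviour is to let the last value win silently
            raise yaml.YAMLError(
                f"duplicate key {key!r} at {key_node.start_mark}"
            )
        mapping[key] = loader.construct_object(value_node)
    return mapping


StrictLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _construct_mapping
)

# A copy/paste error of the kind described above (action names illustrative)
doc = """
my_action:
  run: python script_v1.py
my_action:
  run: python script_v2.py
"""

print(yaml.safe_load(doc))  # pyyaml: the second my_action silently wins
try:
    yaml.load(doc, Loader=StrictLoader)
except yaml.YAMLError as exc:
    print("rejected:", exc)
```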
ruamel.yaml.clib is the compiled backend for ruamel.yaml, which is the parser we used before we switched to ruyaml because that seemed better maintained. This correctly rejects duplicate keys. It also currently has all the binary wheels we need, but these were last published a year ago and I don't see any activity since, so I don't know what its long-term future is.
Both packages parse at approximately the same speed (well over an order of magnitude faster than the pure Python parser).
Overall, given that this is fundamentally a short-term hack to buy us some time, I think we should go with ruamel.yaml.clib and accept that we might have to do something different in future. This means adding the following as dependencies to job-server and our Codespaces setup:

- ruamel.yaml
- ruamel.yaml.clib
It also means helping any users who are struggling with large YAML files locally to install these packages.
Have you checked parsing compatibility on the files of interest for this?
One of the reasons yaml is a bit of a nightmare is that basically no two yaml parsers are actually compatible. They pretend to be, and manage compatibility for simple examples, but the yaml semantics are complicated enough that you inevitably run into discrepancies, and if you're talking about really large YAML files I'd assume by default that some of these are lurking in there somewhere.
We could add it as a dependency to the opensafely package, perhaps? Maybe as an extra, and tell users to do pip install opensafely[fastyaml] if they're having problems?
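For reference, an extra like that is a small packaging change. A sketch of what the opensafely package might declare, assuming PEP 621-style metadata in pyproject.toml (the extra's name is illustrative):

```toml
[project.optional-dependencies]
# Optional compiled YAML parser for users with very large project.yaml files
fastyaml = [
    "ruamel.yaml",
    "ruamel.yaml.clib",
]
```

Users would then opt in with `pip install "opensafely[fastyaml]"`, while a plain `pip install opensafely` stays dependency-free.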
Have you checked parsing compatibility on the files of interest for this?
So. Yes. This is one of the many terrible things about YAML which make me deeply regret its use in OpenSAFELY, and determined that The Glorious Future Pipeline Format will use something else. In this particular case though, the YAML is all pretty vanilla and stays well clear of the ambiguous corners of the spec. The reason they get so large is that we have users who want to run a small number of actions many times with different parameters and so they write scripts which generate these files. No one (I dearly hope) is writing 3MB of YAML by hand.
We could add it as a dependency to the opensafely package, perhaps?
I think that until we switch our whole installation story to using uv we need to stick with our current zero-dependency, vendor-everything model by default. Using an extra is a neat idea which hadn't occurred to me. But I wonder how much it really buys us, given that I don't anticipate this being very widely used, and certainly not part of our official installation story. I'm hoping we can personally guide the few users who need this right now through the installation, and then switch to uv before it becomes a really widespread problem. May be over-optimistic of me though!
No one (I dearly hope) is writing 3MB of YAML by hand.
No, for sure, I wasn't imagining this was manually written, but I've definitely run into cases where YAML emitted by one implementation is not interpreted in the same way by another, and it would make me worried about implementing something like this without an explicit sanity check on compatibility.
By "sanity check" do you just mean checking that existing files parse identically? If so, I have done that and it looks OK, but obviously that's no proof it will always be so. Is there something cleverer we could do here?
I think it helps that the two parsers here share a common lineage. Ultimately the risk of divergence is always going to be present. But it feels like the least worst of our currently available options in the short term.
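For the record, the check described above can be as simple as parsing every file of interest with both parsers and diffing the resulting data structures. A sketch, with the parser callables as stand-ins (demonstrated here with json.loads rather than the real YAML parsers):

```python
import json


def find_parse_mismatches(documents, parse_a, parse_b):
    """Return the names of documents that the two parsers disagree on.

    `documents` maps a name to its raw text; `parse_a`/`parse_b` are any
    callables returning plain data structures (e.g. the fast compiled and
    pure-Python YAML parsers under comparison).
    """
    mismatches = []
    for name, text in documents.items():
        if parse_a(text) != parse_b(text):
            mismatches.append(name)
    return mismatches


# Demonstrated with JSON standing in for YAML:
docs = {"project_a": '{"x": 1}', "project_b": '{"x": [1, 2]}'}
print(find_parse_mismatches(docs, json.loads, json.loads))
```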
@bloodearnest On reflection, I think your suggestion about using an opensafely[fastyaml] extra install is a good idea and we should do it now. I was previously just thinking about local user installs, but of course we'll want the dependencies in job-server and in Codespaces as well. So we really want the opensafely-cli package to be in control of what gets installed.
By "sanity check" do you just mean checking that existing files parse identically? If so, I have done that and it looks OK, but obviously that's no proof it will always be so. Is there something cleverer we could do here?
No, I think that's mostly sufficient to requirements. If there are discrepancies later those are just bugs that can be fixed.
https://github.com/opensafely-core/pipeline/pull/346/files#r1796580792 is the other big sanity check that I think is probably necessary to make this safe to run, but I've added that detail there.
We're using pure Python parsing for two reasons: the pure Python parser gives much more helpful error messages, and it fits our zero-dependency, vendor-everything installation model.
But for very large projects involving megabytes of YAML it's intolerably slow. This makes local opensafely run very difficult to use, and makes it impossible to dispatch jobs in production because the page times out.

A short-term workaround for this would be for the pipeline library to first attempt to parse the YAML using a fast, compiled parser (if one is importable) and, only if that produces errors, to re-parse it using the pure Python parser to get the helpful error messages.
In pseudo-code, what I'm proposing is something like:
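Sketched as runnable Python, with json.loads standing in for both parsers (the actual implementation would use the compiled and pure-Python YAML parsers discussed above; all names here are illustrative):

```python
import json


def parse_with_fallback(text, fast_parse, slow_parse):
    """Try the fast compiled parser first; if it errors, re-parse with the
    slow pure-Python parser so the user gets its friendlier error messages."""
    try:
        return fast_parse(text)
    except Exception:
        # Unhappy path only: the slow parser either succeeds (unlikely, but
        # then we trust its result) or raises its more helpful error.
        return slow_parse(text)


# Stand-ins: json.loads plays the "fast" parser, and a wrapper that adds a
# friendlier message plays the pure-Python one.
def slow_parse(text):
    try:
        return json.loads(text)
    except ValueError as exc:
        raise ValueError(f"while parsing your file: {exc}") from exc


print(parse_with_fallback('{"a": 1}', json.loads, slow_parse))  # happy path
```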
This does make the unhappy path slower, but not by much. And it would massively speed up the happy path, assuming that a fast parser is importable. There are three different contexts we need to think about.
1. Job Server

Here it looks like pyyaml is already available, so there'd be nothing more to do than upgrading the pipeline library. https://github.com/opensafely-core/job-server/blob/4814a7a17d42c55a508fb527c2b3c5a9121027c3/requirements.prod.txt#L794

2. Codespaces
I'm not exactly sure of the mechanics here, but presumably we can use whatever mechanism we do for ensuring that opensafely-cli is installed to also install pyyaml.

3. Running locally
This is obviously the hardest part. I think in the first instance we'd just need to talk the affected users through installing pyyaml (or whatever we choose). That's obviously not sustainable, but it makes it practical right now for these users to interact with their projects locally, which I think is really important.

Longer term, if we move to using uv for local installation then the need to keep all our dependencies as pure Python goes away.