opensafely-core / job-server

A server for mediating jobs that can be run in an OpenSAFELY secure environment. q.v. job-runner
https://jobs.opensafely.org

Outline of sync protocol for job-runner and job-server #84

Closed evansd closed 3 years ago

evansd commented 3 years ago

Apologies that this isn't as clear as it could be! It's a bit of a thinking-aloud piece for my own purposes, but I'm reasonably convinced that it will work and that I can explain it properly in time (and also that there isn't anything radically simpler that would work instead).


job-server is responsible for creating/updating JobRequest
job-runner is responsible for creating/updating Job

Both services have a copy of both tables, kept up-to-date by a pair of one-way syncs i.e.

This is slightly complicated by the fact that the only form of communication possible is the job-runner polling the job-server. (Aside: as I write this I've realised we could in principle use a websocket to allow two-way communication, but I don't think that really helps us.)

job-server has a web API which (among other things) has:

job-runner has a pair of "sidecar" processes:

If the syncing breaks for any reason (network issue or if one of the applications goes down) each application should be able to keep running independently and everything will resync when the connection is re-established.

How do we know what's been newly updated? Roughly speaking, by having a last_modified datetime on each model. However this isn't quite sufficient because:

We deal with this by:

So, specifically:

pull process

The job-runner "pull" process makes a request to the JobRequest endpoint on the job server In the request params it includes the maximum (timestamp, id) pair it has for all JobRequests in its database. The response contains all JobRequests where (timestimp, id) > (request_timestamp, request_id) in ascending (timestamp, id) order. (Subject to the limitation that the timestamps must be at least Nms in the past). The job-runner pull process then processes these JobRequests and commits them to its local db in the order it receives them.

push process

The job-runner "push" process stores in memory a (timestamp, id) pair which is the maximum (timestamp, id) stored on the job-server database. (We'll deal with how this gets set in a moment.) It polls its own local database for any more recent updates (subject to the same Nms buffer time as above). If it finds any it POSTs these to the job-server in (timestamp, id) order. (Could be in batch or one-by-one.) With each request it includes what it thinks is the maximum (timestamp, id) pair the job-server knows about. The job-server checks that this matches the actual maximum as stored in its db. If it does, it processes the updates and returns the new maximum (timestamp, id) If it doesn't, then it ignores the updates and returns the true maximum value. In either case, the push process updates its internal record of the maximum and continues. When the push process first starts it sets its internal maximum to whatever the maximum in its local database is (i.e. it assumes the job-server is fully up-to-date) and POSTs an empty update to the job server. If the job-server is up-to-date it will just handle the empty update and do nothing but if it's not up-to-date then this gives it the opportunity to correct the job-runner's state. As an extra layer of robustness, the push process will have some maximum time it will wait between updates. So if it doesn't find any updated records for N minutes it will POST an empty update (just as it does on start up). This ensures that if the job-server loses state for any reason it will eventually recover without having to restart the job-runner push process.

evansd commented 3 years ago

Further thought on this: we need to think about how this will work with multiple backends. Presumably the idea is that we'll have a single job-server instance and multiple job-runner instances, each hosted in a different environment.

For syncing JobRequests, this should be easy. We just need a backend filter on the GET endpoint which lists new JobRequests so each backend fetches only the JobRequests which are relevant to it.

Syncing Jobs is a little more involved, but not too bad. We need a field on Job (as stored by job-server) which tracks which backend a given Job came from. When the POST endpoint checks the (timestamp, id) pair to confirm that it matches its own latest state it needs to filter to only Jobs from the relevant backend.
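A small sketch of what that backend-scoped check could look like on the job-server (Django-style; model and field names are assumptions):

```python
# Sketch of backend-aware cursor lookup on the job-server; model and field
# names are assumptions, not the real schema.
def latest_job_cursor(backend_name):
    # The (timestamp, id) check only considers Jobs from this backend, so
    # multiple job-runner instances can't invalidate each other's pushes.
    latest = (
        Job.objects.filter(backend=backend_name)
        .order_by("-last_modified", "-id")
        .first()
    )
    return (latest.last_modified, latest.id) if latest else None
```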

The tricky bit will be primary keys as multiple backends will end up creating Jobs with the same auto-increment id. I can think of a few ways of dealing with this.

The above scheme should work even if we want a single JobRequest to create Jobs on multiple backends. We'd just need to make sure the JobRequest endpoint returned such JobRequests to all the relevant backends, which would then create the related Jobs. These Jobs would then get synced back to job-server and end up related to the correct JobRequest even if they come from different backends.

sebbacon commented 3 years ago

Presumably the idea is that we'll have a single job-server instance and multiple job-runner instances, each hosted in a different environment

Yes, current design works pretty much this way. There's an environment variable telling a runner what backend it is, and a field on a job model saying what backend it should run in.

Compound primary keys: I've never tried in earnest but I wonder why this wouldn't work. But yeh, "unexpected complexity"

ghickman commented 3 years ago

Sync State: (timestamp, id) sounds great, and looks like it covers the expected failure cases (server or runner being offline for $time, etc). Should we treat this like HTTP's ETag and include it as a header on all requests and responses?

GitHub do something similar with their remaining rate limit data. Since it's tangential to the payload data it fits nicely into the HTTP header model.

Cross System IDs: UUIDs seem like the logical choice here. There's a UUIDField in Django.

it doesn't matter whether our keys have a natural ordering

This applies to the UI as well, we can use created_at for JobRequests and #81 sorts Jobs in a JobRequest topologically.

I wonder why this wouldn't work

Unmanaged models will definitely add a lot of unexpected complexity!

Cross Backend JobRequests: This sounds like a lot of added complexity?

In my mind we'd make JobRequest unique to a backend, pulling that field up from Job. The UI could then present a list of available backends on the Create JobRequest form, creating the necessary number of JobRequests under the hood. From the UI side of things, having a JobRequest contain a group of Jobs running on one backend would simplify defining state for that request, e.g. "some Jobs have failed on backend A, but all succeeded on backend B" shouldn't "pollute" a JobRequest's status.

evansd commented 3 years ago

UUIDs seem like the logical choice here

Great. I think so too, for the reasons already covered and also for some new reasons which I'll post in a separate comment.

Cross Backend JobRequests: This sounds like a lot of added complexity?

Yes, I agree there's potentially a lot of complexity here and I'm not suggesting we should do it (and certainly not right now). I just wanted to convince myself that the sync protocol would allow for this without further changes.

Should we treat this like HTTP's ETAG and include it as a header on all requests and responses?

Yes, that's a good idea. There may even be some HTTP semantics that exactly match the behaviour we want from the POST endpoint (something to do with only performing an update if the value of a certain header matches) but I'd have to look that up.

evansd commented 3 years ago

It strikes me that one further event it would be useful to be able to handle gracefully is a total reset of state on either the job-runner or job-server i.e. blowing away the existing database and starting from scratch. I can imagine circumstances where this might happen either deliberately or accidentally and it would be nice if this was something the system could take in its stride.

This is one further argument for UUIDs over auto-increment IDs because we no longer have to worry about primary key clash in these circumstances.

Below is how we could handle each table in the event that either application resets its state.

job-runner resets state

JobRequest: I think the only change we have to make is that when the JobRequest endpoint receives a request to list JobRequests and the supplied (timestamp, id) pair is null (which is what will happen when the job-runner has no JobRequests in its database), then rather than returning all JobRequests, it should return only those for which it has no associated Job for that backend. Otherwise what will happen is that the job-runner will pick up all historical JobRequests and start executing them. The existence of an associated Job shows that this JobRequest was already processed by that backend and shouldn't be processed again. Once the job-runner has one JobRequest in its database it will only ask for more recent updates and so everything should proceed as normal.
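A rough sketch of that endpoint behaviour (Django-style; model and field names are assumptions):

```python
# Sketch of how the JobRequest list endpoint might treat a null cursor
# (i.e. a job-runner that has reset its state); names are assumptions.
from django.db.models import Q

def job_requests_for_sync(backend_name, cursor):
    queryset = JobRequest.objects.filter(backend=backend_name)
    if cursor is None:
        # Fresh job-runner: only return JobRequests this backend has never
        # produced a Job for, so historical work isn't re-executed
        return queryset.exclude(jobs__backend=backend_name)
    timestamp, last_id = cursor
    # Normal case: everything strictly after the (timestamp, id) cursor
    return queryset.filter(
        Q(last_modified__gt=timestamp) | Q(last_modified=timestamp, id__gt=last_id)
    ).order_by("last_modified", "id")
```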

Job: Nothing special needs to be done here. The job-runner will POST an initial update with a null (timestamp, id) pair, the job-server will say "this doesn't match what I have, which is (timestamp-of-last-job, X)" and the job-runner will proceed as normal.

job-server resets state

JobRequest: Nothing special needs to be done here. job-runner will continue to pick up new JobRequests as they are created.

Job: The issue here is that the job-runner might post updates to Jobs where the associated JobRequest now doesn't exist in the job-server database, causing an integrity error. I think the right thing is probably for the job-server to just silently drop such updates (with an appropriate log message).

This leaves us with another issue, which is that the job-runner will then attempt to sync its entire Job table up to the job-server and, as all these writes will be silently dropped, it will just continue to try to do this until a new Job gets created. This has the potential to turn into a bit of a self-DoS attack.

I think the solution here is that where the job-server has no Jobs in its database, rather than returning a null timestamp it should return the timestamp of the oldest JobRequest in its database, i.e. "please don't tell me about any Jobs that predate my earliest JobRequest". Where there are no JobRequests in its database either it should just return the current timestamp.
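A small sketch of that fallback (model and field names are assumptions):

```python
# Sketch of the fallback cursor the job-server could return when it has no
# Jobs yet; model/field names are assumptions.
from django.utils import timezone

def job_sync_cursor():
    latest_job = Job.objects.order_by("-last_modified", "-id").first()
    if latest_job is not None:
        return (latest_job.last_modified, latest_job.id)
    oldest_request = JobRequest.objects.order_by("created_at").first()
    if oldest_request is not None:
        # "Please don't tell me about Jobs that predate my earliest JobRequest"
        return (oldest_request.created_at, 0)
    # No JobRequests either: nothing before "now" is interesting
    return (timezone.now(), 0)
```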


You may say I'm overthinking all this (and you may well be right!) but in my experience getting this state management stuff right at the beginning really pays off both because state related bugs can be a nightmare to diagnose and unpick and also because it's incredibly liberating to be able to kick your system around with reckless abandon and know that (within reason) it will recover itself gracefully without intervention.

ghickman commented 3 years ago

You may say I'm overthinking all this

TLDR: not at all.

So far, not at all! We have some really interesting architecture to incorporate, and as you said "getting this state management stuff right at the beginning really pays off"!


a total reset of state

In my experience, just having this for local dev so no-one has to remember/script some arcane placing of various pieces "just to list some jobs" is really important for dev focus and cadence.

The existence of an associated Job shows that this JobRequest was already processed by that backend

Yes! This fits so nicely with our "runner handles Jobs, server handles JobRequests" split.


This is starting to get a little complex, should we try to formalise this in a rudimentary spec document defining each of the use cases/failure modes you've thought of so far with expected responses? It probably makes sense for me to write that and you to review it so we can check we're on the same page?

evansd commented 3 years ago

I'm happy to go with that if you think that would be helpful. My instinct would be that it might be most productive to start building it based on the notes here and then write up the spec alongside the process of actually writing code. It's usually only when you have to start building something that the inadequacies of the spec are revealed! And unless we think the process of formalising the spec will reveal that we have to throw the whole design away and start again I don't know how much extra we'll gain by it.

That's just an instinctive response though, and certainly open to other opinions.

ghickman commented 3 years ago

Spec might have been the wrong word to use, it potentially implies something stricter than I was going for! Reference document is possibly the better word? My main aim is to turn the many many (good!) words above this comment into a (living!) document which I can refer to 🙂

evansd commented 3 years ago

Ah right, got you! Yeah, that does sound like a good idea.

sebbacon commented 3 years ago

The issue here is that the job-runner might post updates to Jobs where the associated JobRequest now doesn't exist in the job-server database, causing an integrity error.

Why shouldn't the job server create a JobRequest if none exists?

sebbacon commented 3 years ago

My main aim is to turn the many many (good!) words above this comment into a (living!) document which I can refer to

How about posting the summary here?

evansd commented 3 years ago

Why shouldn't the job server create a JobRequest if none exists?

It won't know any of the relevant details, e.g. who created it.

How about posting the summary here?

Yes, you could just make it a comment here that we edit as we go along to keep up-to-date.

sebbacon commented 3 years ago

Why shouldn't the job server create a JobRequest if none exists? It won't know any of the relevant details e.g. who created it

No, but it could still be a useful entity for reviewing stuff. An <unknown user> JobRequest, by which all jobs are grouped in the web UI?

bloodearnest commented 3 years ago

Caveat: I may be missing context here, so feel free to shoot this down if there's something obvious I've missed.

First, a few observations:

1) If it wasn't for the security boundary, this whole thing would likely be one service. I think that this has taken us down the path of thinking about it like it is one service, and thinking around DB-level syncing. However, it's not a single service, and I think it might be better to treat it as two independent services, with separate implementations, including DB schemas (and even DB types!).

2) The number of jobs we will run via these services is small, <100/day for a good while into the future. Likewise, the data transferred about the job and any status updates or results is very small, a few Kb of JSON perhaps. All the real data is in Github or local to the job-runner. This is small enough scale that we can easily send the entire historic state of a whole job in one go, I think, rather than drip feeding change-by-change.

3) The current proposal is not trivial, and maintains a clock state for the purposes of syncing that requires careful thinking, implementation and testing (it reminds me of Lamport clocks somewhat). It seems to be based on trying to sync at the db table level, e.g. using incrementing primary keys as counters. This does not seem to me like the right level for this problem - we don't need the fast or highly distributed synchronisation that these kinds of approaches are good at. Instead of inferring state syncing via timestamps, I think we can afford to be explicit and verbose about it, even if that is slower.

So, if I may offer a counter-proposal:

1) We separate concerns, and do not leak or rely on implementation details like integer primary keys, and communicate via UUIDs[1] to identify jobs. This means the DB schemas for each service can diverge and we don't need to keep them in sync. This means job-runner can add stuff only it cares about, and job-server likewise. Given job-server is PostgreSQL/Django and job-runner is, I think, SQLite currently, this seems sensible (and those seem like reasonable tech choices). To facilitate a clear interface, we have a shared python library defining the message schemas used to communicate, using something like Marshmallow. Both services use this to validate and serialise messages from/to the other, and it's versioned carefully and with backwards compatibility.
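For example, the shared schema library might look something like this minimal marshmallow sketch (the fields shown are illustrative, not an agreed schema):

```python
# Minimal sketch of a shared message schema using marshmallow; the fields
# shown are illustrative assumptions, not an agreed schema.
from marshmallow import Schema, fields

class JobSchema(Schema):
    identifier = fields.UUID(required=True)
    job_request_identifier = fields.UUID(required=True)
    backend = fields.String(required=True)
    status = fields.String(required=True)
    updated_at = fields.DateTime(required=True)

# Both services would validate inbound and serialise outbound messages with it:
#   jobs = JobSchema(many=True).load(request_json)
#   body = JobSchema(many=True).dump(job_objects)
```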

2) We add the concept of an ACTIVE job. This is a job that has not yet finished from the job-runner's perspective. When a job is done (success or failure) and the job-runner does not need to do anything else with it, it can communicate the final state to the job-server, which will mark that job as INACTIVE in the job-server's local state. This keeps the working set of jobs small, and means we can discard INACTIVE jobs from the set of jobs that we need to care about syncing.

3) The job-runner polls the job-server for jobs. The job-server returns a list of every currently ACTIVE job. INACTIVE jobs naturally drop off this list, keeping the current set to probably <10.

4) When a change happens to a job on the job-runner (e.g. started, progress, failed, finished, ping, restarted), job-runner posts the entire historic state of that job to the job-server, including the latest change. As the job-runner owns this state, the job-server just consumes that entire state as the current state of the job. This means the job-server doesn't need to track previous updates, just the latest, and keeps job-runner in charge of the state, and it's impossible for them to disagree.
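An illustrative shape for that "entire state" payload (field names and values are assumptions):

```python
# Illustrative shape of the full-state payload the job-runner might POST for
# a single Job; field names and values are assumptions.
job_update = {
    "identifier": "4f5a9c1e-0d2b-4f7a-9c1e-8b3d2a6f0e11",    # Job UUID
    "job_request_identifier": "b7e2d9a0-3c4f-4b8a-a1d2-55e6f7a8c9d0",
    "backend": "tpp",
    "status": "succeeded",
    "status_history": [
        {"status": "pending", "at": "2020-10-01T09:00:00Z"},
        {"status": "running", "at": "2020-10-01T09:05:00Z"},
        {"status": "succeeded", "at": "2020-10-01T10:30:00Z"},
    ],
}
```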

If there is a network partition, when things reconnect, job-server will return its current ACTIVE jobs in the poll, and if they've all finished, job-runner will resend the final state and everything will be in sync. Likewise, any new jobs added to the job-server while the partition was in place will be started up and run as soon as the services reconnect.

In a disaster scenario where we lose either the entire TPP server or the job-server state, then we've got bigger problems than spending time now designing for that scenario. These jobs are idempotent and re-runnable, so there's a limit to how much time we should spend trying to cope with extreme failure :)

[1] Whether the UUID is the primary key or not is a service-internal implementation detail. In Django, the default primary key plus a UUID column with a unique constraint should work fine.
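A sketch of that approach (model and field names are assumptions):

```python
# Sketch of the "internal auto-increment PK plus shared UUID" idea from the
# footnote; model and field names are assumptions.
import uuid
from django.db import models

class Job(models.Model):
    # Django's default integer primary key stays internal to this service
    identifier = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)
    backend = models.CharField(max_length=64)
    status = models.CharField(max_length=32)
```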

evansd commented 3 years ago

Thanks Simon, it's really helpful to have a fresh perspective on this and your proposal has helped unstick a blindspot I had. Ultimately, whatever happens we've got the job-runner pulling blobs of JSON down from the job-server and firing blobs of JSON back. And the question is how we determine which blobs need sending (given that we can't get away with sending all of them every time). I'm very happy being explicit, verbose and sending more data than necessary if it buys us simplicity (like you say, there are no real scaling issues here). I just couldn't quite see the simpler solution, hence the Lamport-inspired state tracking.

The path I got stuck down was thinking that it had to be up to the job-runner to ask for the things it was missing, rather than the job-server trying to keep track of what had been sent. In the general case I still think that's right. But because it's the job-runner that declares things INACTIVE, and because once a JobRequest is INACTIVE there's nothing more to be done with it, I think your proposal will work just fine.

On the question of whether we're syncing database tables or exchanging messages, I think the idea (discussed but not written down) is that in practice the schemas in each application will probably track each other pretty closely, but the JSON interface gives flexibility for them to differ - either temporarily during migrations, or because there's extra state one application wants to track for its own purposes which doesn't need to be shared. So I think we're really on the same page here. You're quite right though that we could have an internal primary key in addition to the UUID which doesn't get shared, and that might make things simpler with the ORM.

The one bit I'd disagree with is that total loss of state represents an extreme failure case that it's not worth planning for. Hopefully, in the long term this will end up being the case. However I know that for instance yesterday Seb blew away the job-runner database several times while trying to debug various issues. This stuff is quite ephemeral really - it's the output files on disk we really care about - and it's worth putting a bit of thought into how the system behaves in this case. Fortunately, I think your proposal works out of the box in this scenario!

I'm going to try to write up (my understanding of) your proposal in a separate comment. What you've written above needs a bit of tweaking to e.g. account for the distinction between JobRequests and Jobs.

ghickman commented 3 years ago

To add a bit more context here @bloodearnest and I did a context sharing call yesterday to try and cover any blindspots he had (which I hope we managed to do, it seems like everyone is on the same page now at least!).

Given our discussion previously about state being owned by job-runner I think the idea of INACTIVE/ACTIVE really enforces that boundary (👍). As a knock-on effect it will also simplify inferring a JobRequest's state from its related Jobs.

evansd commented 3 years ago

Summary of where I think we're at

A JobRequest is inactive iff it has at least one associated Job and all associated Jobs are in a terminal state (whether failed or succeeded). A JobRequest is active otherwise.

The job-server has a JobRequests endpoint which takes a backend name as a parameter and returns all active JobRequests for that backend.
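A hedged sketch of that query, using the definition above (model, field and status names are assumptions):

```python
# Sketch of the "active JobRequests for a backend" query; model, field and
# status names are assumptions.
TERMINAL_STATUSES = {"succeeded", "failed"}

def active_job_requests(backend_name):
    active = []
    for job_request in JobRequest.objects.filter(backend=backend_name):
        jobs = list(job_request.jobs.all())
        finished = bool(jobs) and all(j.status in TERMINAL_STATUSES for j in jobs)
        if not finished:
            active.append(job_request)
    return active
```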

The job-runner has a sync service which runs alongside it. It polls the job-server for any active JobRequests. It checks against its local db and creates any new JobRequests and updates existing JobRequests as appropriate. When doing so, it creates any new Jobs as appropriate.

It then takes the list of active JobRequests it received, finds all associated Jobs, and POSTs them back to the job-server.

It sleeps for a bit and then does it all again.
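Putting it together, the sync service amounts to something like this (endpoint paths, local_db helpers and the poll interval are assumptions):

```python
# Minimal sketch of the sync service loop summarised above; endpoint paths,
# local_db helpers and the poll interval are assumptions.
import time
import requests

def sync_forever(job_server_url, backend_name, local_db, interval=30):
    while True:
        # 1. Pull all currently active JobRequests for this backend
        resp = requests.get(
            f"{job_server_url}/api/job-requests/", params={"backend": backend_name}
        )
        resp.raise_for_status()
        job_requests = resp.json()

        # 2. Create/update local copies and any Jobs they imply
        for job_request in job_requests:
            local_db.upsert_job_request_and_jobs(job_request)

        # 3. POST back the full state of every Job related to those JobRequests
        job_request_ids = [jr["identifier"] for jr in job_requests]
        jobs = local_db.jobs_for_job_requests(job_request_ids)
        requests.post(f"{job_server_url}/api/jobs/", json=jobs).raise_for_status()

        # 4. Sleep and repeat
        time.sleep(interval)
```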


While not the most efficient or lowest latency protocol, this has the advantage of being extremely simple and hopefully pretty robust (thanks Simon!).

evansd commented 3 years ago

One small point that needs deciding on is authentication for the job-server's endpoint. I think the JobRequest GET endpoint can just be public. There's nothing secret in there and we were planning to make a public audit log of all JobRequests in any case.

The Jobs POST endpoint will need some kind of auth on it just because we don't want random jokers messing with our state. But I don't think it needs to be anything particularly involved. Possibly just http basic auth with a shared secret would do.

And actually, if we need some kind of auth anyway maybe we should apply it to the GET endpoint too, just so we have complete control over what ends up in the public audit log and what stays private.

sebbacon commented 3 years ago

We actively want the data on the job server to be public:

  1. Transparent, auditable, setting cultural expectations, etc
  2. It's easier to think about security if the rule is "anything that leaves the secure environment can be public"

POSTs need to be protected to prevent spam, DoS, etc. The authentication @ghickman has now implemented is GitHub OAuth, which is definitely the way to go. Membership of the opensafely org gives you posting rights at the moment.

ghickman commented 3 years ago

I think @evansd was talking about job-runner POSTing to job-server when he mentioned basic auth? We can't use GitHub OAuth for the runner to talk to the API, so basic auth should be fine there? (assuming we're going over HTTPS)

sebbacon commented 3 years ago

We can't use GitHub OAuth for the runner to talk to the API

Well, we could, but thinking about it again I guess it seems like unnecessary complexity.

sebbacon commented 3 years ago

It checks against its local db and creates any new JobRequests and updates existing JobRequests as appropriate

When would it create or update JobRequests?

There's the "server database disappeared" scenario, where we've discussed the job-runner potentially making stub/"unknown" JobRequests to group Jobs, but apart from this, I think of a JobRequest as entirely owned by the user; it simply has inferred state based on its associated jobs.

evansd commented 3 years ago

Sorry, I wasn't clear. I meant: create/update its local JobRequests such that they match exactly what it has just received from the job-server.

ghickman commented 3 years ago

It checks against its local db and creates any new JobRequests and updates existing JobRequests as appropriate

When would it create or update JobRequests?

~I think this was intended to happen in the job-runner database?~

Beaten by the ninja update!


Simon brought up yesterday: Does job-runner need to have a local copy of JobRequest? (other than Job.request_id: int)

From the perspective of talking to the API I think it might add some complexity, but I have a feeling not caring about it in job-runner's DB might be a nice simplification of things.

JobRequest models "A Human would like to run a Job with this $config". job-runner deals in Jobs and translating a JobRequest into the necessary Jobs to achieve the request. All the relevant config captured by a JobRequest is copied onto each Job.

So I'm very open to being told I'm wrong here, but what's left for job-runner to care about from a JobRequest in its local DB?

evansd commented 3 years ago

Ah, that's a good point. The JobRequest id will get stored on each of the related Jobs that the job-runner creates, so it doesn't need a JobRequest table at all :+1: Maybe we should log the JobRequests on the job-runner as I can imagine it might be useful for debugging, but they don't need to be part of its formal state.

bloodearnest commented 3 years ago

Question: does the parsing of a JobRequest's associated project.yaml and mapping to multiple Jobs happen in the job-server before it's sent to the job-runner? Or when the job-runner first processes the JobRequest?

That has a big implication for the design of the messages, AIUI.

bloodearnest commented 3 years ago

The one bit I'd disagree with is that total loss of state represents an extreme failure case that it's not worth planning for. Hopefully, in the long term this will end up being the case. However I know that for instance yesterday Seb blew away the job-runner database several times while trying to debug various issues. This stuff is quite ephemeral really - it's the output files on disk we really care about - and it's worth putting a bit of thought into how the system behaves in this case. Fortunately, I think your proposal works out of the box in this scenario!

Great, I had thought through that scenario, but didn't want to lengthen an already long comment with it!

I think the only consideration in this case is surfacing to the user that state was lost and the job has been restarted. Which means we may want to keep the historical state for a Job on the job-server side. If each Job also has its own UUID, and an update from the job-runner carries a Job ID that is different from a previous update's, we can infer that the Job was restarted, and show that info (and optionally the old Job's status) in the UI, or even notify as needed.

That's a UX optimisation that can be added later, though; it'd be good to add Job UUIDs now.

evansd commented 3 years ago

Question: does the parsing of a JobRequest's associated project.yaml and mapping to multiple Jobs happen in the job-server before it's sent to the job-runner?

This is a very important question. Previously it all had to happen on the job-runner because only it had the github credentials to be able to fetch the project.yaml. With the Github auth George has added it would now be possible to do this on the job-server but I still think it belongs on the job-runner for a couple of reasons:

I think a key element though is making sure everything is done with explicit shas so both applications can agree what's being run. If that's the case then job-server can also do dependency resolution if that's useful for UX purposes and know that it will get the same result.

ghickman commented 3 years ago

does the parsing of a JobRequest's associated project.yaml […] happen in the job-server before it's sent to the job-runner?

No… and yes. No, because the parsed config isn't passed to job-runner, job-runner's interpretation of that file is always the canonical source of config. Yes, because of #61, and I think we want to validate project.yaml in the web interface to surface issues early on for the user (and a slight optimisation for job-runner in that it will [hopefully] get fewer invalid JobRequests). As Dave pointed out, we should maintain this boundary since we want job-runner to always be the source of truth for this given it can read the file system.

does the parsing of a JobRequest and mapping to multiple Jobs happen in the job-server before it's sent to the job-runner?

No, job-runner is the only thing which creates Job instances, both in its local db and via the job-server API.

bloodearnest commented 3 years ago

Question: does the parsing of a JobRequest's associated project.yaml and mapping to multiple Jobs happen in the job-server before it's sent to the job-runner?

This is a very important question. Previously it all had to happen on the job-runner because only it had the github credentials to be able to fetch the project.yaml. With the Github auth George has added it would now be possible to do this on the job-server but I still think it belongs on the job-runner for a couple of reasons:

  • It's better if the job-runner doesn't have to trust the job-server any more than necessary.
  • Only the job-runner knows what output files have already been created and therefore which dependent jobs need executing. Although it would be possible to separate these out, it feels like the dependency resolution belongs more naturally with the job-runner.

Right, that's what I had been assuming. I just realised I wasn't 100% sure though!

I think this means the JobRequest is very small, and the job-server needs to be able to accept and display an arbitrary set of Job states from the job-runner, as it has no foreknowledge of the individual Jobs. This seems sensible and robust to me.

Down the line, it might be nice if the job-server can check out and validate the project.yaml before sending the JobRequest to the job-runner, surfacing any issues with it in the UI. But that's another later UX improvement, perhaps.

I think a key element though is making sure everything is done with explicit shas so both applications can agree what's being run. If that's the case then job-server can also do dependency resolution if that's useful for UX purposes and know that it will get the same result.

Good point. Ideally, we'd let the user choose the sha, with the default being current master at the time of submitting the request. This would require auth to GH though.