Discussion: Background job infrastructure

tsenart commented 4 years ago

Sourcegraph's architecture is currently missing infrastructure that makes it easy to run background jobs in response to events happening elsewhere inside Sourcegraph — e.g. repo X got new commits on branch Y, so both language statistics, zoekt indexing and LSIF auto indexing should run.

Every time there's a new similar use case to solve it's a whole novel engineering challenge to undertake:

New "background service" must be setup (e.g. precise-code-intel-indexer, zoekt-indexer, future: language-stats-job) — hard to do, can we make it easier?
New service must take work from somewhere else: a queue in Postgres or Redis, or some other times, an incoming RPC.
Some other times, new service has no way to compute things incrementally, so it must re-compute everything (batch, inefficient).
New service must solve reliability concerns from scratch: retries, poison pills, noisy neighbours, timeouts.

Here is a non exhaustive list of things we already do that could leverage common infrastructure:

Language Statistics are often turned off at large customers because they are so slow to compute at read time. (#11466)
Many Code Insights use cases will need to have data computed and cached ahead of time — the current approach of leveraging search for queries at different points in time won't scale. (cc @felixfbecker)
Zoekt code search indexing is one kind of background job that could leverage more general infrastructure.
Saved searches.
LSIF auto indexing
Campaign jobs
Git repository cloning and periodic fetching
Background permissions, repositories and changesets syncing peg holed into repo-updater.
Maintenance and scheduled clean-up jobs
Other analytics use cases.

But, most importantly, how many new use cases will pop up in the next 2 years that will require engineers to do all of the above to come to a production ready solution?

I think there are two problems that, if solved well, could massively speed up lead time in development of features that require this sort of background work:

Make it easy to setup a new background service that is reliable against intermittent failures, poison pills, noisy neighbours, etc — build standard tooling for this.
Make it easy to react to events happening elsewhere at Sourcegraph (i.e. another service) in real time — setup common infrastructure for this (i.e. Redis Streams, Kafka).

mrnugget commented 4 years ago

I agree with the need for a unified abstraction & infrastructure.

I think the two types of "background job/worker" we currently have in place in campaigns show the two categories of background processing we need to address (and shouldn't conflate in this and further discussion):

Long-running background process that runs on a time interval. In campaigns that's, for example, the ChangesetSyncer that updates the changesets table to reflect the newest state on the code hosts. The ChangesetSyncer currently runs in the enterprise repo-updater.
Background processes that react to events, i.e. jobs. We have a changeset_jobs table that gets an entry when we need to create changeset (i.e. PR) on a code host. Multiple long-running worker goroutines in repo-updater pick up new entries and then execute and mark the job as executed in the DB. These also run in the enterprise repo-updater.

The fact that these run in repo-updater gives us a hint that either (a) repo-updater should be renamed or that (b) we were lacking infrastructure for long-running background processes that have access to the PostgreSQL database.

ryanslade commented 4 years ago

I totally agree that we need this. Below is a bit of a brain dump.

For event handling, I think it's important to think up front about how we group the consumers. My assumption is that workers will be stateless and so can be easily scaled up or down in number. The two extremes are having a single worker service vs a microservice approach. Personally I would prefer keeping the number of workers services as close to 1 as possible. To be clear, that's one service that can have more than one running instance.

The argument for and against are the typical monolith vs microservices arguments. My main concerns in using a single service in our context would be:

Noisy neighbour
Single service would be responsible for many different domains. (I think we could mitigate a lot of this with well structured code though)
Will background workers need to run typescript or anything other than Go?

But, I think the operational complexity of microservices are not worth the cost. Especially for a small team and considering we need to support customer deployments.

Other random thoughts:

Do we want to be able to replay events or can they be forgotten once complete? This narrows the the choice of queue significantly.
All handlers need to be idempotent. Any queue that tells you it can do exactly once delivery is lying :)
Some queue's can do the retry, backoff, timeout logic for us

In terms of actual infrastructure:

Ideally, if we can get away with using Postgres or Redis I think we should
Kafka is pretty heavy to operate. We trialled it at my previous company and it requires more attention than you would expect
We were very happy with NSQ at my previous two jobs (https://nsq.io/)

asdine commented 4 years ago

I agree that there are two problems, and perhaps they need two different solutions.

Problem 2/ comes with a set of issues that need to be taken into consideration. Here are some random thoughts based on my experience, though I'm not sure if everything applies for our use cases:

we need a piece of infrastructure that fits our needs in terms of complexity. Kafka is awesome but it's really heavyweight. As @ryanslade said, if we can use Postgres or Redis to do the job that would be great. What's important here is to make sure we have persistence, control over consumption ordering and a concept of consumer group so that we can parallelize the consumption if needed.
we need to think about the content of each event and backward/forward compatibility. Are they short-lived or are we gonna progressively decide to use them for something more long-lived (analytics for example)? In any case, we may need to think about what encoding format we want to use, if we want to use schemas or not, how we handle schema changes (adding fields, changing field types, etc.). Each of these choices can have its own effects on consumption ordering, duplicate consumption, etc.
we also need to agree on a set of rules and conventions: naming, multiple schemas per topic vs only one, common attributes (event ids, dates, metadata if any) etc., because having to deal with different rules on consumer side can quickly become really complex. Some people have already tried to answer these questions here https://github.com/cloudevents/spec; though it might be heavyweight for our use case, it can help to take a look.

Relying on events can be really nice for decoupling processes but it comes with a set of issues that really need to be thought through, otherwise, the tech debt can quickly be overwhelming.

efritz commented 4 years ago

For some more context (because I've written a handful of things that needed this in the past and got around it in various ways):

A missing use case: The precise code intel backends also maintain a graph of commits per repo in order to calculate nearest uploads that can answer queries. We're soon going to need to rewrite this to support monorepos with very high commit rates (see https://github.com/sourcegraph/sourcegraph/issues/11627).

All handlers need to be idempotent.

ALWAYS

Kafka is awesome but it's really heavyweight.

If we're looking for new infrastructure options, I'd vote to use an AMQP service like RabbitMQ, which is operationally far easier than something like Kafka. However:

Ideally, if we can get away with using Postgres or Redis I think we should.

I agree with this. I think Postgres is a better choice at this point, as both LSIF upload processing and auto indexing use Postgres as a work queue. If it works for other use cases, I can generalize this part of the code intel store package so that it can be extended with other records. I think this would be a nice way to clean up that portion of the store regardless if others want to use this or not.

Initially we used redis to store work and used a celery/resque-type work queue, but that didn't give us some features we wanted (text/field search in the jobs, deleting queued jobs, etc).

The current system also allows for limited retries (to stop poison input from being reprocessed indefinitely) and pushback (we re-queue work for later if a repository is still cloning, for example).

Do we want to be able to replay events or can they be forgotten once complete?

If we can't replay events then workers starting up may need to batch everything in order to get to a consistent state (or accept that accepted RPCs can fail permanently). This may need to be made on a case-by-case basis.

The two extremes are having a single worker service vs a microservice approach. Personally I would prefer keeping the number of workers services as close to 1 as possible.

I think I would agree in general, but LSIF indexing (and eventually campaigns jobs) are going to be an exemption to this: this milestone we're planning on enabling arbitrary build steps that need to run before indexing a repository. I'm not completely sure how to do this yet, but I assume that its not something we'd want in the same compute space as unrelated jobs (or even other indexing jobs).

mrnugget commented 4 years ago

+1 on using Postgres for now. I don't think we have any needs at the moment that Postgres can't fulfil. But we should standardise on how and in which service it's used.

For now, I think, having an easy-to-find place in repo-updater (which should be renamed) where job workers are started would great.

tsenart commented 4 years ago

The one concern I have with using Postgres as a durable queue is its limitations around write rate and performance. Think of Sourcegraph Cloud scale you want to target in 1 year from now. What would the write rate look like? If you think it'd be fine, refactoring and standardizing how we use Postgres as a queue would be ace.

However, I don't think we should use Postgres as an event bus by writing events as rows in tables (i.e. unified log thingy). That would definitely go beyond what it was designed for. There is an interesting feature of Postgres, however, that could help us with this use case: https://tapoueh.org/blog/2018/07/postgresql-listen-notify/

I can imagine a git_commits table that gets populated by gitserver (see #11169). Then LSIF services, Zoekt and any other service that'd be interested in new commits in this table would LISTEN and receive notifications in near real-time. The same could be done for any other table, or more generally, for notifications not bound to a specific table.

We could build shared Go libraries to use this mechanism really easy and robust.

Thoughts on trying this out? What would the success criteria be for such an experiment?

efritz commented 4 years ago

We already use Postgres for event logs and that hasn’t taken down the world yet.

I’ve used https://godoc.org/github.com/lib/pq#Listener with success for a notifications system.

efritz commented 4 years ago

One problem with listen/notify is that we can’t replay events, though, so u less a service has 100% guaranteed uptime we’ll need to have a reconciliation process in place as well (at least on app startup).

Related, @gbrik and I have a meeting later today to discuss how to reorganize the commit metadata for lsif uploads, which is currently a table periodically updated by the precise code intel frontend code and worker. We’re looking at moving this to a different store or denormalizing it somehow to get around current query limits to support monorepos. At first I thought listen notify here would help, but we need to query this table directly after a resolve rev, and notify actions will be async. I think this feature is powerful, but won’t cover all use cases.

unknwon commented 4 years ago

The one concern I have with using Postgres as a durable queue is its limitations around write rate and performance. Think of Sourcegraph Cloud scale you want to target in 1 year from now. What would the write rate look like? If you think it'd be fine, refactoring and standardizing how we use Postgres as a queue would be ace. However, I don't think we should use Postgres as an event bus by writing events as rows in tables (i.e. unified log thingy). That would definitely go beyond what it was designed for. There is an interesting feature of Postgres, however, that could help us with this use case: tapoueh.org/blog/2018/07/postgresql-listen-notify

I think stick with Postgres as long as we could is a good option. As for scale, can't we just set up another Postgres instance/cluster that only for events/notifications? Setting up any other queue service won't give us zero or less operational cost (I could be missing something here as I'm not very experienced in queue services).

keegancsmith commented 4 years ago

Postgres seems like a good first choice for something durable. Background jobs/etc is pretty standard fair, and there are likely 100s of different libs to achieve this in go (including persisting to postgres, but more likely redis). So I think even better is exploring the go landscape and picking something simple, but provides a simple UI we can embed, retries, logs, etc. This would give lots of motivation for switching other background jobs to it (and you will likely get random sourcegraph engineers opting in for the benefits, so spreads out the work).

I went through the list of examples to try and work out what the requirements would be. For a few examples it can be hard to imagine how just having a background job queue infra would help:

They are working on data that lives on a specific node (most jobs on stateful sets like gitserver, zoekt, etc)
They are a scheduler. The scheduler seems separate, but could leverage the queue when deciding to schedule something.

The first point hits a lot of potential use cases. I think ensuring whatever we pick provides a good reporting interface would provide a lot of value. EG even if a use case doesn't use the queue, reporting that a job exists (of some type possibly for a repo@ref) and reporting back some information (like success, retried, some logs, etc) would be of huge value.

Raw ramble of me evaluating the use cases

Language Statistics are often turned off at large customers because they are so slow to compute at read time. (#11466)

This is turned off because it uses up so much resources on gitserver it impacts the whole of Sourcegraph. This would likely need a different approach that more reasonably uses the background worker. For example this uses the same data as anything else that "indexes" a commit (like zoekt). But this would benefit from background queues. The feature would likely end up being different (maybe not available on every commit, or use background queue to more slowly compute)

Many Code Insights use cases will need to have data computed and cached ahead of time — the current approach of leveraging search for queries at different points in time won't scale.

Yeah anything analytics related seems applicable.

Zoekt code search indexing is one kind of background job that could leverage more general infrastructure.

The problem here is the artifacts it produces need to be stored somewhere. So if moved to a general worker tier, we will also need some blob infrastructure (eg minio) and some extra coordination. Additionally things like zoekt will be a very noisy neighbour. This might not be the most applicable example, although if we get nice logs/etc interface it would be good to adopt for visibility.

Saved searches. LSIF auto indexing

These both sounds like good uses cases of a job queue/cron system.

Campaign jobs

Does this mean generating campaign stuff, or the stuff that is synced?

Git repository cloning and periodic fetching

I'm not exactly sure how this would fit in. Maybe the current "fetch queue" could go onto this queue. But that queue is very ephemeral. The scheduler is more customized, so hard to imagine that being managed by some queue infra. Maybe the idea of having some reuse-able background worker reporting would be the highest leverage thing here?

Background permissions, repositories and changesets syncing peg holed into repo-updater.

Yeah these all seem related, but things that use schedulers have potentially mismatched requirements around deciding when to run a job? (like with git fetch).

Maintenance and scheduled clean-up jobs

Sounds good, especially for jobs which only speak to postgres / etc.

Other analytics use cases.

:+1:

efritz commented 4 years ago

LSIF auto indexing // These both sounds like good uses cases of a job queue/cron system.

I'd just like to reiterate that we already have a good job queue system for LSIF processing and indexing.

keegancsmith commented 4 years ago

What is it? Part of the point of this is some standardization and having a clear choice for new efforts. Not that what we do currently in different areas is insufficient. For example there needs to be a clear benefit to switch, which is why I think having nice features like a good UI for insight would be great. Possibly we can adapt what LSIF is using to be more general :)

mrnugget commented 4 years ago

Campaign jobs

Does this mean generating campaign stuff, or the stuff that is synced?

Right now we have multiple things running in the background:

The ChangesetSyncer that's running in repo-updater and regularly syncs changesets.
ChangesetJobs are written to the changeset_jobs table in Postgres and picked up by workers (in repo-updater) that do a SELECT ... FOR UPDATE SKIP LOCKED. But this will likely change in our current iteration and what we'll keep is the "background process picks up 'ready' things from the DB and works on them" but not necessarily the concept of jobs.
A loop that cleans up expired data (running in repo-updater).

efritz commented 4 years ago

@keegancsmith

What is it?

Sorry for the delay, I have a habit of reading emails in bed and forgetting to star some things. Here are some links:

tsenart commented 4 years ago

@efritz: It sounds like extracting that as a reusable library would be useful. Can we make that happen?

efritz commented 4 years ago

@tsenart Of course: https://github.com/sourcegraph/sourcegraph/issues/11786

tsenart commented 4 years ago

@efritz, @mrnugget: After #11786 is done, what existing refactoring opportunities are there? Can we list those in that GitHub issue? Additionally, can we add documentation page to our handbook saying this should be everyone's first choice for the given use case and ensure that our engineering on-boarding docs lead people through it (directly or indirectly)?

mrnugget commented 4 years ago

Other refactoring opportunities:

I know it's just naming, but I really think we should rename repo-updater. background-processor, workers, etc. would all be better names, even though they're super generic.
I think we should document the central place on where to start background process (which is often the repo-updater main() function in enterprise/open).

Regarding documentation: sure.

tsenart commented 4 years ago

I know it's just naming, but I really think we should rename repo-updater. background-processor, workers, etc. would all be better names, even though they're super generic.

But repo-updater is also an HTTP server that exposes endpoints which wouldn't really fit those names, right?

mrnugget commented 4 years ago

But repo-updater is also an HTTP server that exposes endpoints which wouldn't really fit those names, right?

Good point. I don't think an HTTP API per se is an argument against background-processor for example (since I'd expect an internal service to have an API), but the API is scoped to repo-related things.

slimsag commented 4 years ago

Re: naming, see https://github.com/sourcegraph/sourcegraph/issues/9785

mrnugget commented 4 years ago

Can we close this? We've been happily using workerutils in campaigns :)

ryanslade commented 4 years ago

SGTM, we're planning to use it in repo-syncer soon too.

sourcegraph / sourcegraph-public-snapshot

Discussion: Background job infrastructure #11672

Raw ramble of me evaluating the use cases