sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks source link

Validate reconciler works correctly #13648

Closed eseliger closed 3 years ago

eseliger commented 4 years ago

I've just ran a test campaign that creates changesets on GitLab, GitHub and Bitbucket Server, and the behavior seems odd to me.

  1. The BBS PR has been created successfully.
  2. The GitHub PR switched to "failed" correctly, because I reused a branchname.
  3. The GitLab PR stayed queued for quite a long time (few minutes)
  4. The GitHub PR keeps getting retried infinitely (according to the logs)
  5. The GitLab PR failed pushing the commit, now has an error message, but is still "processing" and continuously retries as well.

Screenshot: image

Logs: ``` 18:13:21 repo-updater | INFO Dequeued record for processing, name: , id: 39 18:13:21 repo-updater | INFO Reconciler processing changeset, changeset: 39, action: publish 18:13:21 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:13:21 repo-updater | INFO Dequeued record for processing, name: , id: 37 18:13:21 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:13:21 repo-updater | INFO Reconciler processing changeset, changeset: 37, action: publish 18:13:21 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:13:21 repo-updater | INFO Handled record, name: , id: 38 18:13:21 gitserver | INFO command ran successfully, prefix: 1 gitlab.sgdev.org/sourcegraph/automation-testing , command: git init, duration: 13.465183ms, output: Initialized empty Git repository in /Users/erik/.sourcegraph/repos/.tmp/patch-repo-696411267/.git/ 18:13:21 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git init, duration: 10.256229ms, output: Initialized empty Git repository in /Users/erik/.sourcegraph/repos/.tmp/patch-repo-153787398/.git/ 18:13:21 gitserver | INFO command ran successfully, prefix: 1 gitlab.sgdev.org/sourcegraph/automation-testing , command: git reset -q 3b79a5d479d2af9cfe91e0aad4e9dddca7278150, duration: 11.265111ms, output: 18:13:21 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git reset -q 1e256a405ec07c904f0a4e681c8136cc9fca3b87, duration: 9.065984ms, output: 18:13:21 gitserver | INFO command ran successfully, prefix: 1 gitlab.sgdev.org/sourcegraph/automation-testing , command: git apply --cached -p0, duration: 8.577269ms, output: 18:13:21 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git apply --cached -p0, duration: 6.84489ms, output: 18:13:21 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git commit -m Append Hello World to all README.md files, duration: 15.782106ms, output: [master 945ea51] Append Hello World to all README.md files 18:13:21 gitserver | 1 file changed, 1 insertion(+) 18:13:21 gitserver | INFO command ran successfully, prefix: 1 gitlab.sgdev.org/sourcegraph/automation-testing , command: git commit -m Append Hello World to all README.md files, duration: 18.903035ms, output: [master 8b15bcc] Append Hello World to all README.md files 18:13:21 gitserver | 1 file changed, 1 insertion(+) 18:13:23 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git push --force https://milton:NDg0MTQ3NTE2OTIzOlKZVPDnk591b6NJwYRhvynMHwND@bitbucket.sgdev.org/scm/sour/automation-testing.git 945ea515c806f5665998b95a7dc09f23061e1773:refs/heads/hello-world, duration: 2.428522252s, output: remote: 18:13:23 gitserver | remote: Create pull request for hello-world: 18:13:23 gitserver | remote: https://bitbucket.sgdev.org/projects/SOUR/repos/automation-testing/compare/commits?sourceBranch=refs/heads/hello-world 18:13:23 gitserver | remote: 18:13:23 gitserver | To https://bitbucket.sgdev.org/scm/sour/automation-testing.git 18:13:23 gitserver | * [new branch] 945ea515c806f5665998b95a7dc09f23061e1773 -> hello-world 18:13:23 gitserver | INFO command ran successfully, prefix: 2 bitbucket.sgdev.org/SOUR/automation-testing , command: git update-ref -- refs/heads/hello-world 945ea515c806f5665998b95a7dc09f23061e1773, duration: 7.567189ms, output: 18:13:26 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:13:26 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:13:26 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:13:26 repo-updater | INFO Handled record, name: , id: 38 18:13:26 repo-updater | INFO Marked record as complete, name: , id: 37 18:13:26 repo-updater | INFO Handled record, name: , id: 37 18:13:33 syntect_server | 2020/09/04 16:13:33 request / http://127.0.0.1:33043 18:13:34 syntect_server | 2020/09/04 16:13:34 request / http://127.0.0.1:33043 18:13:36 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:13:36 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:13:36 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:13:36 repo-updater | INFO Handled record, name: , id: 38 18:13:38 syntect_server | 2020/09/04 16:13:38 request / http://127.0.0.1:33043 18:13:38 syntect_server | 2020/09/04 16:13:38 request / http://127.0.0.1:33043 18:13:42 syntect_server | 2020/09/04 16:13:42 request / http://127.0.0.1:42419 18:13:42 syntect_server | 2020/09/04 16:13:42 request / http://127.0.0.1:42419 18:13:46 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:13:46 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:13:46 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:13:46 repo-updater | INFO Handled record, name: , id: 38 18:13:56 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:13:56 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:13:56 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:13:56 repo-updater | INFO Handled record, name: , id: 38 18:14:01 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:01 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:01 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:01 repo-updater | INFO Handled record, name: , id: 38 18:14:06 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:06 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:06 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:06 repo-updater | INFO Handled record, name: , id: 38 18:14:11 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:11 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:11 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:11 repo-updater | INFO Handled record, name: , id: 38 18:14:21 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:21 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:21 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:21 repo-updater | INFO Handled record, name: , id: 38 18:14:31 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:31 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:31 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:31 repo-updater | INFO Handled record, name: , id: 38 18:14:41 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:41 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:41 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:41 repo-updater | INFO Handled record, name: , id: 38 18:14:51 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:51 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:51 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:51 repo-updater | INFO Handled record, name: , id: 38 18:14:56 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:14:56 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:14:56 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:14:56 repo-updater | INFO Handled record, name: , id: 38 18:15:04 gitserver | INFO command failed, prefix: 1 gitlab.sgdev.org/sourcegraph/automation-testing , command: git push --force https://git:Axusi-qWEEeNUBp-hv7s@gitlab.sgdev.org/sourcegraph/automation-testing.git 8b15bcc5ef306ce93ffb6ee15e75a6e59b998505:refs/heads/hello-world, duration: 1m42.896111205s, error: exit status 1, output: error: RPC failed; HTTP 524 curl 22 The requested URL returned error: 524 18:15:04 gitserver | fatal: the remote end hung up unexpectedly 18:15:04 gitserver | fatal: the remote end hung up unexpectedly 18:15:04 gitserver | Everything up-to-date 18:15:04 gitserver | ERROR Failed to push, ref: refs/heads/hello-world, commit: 8b15bcc5ef306ce93ffb6ee15e75a6e59b998505, output: error: RPC failed; HTTP 524 curl 22 The requested URL returned error: 524 18:15:04 gitserver | fatal: the remote end hung up unexpectedly 18:15:04 gitserver | fatal: the remote end hung up unexpectedly 18:15:04 gitserver | Everything up-to-date 18:15:04 repo-updater | WARN Marked record as errored, name: , id: 39, err: creating commit from patch for repository "gitlab.sgdev.org/sourcegraph/automation-testing": gitserver: pushing ref: exit status 1 18:15:04 repo-updater | ``` 18:15:04 repo-updater | $ git push --force https://git:@gitlab.sgdev.org/sourcegraph/automation-testing.git 8b15bcc5ef306ce93ffb6ee15e75a6e59b998505:refs/heads/hello-world 18:15:04 repo-updater | error: RPC failed; HTTP 524 curl 22 The requested URL returned error: 524 18:15:04 repo-updater | fatal: the remote end hung up unexpectedly 18:15:04 repo-updater | fatal: the remote end hung up unexpectedly 18:15:04 repo-updater | Everything up-to-date 18:15:04 repo-updater | ``` 18:15:04 repo-updater | INFO Handled record, name: , id: 39 18:15:06 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:15:06 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:15:06 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:15:06 repo-updater | INFO Handled record, name: , id: 38 18:15:11 web | ℹ 「wdm」: Compiling... 18:15:11 repo-updater | INFO Dequeued record for processing, name: , id: 39 18:15:11 repo-updater | INFO Reconciler processing changeset, changeset: 39, action: publish 18:15:11 repo-updater | INFO Dequeued record for processing, name: , id: 38 18:15:11 repo-updater | INFO Reconciler processing changeset, changeset: 38, action: publish 18:15:11 repo-updater | WARN Marked record as errored, name: , id: 38, err: cannot create changeset on the same branch in multiple campaigns 18:15:11 repo-updater | INFO Handled record, name: , id: 38 18:15:11 gitserver | INFO command ran successfully, prefix: 3 gitlab.sgdev.org/sourcegraph/automation-testing , command: git init, duration: 12.603589ms, output: Initialized empty Git repository in /Users/erik/.sourcegraph/repos/.tmp/patch-repo-142971176/.git/ 18:15:11 gitserver | INFO command ran successfully, prefix: 3 gitlab.sgdev.org/sourcegraph/automation-testing , command: git reset -q 3b79a5d479d2af9cfe91e0aad4e9dddca7278150, duration: 9.087691ms, output: 18:15:11 gitserver | INFO command ran successfully, prefix: 3 gitlab.sgdev.org/sourcegraph/automation-testing , command: git apply --cached -p0, duration: 8.225385ms, output: 18:15:11 gitserver | INFO command ran successfully, prefix: 3 gitlab.sgdev.org/sourcegraph/automation-testing , command: git commit -m Append Hello World to all README.md files, duration: 17.488747ms, output: [master 8b15bcc] Append Hello World to all README.md files 18:15:11 gitserver | 1 file changed, 1 insertion(+) ```
LawnGnome commented 4 years ago

I've been able to reproduce at least part of this:

  1. Publish a campaign that has a GitLab MR.
  2. Change the campaign spec to match a branch that already exists on the GitLab repo.
  3. Re-apply the campaign.

:boom:

I managed to get this crash from the backend:

18:17:00                          frontend |     id
18:17:00                          frontend |     namespaceName
18:17:00                          frontend |     url
18:17:00                          frontend | }
18:17:00                          frontend | VARIABLES
18:17:00                          frontend | ---------
18:17:00                          frontend | map[campaignSpec:Q2FtcGFpZ25TcGVjOiJBNUprS1hhZFZYVCI=]
18:17:00                      repo-updater | INFO Reconciler processing changeset, changeset: 42, action: close
18:17:00                      repo-updater | INFO Reconciler processing changeset, changeset: 41, action: close
18:17:00                      repo-updater | INFO Reconciler processing changeset, changeset: 40, action: close
18:17:00                      repo-updater | panic: runtime error: invalid memory address or nil pointer dereference
18:17:00                      repo-updater | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10abc99]
18:17:00                      repo-updater | goroutine 9627 [running]:
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/cmd/repo-updater/repos.(*GitLabSource).CloseChangeset(0xc00061af40, 0x20db620, 0xc002155a10, 0xc002006780, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/cmd/repo-updater/repos/gitlab.go:347 +0x179
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/enterprise/internal/campaigns.(*reconciler).closeChangeset(0xc0014014a0, 0x20db620, 0xc002155a10, 0xc001730950, 0xc002456000, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/enterprise/internal/campaigns/reconciler.go:285 +0x3c2
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/enterprise/internal/campaigns.(*reconciler).process(0xc0014014a0, 0x20db620, 0xc002155a10, 0xc001730950, 0xc002456000, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/enterprise/internal/campaigns/reconciler.go:81 +0x670
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/enterprise/internal/campaigns.(*reconciler).HandlerFunc.func1(0x20db620, 0xc002155a10, 0x20e4d00, 0xc001442780, 0x20b54a0, 0xc002456000, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/enterprise/internal/campaigns/reconciler.go:44 +0x105
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker.HandlerFunc.Handle(0xc00051b970, 0x20db620, 0xc002155a10, 0x20e4d00, 0xc001442780, 0x20b54a0, 0xc002456000, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/internal/workerutil/dbworker/handler.go:31 +0x7a
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker.(*handlerShim).Handle(0xc00051bb70, 0x20db620, 0xc002155a10, 0x20dbb20, 0xc001fbadc0, 0x20b54a0, 0xc002456000, 0x0, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/internal/workerutil/dbworker/handler_shim.go:25 +0xbc
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/internal/workerutil.(*Worker).handle(0xc0012c0870, 0x20dbb20, 0xc001fbadc0, 0x20b54a0, 0xc002456000, 0xe, 0x0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/internal/workerutil/worker.go:190 +0x2ab
18:17:00                      repo-updater | github.com/sourcegraph/sourcegraph/internal/workerutil.(*Worker).dequeueAndHandle.func2(0xc0012c0870, 0x20b54a0, 0xc002456000, 0x20dbb20, 0xc001fbadc0)
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/internal/workerutil/worker.go:164 +0xd4
18:17:00                      repo-updater | created by github.com/sourcegraph/sourcegraph/internal/workerutil.(*Worker).dequeueAndHandle
18:17:00                      repo-updater |    /home/adam/trees/sourcegraph/sourcegraph/internal/workerutil/worker.go:154 +0x83e

After restarting, the reconciler doesn't crash, but does leave the GitLab MR in the processing state. Manually kicking it back to queued results in the MR being reconciled as expected. This would imply that the crash isn't rolling back the transaction.

More to come tomorrow.

mrnugget commented 4 years ago

but does leave the GitLab MR in the processing state.

I think that's when the reset-stalled-job/MaxStalledAge mechanism of workerutils should kick in.

This would imply that the crash isn't rolling back the transaction.

I'm sure you've found this by now, but the dequeue-and-set-to-processing happens outside the transaction:

https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/workerutil/dbworker/store/store.go#L259-275

So if a worker crashes while it's processing a record the transaction that modified the record is rolled back (well, it's not committed), but the status is still set to processing. The new worker process that you boot up should reset the stalled jobs after MaxStalledAge

mrnugget commented 4 years ago

btw. continually retrying is a feature, not a bug, see https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-685419544

davejrt commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.20 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

LawnGnome commented 4 years ago

To be clear, this didn't restart after StalledMaxAge, so I'm going to investigate this further in 3.21, although the immediate issue was fixed (or at least obscured) by #13790.

LawnGnome commented 4 years ago

This hasn't been observed again, so deprioritising for now.

eseliger commented 3 years ago

Given that we had a couple of weeks to find this "by accident" and we didn't, I guess we can close this ticket? @sourcegraph/campaigns

chrispine commented 3 years ago

Yes, that's fine with me.