Create PullRequest ends with 500 Internal Server Error after ~60s with 'context canceled' events in v1.20.6 (and previously in v1.18.5)

zcopado commented 2 months ago

Description

We have found that various Pull Requests are failing with a 500 Internal Server Error after >60s of executing the CompareAndPullRequestPost (pull.go).

Here are the typical, overall events we get:

2024/08/07 18:06:44 ...s/process/manager.go:188:Add() [T] Start 66b3b7b4: POST: /ATrain/ATrainProd/compare/QA...feature/US-0000350 (request)
2024/08/07 18:06:44 ...eb/routing/logger.go:47:func1() [T] router: started   POST /ATrain/ATrainProd/compare/QA...feature/US-0000350 for 10.0.0.76:53958
2024/08/07 18:06:47 ...eb/routing/logger.go:68:func1() [W] router: slow      POST /ATrain/ATrainProd/compare/QA...feature/US-0000350 for 10.0.0.76:53958, elapsed 3244.2ms @ repo/pull.go:1173(repo.CompareAndPullRequestPost)
2024/08/07 18:07:44 ...eb/routing/logger.go:102:func1() [I] router: completed POST /ATrain/ATrainProd/compare/QA...feature/US-0000350 for 10.0.0.76:53958, 500 Internal Server Error in 60207.1ms @ repo/pull.go:1173(repo.CompareAndPullRequestPost)

In between these anchor events are a myriad of "merge" events, as follows:

2024/08/07 18:07:42 ...s/process/manager.go:188:Add() [T] Start 66b3b7ee-3: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file aaf833fe8bae4d3888628dcffb756c7ff6b96c19 (from 66b3b7b4) (normal)
2024/08/07 18:07:42 ...s/process/manager.go:231:remove() [T] Done 66b3b7ee-3: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file aaf833fe8bae4d3888628dcffb756c7ff6b96c19
2024/08/07 18:07:42 ...dules/git/command.go:281:Run() [D] git.Command.RunDir(/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file fd79dfd6bd6a9bd237522ab6ab995bd8da885104
2024/08/07 18:07:42 ...s/process/manager.go:188:Add() [T] Start 66b3b7ee-4: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file fd79dfd6bd6a9bd237522ab6ab995bd8da885104 (from 66b3b7b4) (normal)
2024/08/07 18:07:42 ...s/process/manager.go:231:remove() [T] Done 66b3b7ee-4: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file fd79dfd6bd6a9bd237522ab6ab995bd8da885104
2024/08/07 18:07:42 ...dules/git/command.go:281:Run() [D] git.Command.RunDir(/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= merge-file /data/tmp/local-repo/pull.git3245706206/.merge_file_CdGFfJ .merge_file_iBlLhG .merge_file_PdLANN
2024/08/07 18:07:42 ...s/process/manager.go:188:Add() [T] Start 66b3b7ee-5: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= merge-file /data/tmp/local-repo/pull.git3245706206/.merge_file_CdGFfJ .merge_file_iBlLhG .merge_file_PdLANN (from 66b3b7b4) (normal)
2024/08/07 18:07:42 ...s/process/manager.go:188:Add() [T] Start 66b3b7ee-5: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= merge-file /data/tmp/local-repo/pull.git3245706206/.merge_file_CdGFfJ .merge_file_iBlLhG .merge_file_PdLANN (from 66b3b7b4) (normal)
2024/08/07 18:07:43 ...rvices/pull/patch.go:273:AttemptThreeWayMerge() [T] Conflict: force-app/main/default/profiles/Salesforce User A.profile-meta.xml in PR[0] Atrain/AtrainProd#0
2024/08/07 18:07:43 ...rvices/pull/patch.go:131:attemptMerge() [T] Attempt to merge:
1 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 3417b57828da9071337396ed67a28a1e0382705d
2 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml cfd2597620a7fe8e375ea6f1533ac3f5d4a1b302
3 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 85be4b318251fa3a1b44baf79b17c77acb9744f0
2024/08/07 18:07:43 ...dules/git/command.go:281:Run() [D] git.Command.RunDir(/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file 3417b57828da9071337396ed67a28a1e0382705d
2024/08/07 18:07:43 ...s/process/manager.go:188:Add() [T] Start 66b3b7ef: git(dir:/data/tmp/local-repo/pull.git3245706206): /usr/bin/git -c protocol.version=2 -c credential.helper= unpack-file 3417b57828da9071337396ed67a28a1e0382705d (from 66b3b7b4) (normal)
2024/08/07 18:07:43 ...ll/patch_unmerged.go:162:unmergedFiles() [T] Got line: 2 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 59c3d866539d78cae63a0408580fd2e2724e7d41 Current State:
1 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 17342757b30e2abfb300cd383dffca51bbad6301
<nil>
<nil>
2024/08/07 18:07:43 ...ll/patch_unmerged.go:162:unmergedFiles() [T] Got line: 3 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 3139d19367a48edfcb9703155a98ea56750ceb39 Current State:
1 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 17342757b30e2abfb300cd383dffca51bbad6301
2 100644 force-app/main/default/profiles/Salesforce User A.profile-meta.xml 59c3d866539d78cae63a0408580fd2e2724e7d41
<nil>

And here are the [E]rror events RIGHT BEFORE the final 500 Internal Server Error event:

2024/08/07 18:07:44 ...ll/patch_unmerged.go:162:unmergedFiles() [T] Got line: 0    git ls-files -u -z: context canceled Current State:
1 100644 force-app/main/default/triggers/OpportunityPartner_AfterInsertAfterUpdateBeforeDelete.trigger-meta.xml 5a0e2a58b7f57011dbb9235ebef40f54a24a7ab6
<nil>
3 100644 force-app/main/default/triggers/OpportunityPartner_AfterInsertAfterUpdateBeforeDelete.trigger-meta.xml 5a0e2a58b7f57011dbb9235ebef40f54a24a7ab6

2024/08/07 18:07:44 ...ll/patch_unmerged.go:164:unmergedFiles() [E] Unable to run ls-files -u -z! Error: git ls-files -u -z: context canceled
2024/08/07 18:07:44 ...ers/web/repo/pull.go:1297:CompareAndPullRequestPost() [E] NewPullRequest: unable to get base object: 59c3d866539d78cae63a0408580fd2e2724e7d41 at path: force-app/main/default/profiles/Salesforce User A.profile-meta.xml for merging. Error: signal: killed
2024/08/07 18:07:44 .../context_response.go:74:HTML() [D] Template: status/500
2024/08/07 18:07:44 ...eb/routing/logger.go:102:func1() [I] router: completed POST /ATrain/ATrainProd/compare/QA...feature/US-0000350 for 10.0.0.76:53958, 500 Internal Server Error in 60207.1ms @ repo/pull.go:1173(repo.CompareAndPullRequestPost)

The interesting part of this is that we UPGRADED this instance TODAY to v1.20.6, and it did not fix this problem.

Prior to the upgrade, we were running v1.18.5, and had the same results (with some different events BEFORE the end):

2024/08/07 09:42:47 ...es/pull/temp_repo.go:61:createTemporaryRepo() [E] [66b3414f] git init tmpBasePath: context canceled
2024/08/07 09:42:47 ...vices/pull/update.go:161:GetDiverging() [E] [66b3414f] CreateTemporaryRepo: context canceled
2024/08/07 09:42:47 ...ers/web/repo/pull.go:1268:CompareAndPullRequestPost() [E] [66b3414f] NewPullRequest: context canceled
2024/08/07 09:42:47 [66b3414f] router: completed POST /ATrain/ATrainProd/compare/QA...feature/US-0000350 for 10.0.0.114:34610, 500 Internal Server Error in 72265.0ms @ repo/pull.go:1143(repo.CompareAndPullRequestPost)

We need your help. We have scoured your Issues lists, Discord, etc, and cannot locate anything that would help us here.

We are looking for anything to explain the context canceled events, and how we can extend "what seems to be" a 60s timeout for the Pull Request creation, or the context lifetime.

Thank you for ALL your help.

Gitea Version

1.20.6

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.40.1

Operating System

alpine 3.18

How are you running Gitea?

In a GKE v1.28+ Cluster

Successfully running gitea since v1.16+

Database

PostgreSQL

KN4CK3R commented 2 months ago

What are the values for git timeout in your ini? https://docs.gitea.com/administration/config-cheat-sheet#git---timeout-settings-gittimeout

zcopado commented 2 months ago

Good question! We have not altered any of the DEFAULTS. So whatever the defaults are, as per the app.example.ini, that's what we have.

And funny enough, none of them, other than GC (garbage collection) elude to a 60s timeout. Hence why we are perplexed.

KN4CK3R commented 2 months ago

Still you could add higher values to your config and test if it works. You can see the current settings on the admin panel /admin/config. Just to be sure there are no other values set at the moment.

zcopado commented 2 months ago

Good point. What timeout would affect this action? None of them seem to have a clear correlation...

KN4CK3R commented 2 months ago

It's the default value.

zcopado commented 2 months ago

Alright... that value is set to 360 seconds (6 minutes). This operation is failing at ~60s... what value should we tweak here?

wxiaoguang commented 2 months ago

Maybe you are using a reverse proxy like nginx and its default timeout is also 60s

zcopado commented 2 months ago

Great point, wxiaoguang... we checked, and our ingress has a 300s timeout. Additionally, can someone with more familiarity with the 'context' in gitea explain what causes the 'context canceled'?

Perhaps it is a timeout of the context, in regard to PullRequest, that is causing this issue.

wxiaoguang commented 2 months ago

context canceled is caused by the "Context is cancelled before the operation finishes". It could occur:

The network connection is closed (because CompareAndPullRequestPost and NewPullRequest use the ctx from the reuqest, aka http web connection). So if the connection is closed, the context is canceled
The deadline (timeout) has been reached. Usually it is controlled by some "timeout" config options. But your log shows CompareAndPullRequestPost() [E] [66b3414f] NewPullRequest: context canceled, that's why I guess it is caused by reverse proxy's read/write timeout -- the default value is 60s.

wxiaoguang commented 2 months ago

I think there is an easy way to to confirm whether it is really caused by the network connection, when the error occurs:

if you see the Gitea's 500 error page, then it is caused by Gitea's internal config option
If you see a non-Gitea page (eg: you see the nginx or reverse proxy's error page or blank page or browser's error page ), then it is caused by network / reverse proxy, not by Gitea.

zcopado commented 2 months ago

Great points @wxiaoguang ... thank you for the insight. We will dig deeper on our end, and report back our findings. Truly appreciate the clarity.

zcopado commented 2 months ago

@wxiaoguang - Your insight was correct. In our ingress configuration (utilizing nginx), there was an implied setting with a default value:

Syntax: | proxy_read_timeout time;
----
proxy_read_timeout 60s; #default
http, server, location

(more info)

We explicit set that directive, with a value of 300s, and Boom! Everything worked as expected.

Closing this Issue. Cheers.

go-gitea / gitea