project-everest / everest-ci

CI scripts for project everest

VSTS does not perform a build for a pull request... #51

Closed msprotz closed 7 years ago

msprotz commented 7 years ago

... resulting in a pull request left in limbo.

From Slack:

https://github.com/FStarLang/FStar/pull/825#issuecomment-275231690

[1:29]
Visual Studio Team Services Build — Waiting for status to be reported

[1:29]
but when I go to VSTS:

[1:29]
Queued or running No builds queued or running at the moment

Any ideas on how we should resolve this? Should we assign someone to watch the pull requests and manually start builds for those that VSTS didn't pick up?

Please note that GitHub is NOT saying: "build not performed". It's saying "Waiting for status to be reported", meaning that the original author of the pull request has no way of knowing that the status will never be reported.

Thanks,

Jonathan

darrenge commented 7 years ago

not sure what happened there ... I see that Nik's run is going so it found it there. Your build #4011 ran about 1:40 so did you restart it?

msprotz commented 7 years ago

I closed the pull request and merged it, but the build never happened on the original pull request... there's no bullet next to the individual commits: compare https://github.com/FStarLang/FStar/pull/823/commits (green check mark, VSTS correctly reported status on this pull request) and https://github.com/FStarLang/FStar/pull/825/commits (none of the commits was ever built by VSTS).

This has been a recurring problem, and there's "folk knowledge" that you often have to push an empty commit to get VSTS to report the status on your github pull request.
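For reference, the empty-commit workaround can be scripted. This is only a rough sketch: the helper names `empty_commit_commands` and `retrigger_ci`, the commit message, and the default remote `origin` are my own assumptions, not part of everest-ci. The only git features it relies on are `git commit --allow-empty` and `git push <remote> HEAD`.

```python
import subprocess


def empty_commit_commands(message="Empty commit to re-trigger CI", remote="origin"):
    """Build the git commands for the empty-commit workaround.

    Hypothetical helper: the message and remote are illustrative defaults.
    """
    # --allow-empty lets git record a commit even though nothing changed.
    commit = ["git", "commit", "--allow-empty", "-m", message]
    # Pushing HEAD updates the remote branch with the same name as the
    # current local branch, which sends a fresh push event to the CI system.
    push = ["git", "push", remote, "HEAD"]
    return [commit, push]


def retrigger_ci(repo_dir=".", **kwargs):
    """Run the commands inside repo_dir; raises CalledProcessError on failure."""
    for cmd in empty_commit_commands(**kwargs):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `retrigger_ci()` from inside the pull request branch pushes one extra commit. It does not explain why the original push was missed; it only gives VSTS a second chance to see an event.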

If we ever manage to understand why VSTS is failing us, that'd be helpful.

darrenge commented 7 years ago

1) I have questions out on forums and to some internal VSO guys to see if they have ways to troubleshoot. The internal VSO PM did not have an answer for this.

2) I noticed that our CI trigger was set to "Batch Changes", which batches check-ins that arrive at the same time into a single CI run. Maybe something went wrong while it was batching. I turned that off, so it will now do a CI run for every push instead of batching. We'll see if that helps.

I will keep this open as a reminder to check with the team on whether this is still happening in future runs.

darrenge commented 7 years ago

Still no luck on finding out why this is happening. Emailed the VSO team PMs and have not heard back from them yet.

darrenge commented 7 years ago

From MSDN forum: "If it is intermittent I would expect that we simply are not getting the web hook event from GitHub or if we are that the data is incomplete.  We have seen this quite a bit recently as we were working on pull requests where we would get an event with no data or without the merge commit."

They are going to look into our build stuff directly.

darrenge commented 7 years ago

Still waiting for VSTS team to do their investigation. I have pinged them.

darrenge commented 7 years ago

VSTS team are seeing errors in their logs for our account regarding service hooks and github. They are investigating those errors to see if there is a fix on their side or on our side.

darrenge commented 7 years ago

Update from VSTS team: For this E2EID we can see the request being throttled on TFS – and not getting delivered to the Service Hook (SH). I can also see that the circuit breaker was triggered because TFS was being spammed by invalid requests from the ONEDRIVE account at that time. I’m still investigating why the deployment level circuit breaker was being triggered by the requests coming from the ONEDRIVE account. I can also see that all the throttling was localized to one AT instance.

I’ve included a query below that shows SH successfully processed GitHub events from this account a minute before and two minutes after this error. There are no errors on the SH side that occurred for this account in the week around this time.

Thanks, Rick

ProductTrace | where E2EID == "ffe73a7b-6628-4ac5-883e-b2c82b3c257a"

Shows this message in TFS at 2017-02-15 23:38:56.1315343: Microsoft.VisualStudio.Services.CircuitBreaker.CircuitBreakerShortCircuitException: Circuit Breaker "HttpClientThrottler-ServiceHooksManagementHttpClient-app.vssh.visualstudio.com"

ActivityLog | where Service == "sh" | where HostId == "1242fa66-b132-4bb9-a608-b6b99e20eebc" | where PreciseTimeStamp > datetime(2017-02-15 23:36:56.1343839) | where PreciseTimeStamp < datetime(2017-02-15 23:40:56.1343839) | where Command == "HooksSvcEvents.CreateEvents"

Shows two successful HooksSvcEvents.CreateEvents commands arriving at Service Hooks for this account (TFS delivers the GitHub event payload to SH for processing).

darrenge commented 7 years ago

Got more info on it. Summary is that the VSTS guys found an account that was spamming VSTS with invalid requests to the point where a "circuit breaker" tripped, causing all incoming requests to be dropped (which is why our GitHub push was overlooked). It might have been caused by a team that is working on tooling to monitor internal VSTS projects, and it just went too far.

Details from VSTS expert: I don't have a definitive answer as to why the onedrive account was spamming VSTS with invalid requests, but I do know that those invalid requests were occurring at such a high rate that it tripped a circuit breaker in one of the VSTS application tier instances (AT2). All requests arriving at that application tier while the breaker was open were dropped. The request to publish the GitHub push event within VSTS was one of the requests that were rejected, which resulted in a build not getting triggered for that push. I checked with the framework team to understand if the circuit breaker should have impacted requests from other accounts and I was told it is designed to reject all incoming requests, so it was working as designed.

I've been in touch with the ASG Engineering Intelligence Metrics team regarding a different scenario where VSTS was being spammed by invalid requests from the msazure and msdata accounts (in this scenario the failure rate was not high enough to trip a breaker). That team is working on tooling to monitor internal Microsoft projects in VSTS and acknowledged an issue with the tooling they were developing in that scenario. I suspect, but haven't verified, that the failed requests from the onedrive account are part of the same tooling effort.
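To make the failure mode concrete, here is a minimal sketch of a circuit breaker that is shared by every account on one application tier, as described above. It is not the VSTS implementation: the class name, the thresholds, and the three outcome strings are all assumptions chosen for illustration. The point it shows is that once one account's invalid requests open the breaker, a valid request from a different account is dropped too.

```python
from collections import deque


class CircuitBreaker:
    """Toy breaker shared by all accounts on one tier (illustrative only)."""

    def __init__(self, max_failures, window_s, cooldown_s):
        self.max_failures = max_failures  # failures in the window that trip the breaker
        self.window_s = window_s          # length of the sliding window, in seconds
        self.cooldown_s = cooldown_s      # how long the breaker stays open
        self._failures = deque()          # timestamps of recent invalid requests
        self._open_until = None

    def is_open(self, now):
        return self._open_until is not None and now < self._open_until

    def handle(self, account, valid, now):
        """Return 'dropped', 'rejected', or 'delivered' for one request."""
        if self.is_open(now):
            # While open, every request is dropped, whichever account sent it.
            return "dropped"
        if valid:
            return "delivered"
        # Record the failure and forget failures that fell out of the window.
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._open_until = now + self.cooldown_s
            self._failures.clear()
        return "rejected"


breaker = CircuitBreaker(max_failures=3, window_s=1.0, cooldown_s=60.0)
# A noisy account sends three invalid requests in quick succession.
spam = [breaker.handle("noisy-account", valid=False, now=t) for t in (0.0, 0.1, 0.2)]
# Our valid push event arrives while the breaker is open and is lost.
push = breaker.handle("our-account", valid=True, now=0.3)
# After the cooldown, the same kind of request goes through again.
later = breaker.handle("our-account", valid=True, now=61.0)
```

Timestamps are passed in explicitly so the behaviour is deterministic. A real breaker would read a clock and would typically track failure rate per downstream dependency rather than a simple count.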

darrenge commented 7 years ago

There isn't really anything we can do about this, as it is possibly being caused by an internal VSTS team. With that known, I will close out this Issue as an "out of our control environment issue".