research-software-reactor / code-testing-for-researchers

Supporting in-cloud testing for researchers
Apache License 2.0

Case study: Source code management problems after adding many more cloud workers to a testing platform #4

Open tmbgreaves opened 5 years ago

tmbgreaves commented 5 years ago

To reduce long build-queue delays, we added Jenkins capacity to burst up to fifty additional cloud workers. We then saw very unreliable connections to our source code management (SCM) platform, in this case GitHub, with failures that appeared to be caused by failing access tokens.

We initially assumed the problem was at the Jenkins end, but later realised that our access tokens were being disabled at the GitHub end for on the order of 48 hours, presumably because the large number of concurrent SCM requests looked like a distributed denial-of-service (DDoS) attack.

This was probably more severe in our research software case, as each job could be running in five or six configurations, and each configuration could perform multiple clone actions. For a branch with an open pull request (PR), the total would double, because both the branch and the merge are built. Before we added cloud capacity, the load would have been spread over a longer time as our smaller worker pool processed the backlog.
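To make the multiplication above concrete, here is a rough back-of-the-envelope sketch. The function names and the example numbers (five queued jobs, three clones per configuration, eight original workers) are hypothetical and not taken from our setup; only the "five or six configurations", the PR doubling, and the fifty extra workers come from the description above. It also assumes each worker builds one configuration at a time.

```python
def total_clone_requests(jobs, configs_per_job, clones_per_config, pull_request=False):
    """Total clone requests needed to clear a backlog of jobs."""
    # A PR'ed branch is built twice: once for the branch, once for the merge.
    builds_per_job = 2 if pull_request else 1
    return jobs * builds_per_job * configs_per_job * clones_per_config


def peak_concurrent_builds(workers, jobs, configs_per_job, pull_request=False):
    """Upper bound on configuration builds running at the same moment.

    Assumes one configuration build per worker at a time (an assumption).
    """
    builds_per_job = 2 if pull_request else 1
    pending = jobs * builds_per_job * configs_per_job
    return min(workers, pending)


if __name__ == "__main__":
    # Hypothetical backlog: 5 PR jobs, 6 configurations each, 3 clones per configuration.
    print(total_clone_requests(5, 6, 3, pull_request=True))
    # Small pool (hypothetical 8 workers) versus the same pool plus 50 cloud workers.
    print(peak_concurrent_builds(8, 5, 6, pull_request=True))
    print(peak_concurrent_builds(58, 5, 6, pull_request=True))
```

The total number of requests is the same in both cases; what changes with the larger pool is how many of them can hit the SCM platform at once.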

Flagging this as a potential issue could help future research groups in three ways: preventing the problem (don't allow so many on-demand instances to start at once), stopping it from affecting multiple projects (give each project its own access token), and raising awareness of the problem to save a long debugging process.
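As an illustration of the first point, the sketch below caps how many clone operations run at once and optionally staggers their start times. It is not how our Jenkins setup was configured: `CloneThrottle`, `max_concurrent`, and `min_interval_s` are hypothetical names, and the throttle only works inside a single process. Across separate cloud workers the same idea would need a shared limit, for example a cap on how many instances the CI system may start at once.

```python
import threading
import time


class CloneThrottle:
    """Limit concurrent clone actions and space out their start times."""

    def __init__(self, max_concurrent, min_interval_s=0.0):
        # At most max_concurrent clone actions may run at the same time.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        # Minimum gap between the start of successive clone actions.
        self._interval = min_interval_s
        self._lock = threading.Lock()
        self._next_start = 0.0

    def run(self, clone_action):
        """Run clone_action (any zero-argument callable) under the throttle."""
        with self._slots:
            # Reserve the next start time while holding the lock, then wait
            # outside the lock so other callers can reserve theirs.
            with self._lock:
                now = time.monotonic()
                wait = max(0.0, self._next_start - now)
                self._next_start = max(now, self._next_start) + self._interval
            if wait > 0:
                time.sleep(wait)
            return clone_action()


if __name__ == "__main__":
    throttle = CloneThrottle(max_concurrent=3, min_interval_s=0.05)

    def fake_clone():
        # Stand-in for a real clone; replace with the actual SCM call.
        time.sleep(0.1)
        return "cloned"

    threads = [threading.Thread(target=throttle.run, args=(fake_clone,)) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Passing the clone step in as a callable keeps the throttle independent of how the clone is actually performed.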