Closed — @asdine closed this issue 2 years ago
Heads up @tsenart - the "team/cloud" label was applied to this issue.
@tsenart Do we want to disable it temporarily or permanently? If it is permanent, there is an ongoing discussion that advocates for the opposite approach.
cc @pecigonzalo
@asdine: The idea is to disable https://buildkite.com/sourcegraph/deploy-sourcegraph-dot-com/settings/schedules/71114888-b8b3-4719-bf87-785ccfd621a0, so that whatever we do manually in production during an incident doesn't get overridden.
@tsenart But do we want to disable it now, or only when we have an incident? While discussing with @pecigonzalo, it appeared that it is preferable to only do that whenever we want to debug an incident; otherwise we would end up with too many manual changes to the deploy-sourcegraph-dot-com config.
Every time we do a proper deploy through a git commit to the deploy-sourcegraph-dot-com repo, we'd erase the manual changes. Are we so concerned with a high frequency of manual changes that we can't rely on the normal deploys from real commits to auto-correct uncommitted changes?
@tsenart I think that is a good point, as we regularly deploy due to changes on sourcegraph/sourcegraph, but we have no control over when those deployments trigger, since they depend on changes to sourcegraph/sourcegraph. Right now we ensure every hour that it is in sync. I had not considered this when chatting with @asdine and yourself before.
Checking the repo history at the moment, there are 5h+ periods without commits, which would mean no deployments, and I have no way right now of reporting when manual changes were done.
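For reference, a gap check like the one described could be sketched roughly as below: feed it commit timestamps (e.g. from `git log --format=%cI` on the repo) and it reports the windows with no commits longer than a threshold. The function name, sample data, and 5h threshold are my own illustration, not anything from the actual tooling.

```python
from datetime import datetime, timedelta

def deployment_gaps(timestamps, threshold=timedelta(hours=5)):
    """Return (start, end) windows between consecutive commits
    that exceed `threshold`, i.e. periods with no deployments."""
    ts = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ts, ts[1:])
        if later - earlier > threshold
    ]

# Sample timestamps, as `git log --format=%cI` might produce them:
commits = [
    datetime.fromisoformat("2021-06-01T09:00:00"),
    datetime.fromisoformat("2021-06-01T10:30:00"),
    datetime.fromisoformat("2021-06-01T17:00:00"),  # 6.5h after the previous one
]
print(deployment_gaps(commits))  # one gap, from 10:30 to 17:00
```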
I could just as well reverse the question: are we going to need urgent manual changes often? (And if we want to disable that schedule permanently, it adds a lot of overhead.) 🤔
Overall, I don't have major concerns with this given that changes to that repo are fairly frequent; maybe we can change it to run @daily at midday as a fallback. I would share this in #dev-chat, cc @distribution, to make sure I'm not missing something here.
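Buildkite schedules take standard cron expressions, so the change being discussed would look roughly like the following (the exact times are assumptions for illustration):

```
# Current schedule: sync/deploy every hour, on the hour
0 * * * *

# Proposed fallback: once a day at midday (UTC)
0 12 * * *
```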
Daily sounds good to me.
I would personally advocate for the hourly deploys (or even more frequently, perhaps?). I'm not able to tell from this thread what the idea behind daily deploys is; is it to give on-call more time to fix production issues before the hourly deploy undoes their changes?
I feel like changing deploy frequency is targeting the wrong problem; rather, our processes, tooling and (more medium/long term) the general culture of observability et al. should be addressed, e.g. make it trivial to freeze new deployments during incidents, and consolidate on a company-wide set of tools/services (Honeycomb, Grafana etc.) that all engineers are well versed/trained in and that are properly utilized.
On top of this, I feel like batching up more commits per deploy would be detrimental to debugging production issues as we scale, increasing the number of commits to narrow down to find the problematic one in the case of incidents, and increasing the length of the feedback loop. I'd much prefer to be able to find out within 30m/1hr whether my changes cause issues at cloud scale, and have a shorter time to find the problematic commit/PR among the x commits merged in that window, rather than waiting 24hrs and then having to comb through y (where x < y) commits. There's potentially a lot of expensive context switching going on there; it'd be much better to still have the mental model and context fresh in my mind.
I'd be happy to try to address any counter-points to this : )
My current article of preference on this is from Charity Majors.
I tend to agree with @Strum355 on this, and this comment from the linked article is what resonates with me the most.
The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master. NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem. THIS is what turns deploys into long, painful marathons.
It looks like we have made the change to do hourly deploys. Can this be closed? (Ignore, no change made, see below )
It was hourly already when this was opened.
I would personally advocate for the hourly deploys (or more every frequently perhaps?). Im not able to tell from this thread what the idea behind daily deploys is, is it to give on-call more time to fix production issues before the hourly deploy undoes changes? [...]
While I generally agree with everything you said here @Strum355, the objective of the hourly deployments is not to deploy a bulk of changes; those are deployed by different events (commits to main). They are still not granular per commit to main in sourcegraph/sourcegraph, as a group of changes can be bundled by Renovate in sourcegraph/deploy-sourcegraph-dot-com.
The hourly runs are there to revert any manual actions done in production and ensure it's always in sync with the actual code.
While the topics are related I believe we are mixing the two things here and it leads to the wrong conclusion.
I think that's the case, yeah; from reading the discussion I was under the impression we were talking about the deployments themselves, my bad : )
I'm gonna change the frequency to daily tomorrow, if anyone is against it please raise your voice! 🙌🏼
Heads up @daxmc99 @JenRed777 @danieldides - the "team/devops" label was applied to this issue.
Steps to do this are in the handbook
Disable auto deploy of latest main in deploy-sourcegraph-dot-com so we can do manual changes and not have CI interfere during incidents.