sourcegraph / sourcegraph-public-snapshot


Distribution 3.21 Tracking issue #13675

Closed pecigonzalo closed 4 years ago

pecigonzalo commented 4 years ago

Plan

Support new and existing deployments

This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.

Support Security in deploying a log analysis tool

Security is planning to deploy a centralized logging and analysis system and will require our assistance to set up and review this new infrastructure.

Implement 2+ sourcegraph.com services using dhall

sourcegraph.com sees the highest number of Kubernetes changes out of all of our deployments + deploy-sourcegraph. Scoping to a single component limits the customizations that we need to implement and makes it easier to onboard other engineers.

Releases are created in a single day

We have a goal of reducing the time it takes to create releases. The current several-day system has encouraged us to view releases as “baked” rather than “snapshots of the main branch”, leading to situations where main is broken and we have to fix it retroactively or add last-minute features.

Split infrastructure into separate GCP projects

GCP uses project-wide roles and permissions. To ensure resources are isolated from each other and to reduce the blast radius of changes, we should split resources into separate projects. Additionally, this will give us more insight into our infrastructure costs, which will become more important as we grow and expand our infrastructure.

Availability

Period is from September 20th to October 19th (21 working days). Please write the days you won't be working and the number of working days for the period.
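As a quick sanity check on the 21-day figure, here is a minimal sketch that counts weekdays in the period and subtracts any days off. It assumes the year is 2020 (the thread does not state it) and ignores public holidays; the `working_days` helper and the example day off are hypothetical.

```python
from datetime import date, timedelta


def working_days(start, end, days_off=()):
    """Count Mon-Fri days in [start, end], excluding any listed days off."""
    off = set(days_off)
    count = 0
    day = start
    while day <= end:
        # weekday() is 0 for Monday through 6 for Sunday
        if day.weekday() < 5 and day not in off:
            count += 1
        day += timedelta(days=1)
    return count


# Assumption: the period falls in 2020; public holidays are not considered.
period_start = date(2020, 9, 20)
period_end = date(2020, 10, 19)

print(working_days(period_start, period_end))  # 21

# Hypothetical example: one weekday taken off during the period.
print(working_days(period_start, period_end, [date(2020, 10, 12)]))  # 20
```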

Tracked issues

@unassigned: 5.00d

Completed: 5.00d

@bobheadxi: 8.50d

Completed: 8.50d

@davejrt

Completed

@daxmc99: 4.00d

@efritz

@ggilmore

Completed

@pecigonzalo: 23.00d

Completed: 23.00d

@slimsag: 15.00d

Completed: 14.50d

@uwedeportivo: 9.50d

Completed: 1.00d

Legend

davejrt commented 4 years ago

Last week Finalizing the work on e2e tests in Vagrant. I had this about 99% working with what seemed like a minor issue; now I've run into something else that I need to track down, where I've introduced a new bug. A classic case of "it was working on my machine". Also lots of cleanup in the wake of secrets being exposed.

Next week Finalizing the e2e tests. I discussed with Gonza some thoughts I had about our release process and I plan to document those based on my experience in 3.20. I'd also like to clarify a few things about how we prevent the mistake I made from happening again, which I plan to document and circulate with the appropriate people.

slimsag commented 4 years ago

This week

This week I spent most of my time providing support internally and to customers, much more than usual. I did not make much headway against my planned work, but did add issues extensively for everything that came up to this milestone.

For customers, I provided extensive resource allocation advice to two major customers, and followed up extensively on ~7-8 more medium-sized customer issues before ultimately passing them off to other individuals or teams in order to reduce the number assigned to me.

Internally, I created a dev/testing managed instance and shared knowledge of them with the rest of the team in the form of updated docs, a recorded screencast, and improved tooling. I investigated ops issues with sourcegraph.com and multiple dev deployments with the team.

1:1s I had ran much longer than usual, leading to longer-form ongoing conversations. I also wrote a high-level progress summary on the Dhall work.

Next week:

I am hoping to be more heads-down and make substantial headway against my planned work, but acknowledge I have many more extensive conversations ahead of me which will be time consuming. Focus is key.

bobheadxi commented 4 years ago

Last week

This was an extra short week for me because I took one of the mental health day things. I got k8s.sgdev.org running smoothly, and helped a bit with migrating campaigns over from the old deployment. During this I found that the deploy-sourcegraph overlay for namespaces wasn't set up for cAdvisor, so I made a PR to add one and try to improve the docs around that a bit. Also found and fixed a bug in prom-wrapper that was causing custom alert usernames to not be set correctly.

This week

I'm a little behind on getting started with 3.21 stuff so I'll be spending extra time this week to make up for that. I'll also ping #dev-chat to ask for objections about spinning down the old k8s.sgdev.org and go ahead and do that.

pecigonzalo commented 4 years ago

Last week I managed to move the CI and dogfood-server clusters to separate GCP projects. The dogfood-server cluster will reuse the dogfood-k8s GKE cluster, as it's a single container.

This week I'll work on cleaning up and deleting leftover resources from the migration, and start the work to remove the bigdata cluster.

bobheadxi commented 4 years ago

This week

Deployed demo.sourcegraph.com - last step to this is awaiting #ce followup, and made some docs updates for managed instances while at it. Opened up a couple of PRs related to 1-day releases and reducing the steps required there. Discussed the future of Cloud deployment in this thread and RFC 239.

Next week

Find out who to ping for review for release-tool PRs (would still like one for https://github.com/sourcegraph/sourcegraph/pull/14240) and use that to start working out the rest of the tasks I've picked up for the 1-day releases project. Given the frequency of requests for clarification regarding Cloud deployments, would also like to help @daxmc99 if possible with polishing up RFC 239.

davejrt commented 4 years ago

Last week

e2e is now running in a non-blocking capacity on main, which I hope is now just a case of ironing out the last few bugs with some help from web (I am confident in the infra and base image setup now). Helping out with a security scare, and the rest of my time was spent helping out on a big customer issue. Also a quick quality of life PR to manage AWS service accounts with Terraform. A bit of other troubleshooting here and there.

Next week

Finish e2e with the help of the web team. I am also going to sync with uwe around regression testing to see how different they are, and what effort is required to get that into a pipeline as well. I predict some significant time spent helping on customer issues too.

slimsag commented 4 years ago

This week

Was sick from Sat <-> Thu. On Friday I spent 90% of my time catching up on things, and did other minor work like adjusting 1password permissions for managed instances, helping to debug one customer issue, and investigating critical alerts at https://app.hubspot.com/contacts/2762526/company/407948923/

Next week

Hoping to get to what I did not this week, i.e. heads-down on my planned work with >=50% of my time.

uwedeportivo commented 4 years ago

this week

one quality of life issue (https://github.com/sourcegraph/sourcegraph/issues/13191) done, one dhall issue almost done (https://github.com/sourcegraph/sourcegraph/issues/14133), pitched in on token rotation and had debug sessions for customer issue

next week

all the stars will align and i will work on dhall code

ggilmore commented 4 years ago

this week:

next week:

pecigonzalo commented 4 years ago

Last week

I mostly worked on the GCP Split project: deleted BigCluster and cleaned up disks, deleted Megakube, and moved Tooling resources to the Dogfood cluster as they are used there (Phabricator, GHE, Bitbucket, Gitolite). This included porting a bunch of infrastructure to Terraform. I have also been supporting and debugging https://github.com/sourcegraph/customer/issues/105 with @unknwon but we are currently waiting on the customer.

This week

Finish the Tooling cluster/resources move cleanup and update any relevant documentation. I need to switch back to updating our long-term goals, integrating the roadmap provided by Stephen into our goals and finishing the Distribution growth PR.

slimsag commented 4 years ago

This week

I played catch-up on PRs, reviews, etc. after being out sick last week. I followed up on minor tasks, like setting up demo.sourcegraph.com with Robert and restructuring our 1password vaults. I had lots of 1:1 / career growth discussions, etc. I then began to hammer out my actual planned work: removing non-OSS syntax highlighting languages, creating a super extensive/tedious license report on syntect_server, and dealing with some update pains/segfaults there. To finish off my week, I took a deep dive into the QA (formerly "e2e regression") test suite and pulled in others to help address 3 release blockers I identified in the process.

Next week

We are seeing lots of QA test suite failures, some of which look like real release-blocking regressions. I will be isolating those, filing issues, and pulling in more people to fix them. At the same time, I will be focused on 3.22 planning and working with Dave and Uwe to improve QA test suite reliability.

bobheadxi commented 4 years ago

Last week

Some small contributions to the CNCF repopage project: blackbox, CSS change to the logo. Landed improvements to changelog automation, deploy-sourcegraph release automation, and general release steps reductions and dry-runs for the release tool. Added support for regex silencing in observability.silenceAlerts. Investigated some k8s.sgdev.org prometheus failures and made a handbook update. Switched the default for NaN values in alerts to alleviate false alerts that have been firing on low-traffic instances like k8s.sgdev.org (and some customer test instances)
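To illustrate the idea behind regex silencing, here is a minimal sketch that treats each silence entry as a regular expression and matches it against an alert's name. It is not the actual Sourcegraph implementation: the `is_silenced` helper, the alert names, and the choice of full-string matching are all assumptions made for illustration.

```python
import re
from typing import Iterable


def is_silenced(alert_name: str, silence_patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches the whole alert name.

    Assumption: patterns must match the full name (re.fullmatch); the real
    matching semantics are not described in this thread.
    """
    return any(re.fullmatch(pattern, alert_name) for pattern in silence_patterns)


# Hypothetical patterns and alert names, for illustration only.
patterns = [r"warning_gitserver_.*", r"critical_frontend_down"]

print(is_silenced("warning_gitserver_disk_space", patterns))  # True
print(is_silenced("critical_gitserver_disk_space", patterns))  # False
```

Full-string matching keeps a plain alert name working as an exact match, so existing non-regex entries would behave as before under this assumption.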

This week

Main thing I have in mind this week is to keep an eye on the release process and see if any of the changes need clarification/improvement.

uwedeportivo commented 4 years ago

last week

this week

pecigonzalo commented 4 years ago

@uwedeportivo could you add which components?

davejrt commented 4 years ago

last week

this week

Hopefully customer issues will settle down and we can focus on internal issues. Last week uwe ran us through e2e/regression testing and I gained a lot of insight into what is infra related in the failures vs the tests themselves. This week the plan will be to get as much running as we can, then identify which issues are for other teams to fix.

pecigonzalo commented 4 years ago

Last Week Finalized the GCP Split project; all resources are now in their appropriate projects and we closed https://github.com/sourcegraph/customer/issues/105.

This Week Focus on planning 3.22 and working with CE regarding incident management. I'll also be taking a day off, which I was going to take on the 12th but did not manage to. I'll continue to do small resource cleanups as I find them in GCP and AWS.

ggilmore commented 4 years ago

Last week:

Next week:

slimsag commented 4 years ago

This week

A lot of conversations: changing my direction/focus, interviewing candidates, syncing with https://app.hubspot.com/contacts/2762526/company/407948923/ (alerts, upgrades, etc.) and https://app.hubspot.com/contacts/2762526/company/557692805/ (search, stability), syncing with Christina about state of product & opportunities.

A fair amount of time spent heads-down trying to debug/improve QA tests, but with few results. It's been hard for me to make progress here with lots of interruptions throughout my day and the test suite itself being so dang confusing (but also quite extensive). I caught up with Uwe and did some pairing up on it with him.

Wanting to feel as though I made some progress other than just conversations, I switched away from QA tests mid-Thur and put my thoughts/questions around Cloud on paper, documented when to introduce new services, and merged some updates from Rob and Rijnard to improve syntax highlighting colors + add back GraphQL support.

Next week

Focus, get more heads-down time on QA tests and push the release through ASAP with Uwe and Dave.

bobheadxi commented 4 years ago

This week

Some last minute tweaks and adjustments to release process for 3.21 (both on the release tool, and the checklist), debugged the deploy-sourcegraph CI pipeline, reviewed some monitoring PRs after noticing some flakey critical alerts on k8s.sgdev.org

Next week

Keep tabs on release process, start exploring other parts of the release pipeline (e2e, etc) and the possibilities there. Will also be exploring our options with the upcoming deployment UX project meeting. Am also adjusting my work schedule a bit, but no major changes to meeting availability for the most part.

uwedeportivo commented 4 years ago

this week

Chased down a couple of issues with a big customer (https://github.com/sourcegraph/customer/issues/111, disk space distribution of index space, https://github.com/sourcegraph/customer/issues/116). Pitched in on release process by running regression tests and fixing them up. My Dhall language proposal hit a road block (https://github.com/dhall-lang/dhall-lang/issues/1081) :-). Still working on Dhall components, progress has not been as fast as I would like.

next week

Getting 3.21 out the door is priority for the beginning of the week. Afterwards I will probably go on vacation.

ggilmore commented 4 years ago

Last week:

This week:

pecigonzalo commented 4 years ago

Last week Most of last week has been reviewing RFCs (RFC-239: QA Environments, RFC-245: Centralized logging, RFC-249: Secret Management), PRs and other Slack conversations. The rest of it was planning the next sprint and how to track it with our current workflow.

This week I did not finish my review comments for RFC-239 so I would like to finish those and the plan for the next sprint. I will also sync about the delivery pipeline UX and create a goal for it.

davejrt commented 4 years ago

Last week

Fighting fires with uwe on a large customer (sourcegraph/customer#111) and really battling with regression tests. Uwe and Stephen have been a big help in digging through some of this with me. I have the infrastructure in a good working state, with automation now to set up the sourcegraph instance prior to running the tests. I am still confused as to why things don't work consistently between environments, and why some tests need to be run twice in order to work.

Next week

Top priority will be to get 3.21 released, however the regression tests are run (local or in CI). After that, a write-up that really identifies where the gaps are, what is broken and what can be automated.

daxmc99 commented 4 years ago

Last Week

Vacation 🌴 🚵‍♂️

This Week

Finish up remaining Cloud SQL work https://github.com/sourcegraph/sourcegraph/issues/11496, investigate deployment pipeline UX and report back to Cloud team with our decisions.