sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Distribution: 3.19 Tracking issue #11954

Closed pecigonzalo closed 4 years ago

pecigonzalo commented 4 years ago

Plan

Support new and existing deployments

This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.

Reduce upgrade overhead

Upgrading Kubernetes deployments requires customers to spend a lot of engineering time converging our released Kubernetes manifests with their fork, as documented in RFC-141.

We will finish the Dhall investigation and make a decision by the end of 3.19.

Increase our e2e test frequency

To increase our release cadence, we need to be able to run e2e tests more frequently. This is currently not possible as our CI infrastructure causes tests to be unreliable.

Support per-team alerts

To allow teams to support and monitor the services and features they ship, we need to be able to route alerts to the relevant teams as described in RFC-189.
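As a rough illustration of the routing idea (not the RFC-189 design itself), the sketch below picks a receiver from an alert's labels. The `owner` and `level` label names, the team names, and the receiver names are all hypothetical and were chosen only for this example.

```python
from typing import Mapping, Tuple

# Hypothetical routing table: (owner label, severity level) -> receiver name.
ROUTES: Mapping[Tuple[str, str], str] = {
    ("search", "critical"): "search-pager",
    ("search", "warning"): "search-chat",
    ("distribution", "critical"): "distribution-pager",
}

# Alerts with no matching owner/level fall back to a shared receiver.
FALLBACK_RECEIVER = "default-chat"


def route_alert(labels: Mapping[str, str]) -> str:
    """Return the receiver for an alert based on its owner and level labels."""
    key = (labels.get("owner", ""), labels.get("level", ""))
    return ROUTES.get(key, FALLBACK_RECEIVER)


if __name__ == "__main__":
    print(route_alert({"owner": "search", "level": "critical"}))
    print(route_alert({"owner": "unknown-team", "level": "critical"}))
```

The fallback receiver is a design choice in this sketch: an alert without a recognized owner still reaches someone instead of being dropped.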

Availability

Period is from July 20th to August 19th (23 working days). Please write the days you won't be working and the number of working days for the period.

Workload

@bobheadxi: 2.50d

@davejrt

@daxmc99

@efritz

@ggilmore

@keegancsmith

@pecigonzalo

@slimsag: 7.50d

@uwedeportivo: 8.00d

Legend

pecigonzalo commented 4 years ago

cc/ @christinaforney @dadlerj

bobheadxi commented 4 years ago

this week (week of jul 13)

Landed a range of improvements to alerting for 3.18 (alert silencing, notifications, unifying some out-of-band alerts). Worked on justifying and fleshing out some details for supporting alert ownership for RFC189, planned tasks for 3.19, and am currently working on documenting the what, how, and why of our monitoring stack (https://github.com/sourcegraph/about/pull/1221). Also paired with @keegancsmith on potential improvements to one of our alerts and the generator as a whole.

next week

Hopefully resolve https://github.com/sourcegraph/sourcegraph/issues/12158, since it is currently one of the more frequent critical alerts that isn't entirely actionable/valid on Cloud, and wrap up https://github.com/sourcegraph/sourcegraph/issues/5370 by setting up a time to take on-call and set up OpsGenie. Also finalize how we are supporting alert ownership for RFC189 and finish documenting our monitoring stack.

davejrt commented 4 years ago

Week July 13

Landing blackbox exporter into sourcegraph.com environment: sourcegraph/deploy-sourcegraph-dot-com#2984. Lots of time spent on calls with $CUSTOMER working through issues with their deployment related to indexed search starting. Also implemented the fix suggested by @pecigonzalo regarding pv/pvc and their respecting claimref from namespaces.

Week July 20

Continuing to work through $CUSTOMER issues and beginning on #12101, in addition to fine-tuning some of the blackbox alerts to ensure they reflect exactly what we need. These seemed to function well, though, in light of last week's Cloudflare outage. Continuing to work with $CUSTOMER to resolve any further issues.

pecigonzalo commented 4 years ago

Week July 13

I worked on planning 3.19 which included starting this experiment for tracking project progress. I have also started RFC-202 for standardizing configuration across our services.

Week July 20

My focus this week will be setting our team goals and planning the retrospective for 3.18. I'll also start experimenting with using projects to track unplanned tasks and close old backlog items/projects.

uwedeportivo commented 4 years ago

pretend this is last friday:

this week:

mainly release with dax and i finished all my dhall assignments from geoffrey (i used geoffrey's nicely done framework to capture in dhall the customizations we need to support dog-food k8s)

next week:

3.19 planning, some small remainders from release process, continue with geoffrey on dhall (we will probably start converting dog-food and start capturing all the kustomize customizations and twitter and apple customization)

ggilmore commented 4 years ago

This past week, I lost some time due to medical issues and follow up. The rest of the week was spent catching up on rescheduled meetings, more medical follow up, following up with @uwe's good work on deploy-sourcegraph-dhall, and 3.19 planning.

This week:

slimsag commented 4 years ago

Copying my update over from the 3.18 tracking issue for posterity / visibility:

Last week

I spent ~60% of my time on https://github.com/sourcegraph/customer/issues/62#issuecomment-661266320 and made great progress but with many context switches / interruptions throughout. I was sidetracked regularly by:

This week

I intend to:

pecigonzalo commented 4 years ago

Priorities update

As discussed with CE, https://github.com/sourcegraph/customer/issues/65 is now our top priority.

davejrt commented 4 years ago

Week July 20 - Last week

ggilmore commented 4 years ago

This week:

Next week: I should have fewer medical appointments, and I'll continue dedicating my time to investigating https://github.com/sourcegraph/customer/issues/65 until that's solved

uwedeportivo commented 4 years ago

Week July 20 - Last week

Week July 27 - Next week

bobheadxi commented 4 years ago

Week of July 20

Added the first step towards supporting per-team alerts, worked on migrating custom alerts into our generator, investigated some provisioning issues on dot-com, investigated collecting IO metrics to support https://github.com/sourcegraph/customer/issues/65, and added support for the "for:" duration in our alerts.
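For readers unfamiliar with a "for:" duration on an alert, the sketch below shows the general idea as used in Prometheus-style alerting: the alert only fires once its condition has held continuously for the configured duration, and it is pending until then. The class and field names are hypothetical, and this is a simplified model rather than the generator's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert whose condition must hold for `for_seconds` before firing."""

    for_seconds: float
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, condition: bool, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    state = AlertState(for_seconds=300)
    for t, cond in [(0, True), (150, True), (300, True), (310, False), (320, True)]:
        print(t, state.evaluate(cond, now=t))
```

A brief spike that clears before the duration elapses never leaves the pending state, which is why this setting tends to reduce noisy alerts.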

Next week

Wrap up outstanding PRs and implement support for per-team alerts + dogfood it alongside the migrated custom alerts

slimsag commented 4 years ago

Last week

I spent the vast majority of my time helping with https://github.com/sourcegraph/customer/issues/65 and nearly completed https://github.com/sourcegraph/customer/issues/62 (literally only a few hours away).

This week

I hope to finish https://github.com/sourcegraph/customer/issues/62 , the distribution roadmap, the monitoring architecture documentation, and catch up with Robert on monitoring.

pecigonzalo commented 4 years ago

Week July 20

Last week's focus was working with the team to set our team goals. The test around using GitHub projects for tracking progress seems to be working and I'll continue with this during the rest of the iteration. I have also been talking with Chayim about the secrets loading implementation.

Week July 27

I'll continue to focus on setting our team goals; we have settled on them but are still working out the details. I will also try to finalize RFC-202 and the review of RFC-199.

Team update

Issue https://github.com/sourcegraph/customer/issues/65 has been resolved, but our focus for this week remains on the sub-issues created by it: https://github.com/sourcegraph/customer/issues/69 and https://github.com/sourcegraph/customer/issues/70.

slimsag commented 4 years ago

This week

I helped $CUSTOMER with Uwe and Dave to ensure their demo went smoothly, completely finished work on managed instances and am ready to ship it to $MY_CUSTOMER_0, caught up with Robert on monitoring and next steps there, shipped Sourcegraph to $MY_CUSTOMER_1, began data collection for https://github.com/sourcegraph/customer/issues/71 - and started looking into release automation for deploy-sourcegraph-docker as well as brainstorming production readiness ideas.

I did not complete/merge the distribution roadmap or monitoring architecture documentation, but still intend to do so.

Next week

Review https://github.com/sourcegraph/sourcegraph/pull/12581 - merge the distribution roadmap, architecture docs, get feedback on production readiness and move forward on that, continue looking into monitoring and release automation.

bobheadxi commented 4 years ago

This week

I've been working on a range of improvements and polish to alerting (formatting, bug, regression, migration, testing, etc).

I've also landed the core functionality for RFC 189's per-team alerts (via routing implementation) and have prepared a pull request to configure team-based paging. The final pieces of this are also up for review: migrate the rest of our out-of-band alerts and drop our custom alerting.

I commented on this last week, but I've hit a wall with cadvisor IO metrics (https://github.com/sourcegraph/sourcegraph/issues/12163) and don't really see a way forward - my update on that issue includes possible alternatives

Next week

Land everything related to per-team alerts and work with each team to get rotations and alerts set up (per @nicksnyder's request). I imagine there will be problematic alerts / other issues, and will likely focus on follow-up work. This will (finally!) close out dogfooding.

ggilmore commented 4 years ago

This week:

davejrt commented 4 years ago

THIS WEEK

I'd say a fairly even portion of my time was spent between working on tasks for $customer and replicating their setup internally. I haven't gotten back around to do anything with Dhall which is unfortunate but I hope to make more progress on that next week. I made a decent start on #12101 and a working agent that aside from needing some fine tuning will be good to start using and iterate on.

I did take a cursory glance at removing alertmanager but Robert kindly informed me that this was wrapped up with a bunch of other PRs he'll land next week, so it's in his more than capable hands for now.

NEXT WEEK

I'll have a PR ready to land for the baremetal CI agent which will be ready to start running jobs. I'm going to resync with geoffrey and/or uwe re dhall and make an effort to pick that back up again.

uwedeportivo commented 4 years ago

this week:

tinkered together with geoffrey on dhall (https://github.com/sourcegraph/deploy-sourcegraph-dhall/tree/the_rest_generate). we're in contact with the dhall core devs about some issues we hit and we also asked them for advice on how to set up the dhall interface for our customers (https://github.com/dhall-lang/dhall-haskell/issues/1960#issuecomment-667314221). we have tried a couple of things and we'll settle on something for the POC for the dogfood cluster. we should be able to tie things up for the POC evaluation sometime next week.

did some debugging, overlay creation and general support for our bigdata $customer.

next week:

i want us to finish up the dhall POC and do some evaluation of suitability and if/how we proceed. i'm leaning towards proceeding. i think dhall advantages outweigh some of the difficulties. but i don't want to inject priors into the evaluation process so disregard my last sentence :-)

pecigonzalo commented 4 years ago

Last week

We finished our initial team goals, and I also finalized the review of RFC-199. We will test using microVMs with ignite for a v0 and will have to review the outcome of that testing before we can move to v1 and define how we deploy/support/HA/etc.

This week

We will kick off our 360 review cycle and I will focus on that. I'll be working on the roadmap and a product readiness document with Stephen and will pair with Geoffrey to get more familiar with our Dhall implementation. I have not been able to progress RFC-202 and, if time allows, I would like to finish that up.

Team update

The high priority sourcegraph/customer#69 from last week has been resolved, and we will return to our tracking issue priorities. sourcegraph/customer#70 remains unclear as we can't reproduce it consistently and has been deprioritized for the moment.

ggilmore commented 4 years ago

This week:

slimsag commented 4 years ago

This week:

I spent most of my time, maybe 70% discussing things (distribution things, CE things, security things, code intel things, and more.) I spent 10% of my time helping customers, and 10% thinking about how to onboard CE folks. I made slight progress on release automation, but no progress on the other things I set out to do this week in my last update.

bobheadxi commented 4 years ago

This week

I followed up on last week's update and have finalized most of the work for per-team alerts, and have been dogfooding it (to myself). I have pinged each team to set up on-call rotations so that we can switch over completely to the new alerting stack and remove our old alerting by end of 3.19 or early 3.20. I have also made a range of improvements to our alerting, including: making our provisioning alerts more informative, converting some of our hard-threshold alerts to be ratios, and improving our alert solutions documentation. I also looked into adjusting resources for some of our services that seem like they could use it.
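To illustrate the difference between a hard-threshold alert and a ratio-based one, here is a minimal sketch. The function names and the threshold values (10 errors, 5% error ratio) are hypothetical and not taken from the actual alert definitions.

```python
def hard_threshold_alert(errors: int, max_errors: int = 10) -> bool:
    """Fire when the absolute number of errors exceeds a fixed limit."""
    return errors > max_errors


def ratio_alert(errors: int, total: int, max_ratio: float = 0.05) -> bool:
    """Fire when errors make up more than `max_ratio` of all requests."""
    if total == 0:
        # No traffic means there is no ratio to evaluate; treat as healthy.
        return False
    return errors / total > max_ratio


if __name__ == "__main__":
    # 20 errors out of 10,000 requests: noisy under a hard threshold,
    # quiet under a ratio.
    print(hard_threshold_alert(20), ratio_alert(20, 10_000))
    # 20 errors out of 100 requests: both fire.
    print(hard_threshold_alert(20), ratio_alert(20, 100))
```

The ratio form scales with traffic, so a busy instance does not page on an error count that is a negligible share of its requests.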

Next week

See if I can help others wrap up any outstanding tasks for this iteration, maybe work on converting more of our noisy alerts to be ratio-based, and start looking at what I can do in 3.20

uwedeportivo commented 4 years ago

this week:

did some dhall work experimenting with unit tests and some more encompassing customizations that span more than one resource. did some initial bootstrap for marek for the 3.19 release. some customer work with cap1.

pecigonzalo commented 4 years ago

Last week

Kicked off the 360 review cycle and I was focused on that. I paired with Geoffrey to get more familiar with our Dhall implementation and architecture. I met with Eric to talk about running code intel on firecracker VMs and how we would deploy those.

This week

I'll be working mainly on our 360 reviews and 3.20 planning. I'd also like to spend time on Dhall and do more testing now that I understand its structure better. I'm also working on improving our incidents pipeline so it's easier to track the status and number of active incidents.

Team update

Given the number of customer issues, we are over our original estimates for support time, which will likely impact "Increase our e2e test frequency" and potentially "Reduce upgrade overhead", although we are looking to make a decision at the end of the sprint anyway. We started planning 3.20 last week and should finish by the end of the week.

marekweb commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.19 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

slimsag commented 4 years ago

This week:

I was out Mon and half of Tue due to a family emergency. Tue was spent getting caught up. I discussed Dhall with the team and helped to determine next steps, interviewed two candidates (eng and CE), and helped customers (#85 (completed), #62 (completed), #74 (in progress), #12999 (in progress), #73 (in progress)). I reduced the reliance on me for shipping Sourcegraph to some customers, and am running behind on other areas (support, distribution roadmap, monitoring arch docs, service arch docs, etc.) due to the life incident I had earlier in the week.

ggilmore commented 4 years ago

This week, I got catfood.sgdev.org in a demo-able state for the Dhall PoC (along with the accompanying https://github.com/sourcegraph/deploy-sourcegraph-dhall changes). Following the Wednesday demo, the @sourcegraph/distribution team has enough confidence to move forward with the dhall implementation. My next priorities are to sync with the Dhall maintainers and develop the Dhall roadmap/tracking issue

pecigonzalo commented 4 years ago

Last Week

I have mostly been working on the 3.20 plan and 360 reviews. I worked on a couple of incidents as well (hostError on K8s nodes, https://github.com/sourcegraph/infrastructure/pull/2060, https://github.com/sourcegraph/customer/issues/75) and have been actively reviewing other Kubernetes notifications to identify fixes (e.g. https://github.com/sourcegraph/deploy-sourcegraph/pull/816).

This Week

My priority for this week is finalizing the plans for 3.20 and meeting with the team to close the reviews. Additionally, I'll continue to work through our alerts and find action items from them.

bobheadxi commented 4 years ago

last week

A lot of misc. work and doing investigations around alerts discussions in Slack (often around alerts frequencies). Fiddled around with Dhall, and made other misc. improvements to tooling (docsite, license_finder). Brainstormed a potential idea for next iteration around improving release dogfooding / our deploy-sourcegraph forks

next week

Stay up to date on 3.20 plans and see how I can pitch in. Probably follow up on teams setting up their opsgenie alerts, and continue following up on issues with alerts that get raised

davejrt commented 4 years ago

last week

Landed this PR which provides the code to build a buildkite-agent running on GCP, as well as the terraform code to deploy the autoscaling group. Also spent a number of sessions with Kimberly helping her get Sourcegraph running locally in her WSL2 environment.

this week

Stephen pointed out the performance issues with the buildkite-agents, especially in relation to booting a vagrant box as well as pulling docker images. I suspect this is related to the nested virt and some VirtualBox networking that will need to be tuned. Captured in #12996.

slimsag commented 4 years ago

This week:

I made the pure-docker vagrant tests reliable and used them + the new documentation to ship v3.18.0 to https://app.hubspot.com/contacts/2762526/company/407948923/ further reducing their upgrade delay and requirement on myself. These also played a fundamental role in identifying a release blocker for 3.19.0 later in the week.

I spent some time helping a customer, fixing a bug for another customer, and a fair amount of time in meetings and interviewing candidates.

All of Thursday was spent helping @uwedeportivo with the 3.19.0 release, which was plagued by a number of bad issues. We managed to fix them in time to release on the 20th as promised. I also sync'd with a customer and did some live debugging to resolve a major issue for them.

On Friday I addressed some tech debt, helped Rijnard with testing some structural search ideas, sync'd with a customer and responded to a customer P0

bobheadxi commented 4 years ago

this week

some debug sessions, follow-ups on per-team alerting and general improvements on that front (docs and tweaks). did some planning for k8s dogfooding in 3.20, and made a proof-of-concept for what it might look like

next week

wrap up the dogfooding automation for deploy-sourcegraph => dogfood, and work with someone to get it deployed proper

davejrt commented 4 years ago

Last week 17 July

I'd estimate 90% of my time was spent debugging the GCP/vagrant/docker e2e testing issues #12996. PR is up for review here. Doesn't appear to be directly related to networking as first thought, and a high CPU count has an impact based on my testing

This week 24 July

Iron out any remaining issues with this portion of the e2e testing and sync with Uwe around improving the release process. I need to look more closely at our autoscaling capabilities in GCP to ensure we're getting maximum value and reliability out of the config.

pecigonzalo commented 4 years ago

Last Week

Last week was focused on closing 3.19, planning 3.20 and closing our 360 review packages. I was able to make some progress on the Prometheus issue and sent a fix to the customer (thanks @bobheadxi and @uwe for the help here). I also started organizing our GCP Projects so it's easier to set permissions by using groups instead of users, and started merging deploy-sourcegraph into deploy-sourcegraph-dot-com, which will continue this week.

This Week

I'll work on adding our new on-call rotation to OpsGenie and sending a PR with the updated information to help finish per-team alerts, and get some help to finalize the deploy-sourcegraph-dot-com merge. @bobheadxi and I will sync later this week so I can start helping on Dogfood Kubernetes deployments.

uwedeportivo commented 4 years ago

last week

fighting really hard to get 3.19 out the door. we had several issues: it started with the segfault in redis-cache in single-server, DNS issues in single-server, regression tests not logging in, regression tests referencing wrong test repos, build issues, update-docker-image-tags issues. after we finished tagging the final release for 3.19, stephen discovered a race condition in DB migration, so we had to fix it, retract 3.19.0 and release 3.19.1

this week

i started looking into the dhall migration/keep-up-to-date tool. will continue playing with that for a couple of days. also help geoffrey with planning and fleshing out the dhall project for real now