openjs-foundation / security-collab-space

a repository for documenting and coordinating the foundation's security collaboration space
Apache License 2.0

Provide an SLA and a ticket system for the infrastructure maintained by the Foundation #42

Open mcollina opened 1 year ago

mcollina commented 1 year ago

The infrastructure provided by the OpenJS Foundation to the projects works great most of the time. However, the Fastify website (https://fastify.io) was down for two days because of a DNS misconfiguration. The only person I know of who has full access on our behalf is @bensternthal, who was unavailable during the weekend for personal reasons.

We should provide more support to our projects, along with documented escalation paths to somebody in the Linux Foundation who manages this kind of critical infrastructure (e.g. DNS). Note that this burden should not fall only on @bensternthal; it should be shared by a few people so that somebody is far more likely to be online to fix problems that arise during non-work hours.

bensternthal commented 1 year ago

@vvalderrv & @tykeal I thought this would be relevant to the infrastructure workstream.

bensternthal commented 11 months ago

@vvalderrv since the infra workstream is going to take some time, I was wondering if we had an existing process that we could leverage in these circumstances. E.g. can community members just file Jira tickets? If so, do we have any documentation on how this works that I could review?

tykeal commented 11 months ago

While the Linux Foundation does not provide SLAs to projects, we do have a service desk available at https://support.linuxfoundation.org where any community member can raise issues related to systems and infrastructure that we are running for them. Escalations of issues are available through that system by way of a button that appears after a minimum of 1 hour of being open. Outside of business hours this escalation is supposed to alert our 24/7 on call staff of an issue that may need servicing on infrastructure that we do not have monitors on.

Now, while we do not provide SLAs, we do attempt to respond as quickly as possible during standard business hours.

Please note that we do not monitor the support queue outside of standard business hours; our 24/7 on-call staff respond to alerts related to monitoring of infrastructure.
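
As a rough illustration of the escalation timing described above, the check below models when a ticket becomes eligible for escalation. Only the one-hour minimum comes from this comment; the function names, the constant, and the sample timestamps are hypothetical, and this is a sketch of the rule, not how the LF service desk is implemented.

```python
from datetime import datetime, timedelta

# From the comment above: the escalation button appears once a ticket
# has been open for a minimum of one hour.
ESCALATION_DELAY = timedelta(hours=1)


def can_escalate(opened_at: datetime, now: datetime) -> bool:
    """Return True once the ticket has been open long enough to escalate."""
    return now - opened_at >= ESCALATION_DELAY


def time_until_escalation(opened_at: datetime, now: datetime) -> timedelta:
    """Return the remaining wait before escalation (zero if already eligible)."""
    remaining = ESCALATION_DELAY - (now - opened_at)
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    opened = datetime(2024, 1, 6, 9, 0)  # hypothetical ticket opened on a Saturday
    for minutes in (30, 60, 90):
        now = opened + timedelta(minutes=minutes)
        print(minutes, can_escalate(opened, now), time_until_escalation(opened, now))
```

In practice this means that for an outage outside business hours, the earliest point at which on-call staff can be alerted through the service desk is one hour after the ticket is filed.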

tykeal commented 11 months ago

As a side comment on the above issue: it sounds like a change was made on or just before a weekend. LF Release Engineering and the Operations team strongly advise against this sort of activity, as it can easily lead to problems such as the one described in this issue.

mcollina commented 11 months ago

> While the Linux Foundation does not provide SLAs to projects, we do have a service desk available at https://support.linuxfoundation.org/ where any community member can raise issues related to systems and infrastructure that we are running for them. Escalations of issues are available through that system by way of a button that appears after a minimum of 1 hour of being open. Outside of business hours this escalation is supposed to alert our 24/7 on call staff of an issue that may need servicing on infrastructure that we do not have monitors on.

@openjs-foundation/cpc I think we should document this somewhere in our docs.

> As a side comment on the above issue, sounds like a change was made on or just before a weekend, LF Release Engineering and the Operations team strongly advise against this sort of activity as it can easily lead to problems such as what is described in the issue.

@tykeal no change should have been made to fastify.io: I did not request any. What was changed? The modification I asked @bensternthal for concerned fastify.dev, which worked perfectly.

ljharb commented 11 months ago

I’m surprised an SLA wouldn’t be provided; that’s an industry standard that our projects should be able to rely on. If LF IT can’t provide one, perhaps we should find another infrastructure solution?

bensternthal commented 11 months ago

For me, the most important thing is that support (broadly defined) is in scope of the infra work the LF IT team is doing as part of the Sovereign Tech Fund work. AFAIK it will be. Let's give LF IT a chance to do their research, interview projects, and propose informed solutions.

ljharb commented 11 months ago

That's great - but I think a critical piece is beginning to provide an SLA moving forward.

ljharb commented 11 months ago

@tykeal I appear to have had a misunderstanding - you may have meant SLA in terms of responsiveness, while I was thinking SLA in terms of website/service uptime? Either way, I hope that we can figure out a way to get everyone's needs met, since obviously, sharing and utilizing LF IT's resources is the most efficient path, and I look forward to working with LF IT to figure this out!

tykeal commented 11 months ago

@ljharb you're correct, SLA does translate to a general responsiveness agreement in every IT department I've ever worked in ;)

Again, LFIT does not provide such SLAs to our projects. The reason is that we just aren't staffed for that sort of support, and it would dramatically increase our costs if we were! We do, however, offer our support queue, which is staffed by our operations staff, release engineering staff, and our LFX platform support staff. We strive for, but do not guarantee, a response (but not necessarily a resolution) within 4 hours during standard business hours. Outside of that, we have 24/7 monitoring of infrastructure that we have direct control over, that is, where we're in charge of the actual system that hosts it.

mcollina commented 11 months ago

Given that LFIT cannot provide SLAs for OpenJS, can access to critical systems (e.g. DNS or even servers) be shared with more than one person in OpenJS (currently @bensternthal has access)? Admittedly, problems would arise very seldom, but when they do it's a big issue, and the maintainers are personally grilled because things stop working.

bensternthal commented 11 months ago

I think there is a long-term and a short-term solution to this issue. Long term, as part of the DEST infra work, we will have a well-documented common set of infra across our projects, with clearly defined owners and escalation paths. Short term, I think the goal is to have documentation on what to do if an OpenJS project runs into an issue or has infra questions.

I think the first step here would be to detail out the systems and who manages them. Honestly, it is not a ton and I think we probably have some shared misunderstandings of what LF IT, OpenJS, and the community have access to. We have quite a few projects with a hodgepodge of infra.

With the above in hand, I imagine we could come up with some decision trees for common problems one might encounter. These would include backup / other contacts (to address @mcollina's concerns).

I'd be happy to take on the above if folks agree that it's a good next step.
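
One possible shape for the inventory and decision trees proposed above is sketched below, purely as an illustration. The `System` record, the `INVENTORY` entries, and the bracketed placeholder contacts are all hypothetical; the only detail taken from this thread is the service desk URL. A real inventory would list the actual systems, owners, and backups.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    name: str
    managed_by: str  # team that operates the system
    contacts: list[str] = field(default_factory=list)  # primary first, then backups
    escalation: str = ""  # where to go if no contact responds


# Hypothetical entries for illustration only; names are placeholders.
INVENTORY = {
    "dns": System(
        name="DNS",
        managed_by="<owning team>",
        contacts=["<primary contact>", "<backup contact>"],
        escalation="https://support.linuxfoundation.org",
    ),
    "website": System(
        name="Website hosting",
        managed_by="<owning team>",
        contacts=["<primary contact>"],
    ),
}


def escalation_steps(inventory: dict[str, System], key: str) -> list[str]:
    """Return the ordered steps to follow when the given system has a problem."""
    system = inventory[key]
    steps = [f"Contact {contact}" for contact in system.contacts]
    if system.escalation:
        steps.append(f"File a ticket at {system.escalation}")
    return steps


def single_points_of_failure(inventory: dict[str, System]) -> list[str]:
    """List systems with fewer than two contacts, i.e. no backup person."""
    return [s.name for s in inventory.values() if len(s.contacts) < 2]


if __name__ == "__main__":
    for step in escalation_steps(INVENTORY, "dns"):
        print(step)
    print("No backup contact:", single_points_of_failure(INVENTORY))
```

Keeping the inventory as data makes it easy to flag systems that depend on a single person, which is the concern raised earlier in this thread.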