pennsignals / legacy-system-services

Mssql notifier hung without notification #25

Open darrylmendillo opened 4 years ago

darrylmendillo commented 4 years ago

Incident:

mssql hung for days without notification

The last Derived Date update received was on 2020-05-05 04:18:28.480

Observation:

In loki staging (which monitors production vent), we received the last vent-notify-mssql on 5/5/20 at 4:30 AM.
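The gap described above (a last event followed by days of silence) is the kind of condition a staleness check can flag. Below is a minimal sketch of that idea in Python, not the project's actual monitoring: the function name `is_stale`, the one-hour `max_silence` threshold, and the check time are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_stale(last_seen: datetime, now: datetime, max_silence: timedelta) -> bool:
    """Return True when the newest event is older than the allowed silence."""
    return now - last_seen > max_silence


# Last vent-notify-mssql seen in loki, per the observation above.
last_notify = datetime(2020, 5, 5, 4, 30)

# Hypothetical threshold and check time, chosen only for this example.
max_silence = timedelta(hours=1)
checked_at = datetime(2020, 5, 7, 4, 30)

stale = is_stale(last_notify, checked_at, max_silence)
if stale:
    print(f"no vent-notify-mssql since {last_notify}; silent for {checked_at - last_notify}")
```

The useful property is that the alert is driven by the absence of events, so it would fire even when the notifier itself is hung and producing no errors.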

[Screenshot: 2020-05-07, 1:54:07 PM]

Solution:

Release promtail update which adds:

Conversation:

I will make sure the same group gets these notifications, but if we want 24-hour support then the service desk would need to be involved. I don’t normally check my emails/texts at 4 AM.

From: Gitelman, Yevgeniy <Yevgeniy.Gitelman@pennmedicine.upenn.edu>
Sent: Thursday, May 7, 2020 12:15 PM
To: Lubken, Jason <Jason.Lubken@pennmedicine.upenn.edu>
Cc: Draugelis, Michael E <Michael.Draugelis@pennmedicine.upenn.edu>; Bortnik, Alex <Alex.Bortnik@pennmedicine.upenn.edu>; Fuchs, Barry <Barry.Fuchs@uphs.upenn.edu>; Roy, Rubina <Rubina.Roy@pennmedicine.upenn.edu>; Pollock, Kevin <Kevin.Pollock@pennmedicine.upenn.edu>; Mendillo, Darryl J <Darryl.Mendillo@Pennmedicine.upenn.edu>
Subject: Re: HVICU

Emails go to me, Bortnik, and Simpkins when there are API server-side errors. I don’t see why that same group can’t get emails when the data from signals falls behind.

Notifying the service desk just adds a layer of people who can’t do anything about the problem.

Yevgeniy Gitelman, MD Clinical Assistant Professor of Medicine Section of Hospital Medicine Clinical Informatics Manager Center for Health Care Innovation 646-596-4528

On May 7, 2020, at 11:58 AM, Lubken, Jason <Jason.Lubken@pennmedicine.upenn.edu> wrote:

All,

The mssql connection failed, and has been restarted. I'll follow up with the vmware group on a probable root cause.

Thanks,

Jason

From: Draugelis, Michael E <Michael.Draugelis@pennmedicine.upenn.edu>
Sent: Thursday, May 7, 2020 11:31 AM
To: Bortnik, Alex <Alex.Bortnik@pennmedicine.upenn.edu>; Gitelman, Yevgeniy <Yevgeniy.Gitelman@pennmedicine.upenn.edu>; Fuchs, Barry <Barry.Fuchs@uphs.upenn.edu>; Roy, Rubina <Rubina.Roy@pennmedicine.upenn.edu>; Pollock, Kevin <Kevin.Pollock@pennmedicine.upenn.edu>; Lubken, Jason <Jason.Lubken@pennmedicine.upenn.edu>; Mendillo, Darryl J <Darryl.Mendillo@Pennmedicine.upenn.edu>
Subject: Re: HVICU

Jason and Darryl, can you investigate the ventcue pipeline? We're still ingesting data, producing logs, and producing ventcue events and text notifications. This may be another database interface break. Can you ensure the pipeline is healthy, then focus on the interface with the MSSQL database?

Kevin and Alex, regarding the process: monitoring the I-LEAD database is a key missing piece. Let us know how we can help. We have an on-call rotation and a daily checkout at 9 AM, but we're not monitoring the I-LEAD database or I-LEAD board. Is there an equivalent on-call process for the I-LEAD components that we can synchronize with?

Mike Draugelis Chief Data Scientist, Penn Medicine 215-300-0979


From: Bortnik, Alex <Alex.Bortnik@pennmedicine.upenn.edu>
Sent: Thursday, May 7, 2020 11:00 AM
To: Gitelman, Yevgeniy; Fuchs, Barry; Draugelis, Michael E; Roy, Rubina; Pollock, Kevin
Subject: RE: HVICU

Mike, the last Derived Date update we received was on 2020-05-05 04:18:28.480. We really need to fix this process.

From: Gitelman, Yevgeniy <Yevgeniy.Gitelman@pennmedicine.upenn.edu>
Sent: Thursday, May 7, 2020 10:40 AM
To: Fuchs, Barry <Barry.Fuchs@uphs.upenn.edu>; Draugelis, Michael E <Michael.Draugelis@pennmedicine.upenn.edu>; Roy, Rubina <Rubina.Roy@pennmedicine.upenn.edu>; Bortnik, Alex <Alex.Bortnik@pennmedicine.upenn.edu>; Pollock, Kevin <Kevin.Pollock@pennmedicine.upenn.edu>
Subject: RE: HVICU

Mike, are we behind on updating your data?

From: Fuchs, Barry <Barry.Fuchs@uphs.upenn.edu>
Sent: Thursday, May 07, 2020 10:37 AM
To: Draugelis, Michael E <Michael.Draugelis@pennmedicine.upenn.edu>; Roy, Rubina <Rubina.Roy@pennmedicine.upenn.edu>; Bortnik, Alex <Alex.Bortnik@pennmedicine.upenn.edu>; Pollock, Kevin <Kevin.Pollock@pennmedicine.upenn.edu>; Gitelman, Yevgeniy <Yevgeniy.Gitelman@pennmedicine.upenn.edu>
Subject: Fwd: HVICU

Wow, that’s a lot of errors. Please encourage this reporting behavior but they could forward directly to Alex Bortnik going forward and cc me.

I have no idea why it would say not on vent when it simultaneously is saying O2 delivery device - Vent. It might be that they documented as off vent in the other field which we value as more reliable - I am not sure - but have forwarded this to the IT group to look into.

Rubina- can you check out the discordance b/w Penn chart and the icu board on the cases with missing extubation screens?

Thanks Barry Sent from my iPad

Begin forwarded message:

From: "Chandler, John" <John.Chandler@pennmedicine.upenn.edu>
Date: May 7, 2020 at 6:17:06 AM EDT
To: "Fuchs, Barry" <Barry.Fuchs@uphs.upenn.edu>
Subject: FW: HVICU

FYI-see finding by eRN and Penn E-lert.

Thanks!

From: Irvine, Kristina <Kristina.Irvine@pennmedicine.upenn.edu>
Sent: Thursday, May 7, 2020 12:29 AM
To: Chandler, John <John.Chandler@pennmedicine.upenn.edu>
Cc: Williams, Maria <Maria.Williams@pennmedicine.upenn.edu>
Subject: HVICU

These are patients that are vented, but say "not vented" on ICU board. Also missing notifications for "missing extubation risk screens"

Just FYI

Kris Irvine

darrylmendillo commented 4 years ago

Expected Cause

from: Jason Lubken

It hangs due to a known and unfixed bug: when the service restarts after an mssql/network disconnect and the mssql/network reconnect is slow, the service hangs. The architecture is incorrect and can't handle this.
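One common way to keep a slow reconnect from hanging silently is to put a deadline around the connection attempt, so a stall becomes an error that can be logged and alerted on. The following is a minimal sketch of that technique, not the service's actual code: `connect_with_deadline`, `ConnectTimeout`, and `slow_connect` are hypothetical names, and `connect` stands in for whatever callable opens the mssql connection.

```python
import threading
import time
from typing import Callable


class ConnectTimeout(Exception):
    """Raised when a connection attempt does not finish before the deadline."""


def connect_with_deadline(connect: Callable[[], object], deadline_s: float) -> object:
    """Run connect() in a daemon thread and give up after deadline_s seconds."""
    result = {}

    def target() -> None:
        try:
            result["value"] = connect()
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        # The attempt is still stuck; report it instead of waiting forever.
        raise ConnectTimeout(f"connect did not finish within {deadline_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_connect() -> str:
    """Hypothetical stand-in for a reconnect that stalls."""
    time.sleep(0.5)
    return "connected"


try:
    connect_with_deadline(slow_connect, deadline_s=0.05)
except ConnectTimeout as exc:
    print(f"alert: {exc}")
```

Note that this does not cancel the stuck attempt; it only surfaces the stall so the caller can notify someone or exit and let a supervisor restart the process.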

The root cause of the "partial connection" failure is not known. We've assumed that it is network flakiness, but I'm now inclined to look into vmware vmotion as the root cause of both the initial disconnect and the partial reconnect after restarting. It is causing the problems with mongo, "no one else" is experiencing these issues, and the mssql side lacks redundancy and is worthless for debugging connectivity anyhow.

I was going to open a ticket with IT and ask if any of the worker nodes (and 147 specifically, which was running vent) had a vmotion during that time. Vent consumes a whole vm when under pressure. If the backing vmware node is over-provisioned, vmware may very well try to move that vm to a different node.

I have not done this yet. I think Chris Cordery is the new vmware contact.

I have outstanding IT tickets to:

  1. move vms to the new 5 node cluster which should allow us to resolve over-provisioning,
  2. get admin rights on the new cluster so we can restart hung vms and reallocate cpu/memory.

However, a better approach is to open a ticket to attach our monitoring to the vmware logging output itself and to add you as a vmware vsphere admin on all of our clusters. We shouldn't have to ask them in order to know whether a vmotion happened.