Open darrylmendillo opened 4 years ago
from: Jason Lubken
It hangs due to a known and unfixed bug during slow mssql/network reconnect after the service restarts due to a mssql/network disconnect. The architecture is incorrect and can't handle this.
The root cause of the "partial connection" failure is not known. We've assumed that it is network flakiness, but I'm inclined to look into vmware vmotion now as the root cause of the initial disconnect and the partial reconnect after restarting. It is causing the problems with mongo, "no one else" is experiencing these issues, and the mssql side lacks redundancy and is worthless for debugging connectivity anyhow.
I was going to open a ticket with IT and ask if any of the worker nodes (and 147 specifically, which was running vent) had vmotion during that time. Vent consumes a whole vm when under pressure. If the backing vmware node is over-provisioned, vmware may very well try to move that vm to a different node.
I have not done this yet. I think Chris Cordery is the new vmware contact.
I have outstanding IT tickets to:
However, a better approach is to open a ticket to attach our monitoring directly to the vmware logging outputs and to add you as a vmware vsphere admin on all of our clusters. We shouldn't have to ask them in order to know whether a vmotion happened.
Incident:
mssql hung for days without notification
The last Derived Date update received was on 2020-05-05 04:18:28.480
Observation:
In loki staging (monitoring production vent), we received the last vent-notify-mssql at 5/5/20 @ 4:30 AM.
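The gap between the last update and the time anyone noticed is the kind of thing a staleness check can catch. Below is a minimal sketch, assuming the timestamp of the last Derived Date update can be read from somewhere. The function name `stale_alert` and the 30-minute threshold are hypothetical and are not part of vent, loki, or promtail. Both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold: how long the output may go without an update
# before someone should be notified. Not taken from the actual deployment.
MAX_AGE = timedelta(minutes=30)


def stale_alert(last_update: datetime, now: datetime,
                max_age: timedelta = MAX_AGE) -> Optional[str]:
    """Return an alert message if the last update is older than max_age,
    otherwise None."""
    age = now - last_update
    if age <= max_age:
        return None
    stamp = last_update.isoformat(sep=" ", timespec="milliseconds")
    return f"mssql output stale: last update {stamp} ({age} ago)"


# Timestamps from this incident: the last Derived Date update, and the
# 11:00 AM email on May 7 that reported it.
last_update = datetime(2020, 5, 5, 4, 18, 28, 480000)
reported_at = datetime(2020, 5, 7, 11, 0)

message = stale_alert(last_update, reported_at)
if message is not None:
    print(message)
```

Run on a schedule, a check like this would have fired roughly half an hour after the last update rather than more than two days later. Who receives the message is a separate decision.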
Solution:
Release promtail update which adds:
Conversation:
I will make sure the same group will get these notifications, but if we want 24-hour support, then the service desk would need to be involved. I don’t normally check my email or texts at 4 AM.
From: Gitelman, Yevgeniy Yevgeniy.Gitelman@pennmedicine.upenn.edu Sent: Thursday, May 7, 2020 12:15 PM To: Lubken, Jason Jason.Lubken@pennmedicine.upenn.edu Cc: Draugelis, Michael E Michael.Draugelis@pennmedicine.upenn.edu; Bortnik, Alex Alex.Bortnik@pennmedicine.upenn.edu; Fuchs, Barry Barry.Fuchs@uphs.upenn.edu; Roy, Rubina Rubina.Roy@pennmedicine.upenn.edu; Pollock, Kevin Kevin.Pollock@pennmedicine.upenn.edu; Mendillo, Darryl J Darryl.Mendillo@Pennmedicine.upenn.edu Subject: Re: HVICU
Emails go to me, Bortnik, and Simpkins when there are API server-side errors. I don’t see why that same group can’t get emails when the data from signals falls behind.
Notifying service desk just adds a layer of people who can’t do anything about the problem
Yevgeniy Gitelman, MD Clinical Assistant Professor of Medicine Section of Hospital Medicine Clinical Informatics Manager Center for Health Care Innovation 646-596-4528
On May 7, 2020, at 11:58 AM, Lubken, Jason Jason.Lubken@pennmedicine.upenn.edu wrote:
All,
The mssql connection failed, and has been restarted. I'll follow up with the vmware group on a probable root cause.
Thanks,
Jason
From: Draugelis, Michael E Michael.Draugelis@pennmedicine.upenn.edu Sent: Thursday, May 7, 2020 11:31 AM To: Bortnik, Alex Alex.Bortnik@pennmedicine.upenn.edu; Gitelman, Yevgeniy Yevgeniy.Gitelman@pennmedicine.upenn.edu; Fuchs, Barry Barry.Fuchs@uphs.upenn.edu; Roy, Rubina Rubina.Roy@pennmedicine.upenn.edu; Pollock, Kevin Kevin.Pollock@pennmedicine.upenn.edu; Lubken, Jason Jason.Lubken@pennmedicine.upenn.edu; Mendillo, Darryl J Darryl.Mendillo@Pennmedicine.upenn.edu Subject: Re: HVICU
Jason and Darryl, Can you investigate the ventcue pipeline? We're still ingesting data, producing logs, and producing ventcue events, and text notifications. This may be another database interface break. Can you ensure the pipeline is healthy, then focus on the interface with the MSSQL database?
Kevin and Alex, regarding the process: monitoring the I-LEAD database is a key missing piece. Let us know how we can help. We have an on-call rotation and a daily checkout at 9 am. But we're not monitoring the I-LEAD database or I-LEAD board. Is there an equivalent on-call process for the I-LEAD components that we can synchronize with?
Mike Draugelis Chief Data Scientist, Penn Medicine 215-300-0979
From: Bortnik, Alex Alex.Bortnik@pennmedicine.upenn.edu Sent: Thursday, May 7, 2020 11:00 AM To: Gitelman, Yevgeniy; Fuchs, Barry; Draugelis, Michael E; Roy, Rubina; Pollock, Kevin Subject: RE: HVICU
Mike, The last Derived Date update we received was on 2020-05-05 04:18:28.480. We really need to fix this process.
From: Gitelman, Yevgeniy Yevgeniy.Gitelman@pennmedicine.upenn.edu Sent: Thursday, May 7, 2020 10:40 AM To: Fuchs, Barry Barry.Fuchs@uphs.upenn.edu; Draugelis, Michael E Michael.Draugelis@pennmedicine.upenn.edu; Roy, Rubina Rubina.Roy@pennmedicine.upenn.edu; Bortnik, Alex Alex.Bortnik@pennmedicine.upenn.edu; Pollock, Kevin Kevin.Pollock@pennmedicine.upenn.edu Subject: RE: HVICU
Mike – are we behind updating your data?
From: Fuchs, Barry Barry.Fuchs@uphs.upenn.edu Sent: Thursday, May 07, 2020 10:37 AM To: Draugelis, Michael E Michael.Draugelis@pennmedicine.upenn.edu; Roy, Rubina Rubina.Roy@pennmedicine.upenn.edu; Bortnik, Alex Alex.Bortnik@pennmedicine.upenn.edu; Pollock, Kevin Kevin.Pollock@pennmedicine.upenn.edu; Gitelman, Yevgeniy Yevgeniy.Gitelman@pennmedicine.upenn.edu Subject: Fwd: HVICU
Wow, that’s a lot of errors. Please encourage this reporting behavior but they could forward directly to Alex Bortnik going forward and cc me.
I have no idea why it would say not on vent when it simultaneously is saying O2 delivery device - Vent. It might be that they documented as off vent in the other field which we value as more reliable - I am not sure - but have forwarded this to the IT group to look into.
Rubina- can you check out the discordance b/w Penn chart and the icu board on the cases with missing extubation screens?
Thanks Barry Sent from my iPad
Begin forwarded message: From: "Chandler, John" John.Chandler@pennmedicine.upenn.edu Date: May 7, 2020 at 6:17:06 AM EDT To: "Fuchs, Barry" Barry.Fuchs@uphs.upenn.edu Subject: FW: HVICU
FYI-see finding by eRN and Penn E-lert.
Thanks!
From: Irvine, Kristina Kristina.Irvine@pennmedicine.upenn.edu Sent: Thursday, May 7, 2020 12:29 AM To: Chandler, John John.Chandler@pennmedicine.upenn.edu Cc: Williams, Maria Maria.Williams@pennmedicine.upenn.edu Subject: HVICU
These are patients that are vented, but say "not vented" on ICU board. Also missing notifications for "missing extubation risk screens"
Just FYI
Kris Irvine