nightscout / cgm-remote-monitor

nightscout web monitor
GNU Affero General Public License v3.0
2.42k stars 71.75k forks

MongoDB Atlas reports too many connections to the database #8024

Open psonnera opened 1 year ago

psonnera commented 1 year ago

If you need support for Nightscout, PLEASE DO NOT FILE A TICKET HERE For support, please post a question to the "CGM in The Cloud" group in Facebook (https://www.facebook.com/groups/cgminthecloud) or visit the WeAreNotWaiting Discord at https://discord.gg/zg7CvCQ

Describe the bug: We have several reports of AAPS/Loop users receiving an email from MongoDB stating that they are using more than 500 connections, which the M0 tier doesn't support. As a result, Nightscout will crash.

To Reproduce: Not known yet; users confirm they didn't change the number of uploaders/downloaders.

Expected behavior: 500 connections seem impossible for Nightscout.

Screenshots

image image

Your setup information

Additional context: Not known yet.

(I will edit and update this with new information. I'm not using Atlas, so I haven't seen the issue myself.)

psonnera commented 1 year ago

The same issue has been seen with Railway.

image

sulkaharo commented 1 year ago

This is a pretty bizarre issue. Nightscout doesn't do any connection management, on the assumption that the MongoDB driver pools the connections and would never need more than a few concurrent connections, given that we're reusing the connection object. This makes me wonder whether there's been an update to the driver that requires code changes, whether there's a bug in the MongoDB driver, or whether there's a bug in Nightscout that causes runaway connections / connection leaks. The relevant code in NS hasn't changed and I have never observed that many connections being used. @bewest ideas?
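
To make the assumption above concrete (one shared connection object, with pooling left to the driver), here is a minimal sketch. It is not Nightscout's actual code: `makeSharedConnection` and `connectFn` are hypothetical names, and `connectFn` stands in for whatever really opens the MongoDB client.

```javascript
// Hypothetical helper: memoize a single connect call so that every caller
// shares one client object instead of opening a new one per request.
function makeSharedConnection(connectFn) {
  let pending = null; // promise for the one shared client

  return function getConnection() {
    if (pending === null) {
      // The executor runs synchronously, so connectFn is invoked exactly once
      // here; a thrown error becomes a rejected promise.
      pending = new Promise((resolve) => resolve(connectFn())).catch((err) => {
        pending = null; // forget the failed attempt so a later call can retry
        throw err;
      });
    }
    return pending;
  };
}
```

If every query goes through a helper like this, the number of open connections is bounded by the driver's pool size rather than by the number of requests. A leak would then have to come from somewhere that bypasses the shared object, or from the driver itself.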

bewest commented 1 year ago

Similar outlook, @sulkaharo. Maybe the MongoDB version? Are they using Mongo 6 or some version that changes the backwards compatibility?

bewest commented 1 year ago

What is the time period for these connections and how does Atlas measure them? It's possible that if the server is shedding connections, the client will have to reconnect. It's also possible that if Nightscout is crashing, perhaps due to an AAPS data issue, restarting many times will burn through more connections. It's worth looking carefully to see whether the crashes are causing the connections to increase rather than the other way around. In general, in the case of a DB error, Nightscout should show a detailed error page, not crash. However, crashing will generate 8 new connections when the hosting provider restarts the process. What's in the crash logs?

jamesthurlow commented 1 year ago

@bewest version 6 I think. Just trying to figure out how to get access to crash logs.

image

sulkaharo commented 1 year ago

Isn't this concurrent connections though? If so, Nightscout crashing can't be the culprit as crashing should release the connections. I recently observed my instance having gone to a mode where some event that causes the data reload was being triggered very often - if that happens in a runaway way, that could cause a lot of connections being used. I'll add a constraint to how often that can happen.

sulkaharo commented 1 year ago

Right, so bootevent already implements debounce, but it's only set to cap the data updates to once every 5 seconds, which I suspect is too fast for Atlas. Depending on how they defer the execution of queries, I guess this could cause 500 connections to be consumed if something was causing the data update event to be triggered frequently. I suggest we raise that DB load debounce to 15 seconds and see if that helps. With the current server implementation, this load is theoretically not needed at all.
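
For a rough feel of what raising the cap from 5 to 15 seconds would change, here is a small sketch. It is a simple leading-edge throttle with an injectable clock, not the debounce that bootevent actually uses, and `capFrequency` and `countLoads` are hypothetical names.

```javascript
// Hypothetical wrapper: run `loadFn` at most once per `minIntervalMs`.
// `now` is injectable so the behaviour can be checked without real timers.
function capFrequency(loadFn, minIntervalMs, now = Date.now) {
  let lastRun = -Infinity;

  return function maybeLoad(...args) {
    const t = now();
    if (t - lastRun < minIntervalMs) {
      return false; // too soon after the previous load: drop this trigger
    }
    lastRun = t;
    loadFn(...args);
    return true;
  };
}

// Simulate a runaway trigger that fires once per second for one minute and
// count how many DB loads actually get through for a given cap.
function countLoads(intervalMs) {
  let clock = 0;
  let loads = 0;
  const reload = capFrequency(() => { loads += 1; }, intervalMs, () => clock);
  for (clock = 0; clock < 60000; clock += 1000) reload();
  return loads;
}
```

In this simulation `countLoads(5000)` lets 12 loads through per minute and `countLoads(15000)` lets 4 through, so the longer interval cuts the load by a factor of three. Whether that is enough to keep Atlas under its limit depends on how long each load holds its connections, which this sketch does not model.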

The other potential culprit is the ddata_at REST endpoint, which has no cap on how frequently it can be called, so it'd be very easy to take an instance down by calling it at a rapid rate.

bewest commented 1 year ago

@jamesthurlow, if your Nightscout is hosted on Heroku, the crash logs will be there; the same goes for Railway, Fly.io, etc. Before this issue, had you previously hit the Atlas size quotas and then deleted some data? Maybe we can add some instrumentation to the pool to see if we can monitor the issue ourselves?
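
One possible shape for that pool instrumentation is sketched below. It assumes the MongoDB Node.js driver in use emits the connection pool monitoring events `connectionCreated` and `connectionClosed` on the client object, which depends on the driver version and options. `trackConnections` is a hypothetical helper, not existing Nightscout code.

```javascript
// Hypothetical helper: keep a running count of open connections by listening
// to pool events on `emitter` (assumed to be the MongoClient instance).
function trackConnections(emitter, log = console.log) {
  const stats = { open: 0, created: 0, closed: 0, peak: 0 };

  emitter.on('connectionCreated', () => {
    stats.created += 1;
    stats.open += 1;
    if (stats.open > stats.peak) {
      stats.peak = stats.open;
      log(`mongo pool: new peak of ${stats.peak} open connections`);
    }
  });

  emitter.on('connectionClosed', () => {
    stats.closed += 1;
    stats.open -= 1;
  });

  return stats; // live object; read it later to report current numbers
}
```

Calling `trackConnections(client)` once, right after the client is constructed, would show in the server log whether the count climbs steadily (a leak) or jumps in bursts (a runaway reload or repeated restarts).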

jamesthurlow commented 1 year ago

Hi @bewest - I have enabled access for @psonnera. I'm not precious about my account, so if someone else needs to dive in, let me know their email address.

Quite some time ago I did hit the quota limits. I deleted a whole load of data and the problem went away. That must have been 6 months ago.

Just to add I am using AAPS - not sure if that is a factor or not.

ninelore commented 8 months ago

Hello everyone, this issue recently appeared after I upgraded to 15.0.2 on fly.io.

Connections went down in Atlas after I shut down my Nightscout instance for testing.

image