Closed TheRealNate closed 3 years ago
Hi @therealnate I don't think this is an issue in Meteor core packages, it seems to me you have some code that is not finishing correctly so the Fiber is still running and not returning.
Can you reproduce the problem? If so you could get a CPU profile using this package so you could see what is using the CPU when the problem is happening https://github.com/quavedev/profile
Otherwise we would need a reproduction repository to be able to reproduce the problem.
Hi @filipenevola,
I've been trying to create a reproduction but so far I have been unable to. I do have some more info. I pulled this from a device that was experiencing the issue:
It looks like the first method (login) received a response, but the following methods aren't getting set. Coupled with the fact that loggingIn is stuck on true, could something be going wrong in between the login method getting a response and settings loggingIn to false?
I think I may have figured out the issue.
I was connecting to a second Meteor server simultaneously over DDP. I was using DDP.connect("...")
to establish a new connection, but it appears that triggers the DDP.onReconnect
event, even for the primary connection. If the timing lines up badly, this can stop the login method from resolving.
@filipenevola is this considered an issue, or is the DDP API not designed for multiple simultaneous connections?
We're experiencing the very same issue at https://my.apify.com. It started a couple of months ago, but got worse recently. Seems it has already been discussed at https://forums.meteor.com/t/meteor-login-method-never-invokes-callback-blocks-all-future-rpc-calls/36305/28
We'll post more details shortly.
Might be related to https://github.com/meteor/meteor/issues/10432 ?
I have been trying to debug the issue @jancurn wrote about on our production servers this week both using chrome dev tools on client and node dev tools on backend and found this (to provide context, we are running on meteor 1.12.1):
On client
The login method is properly handled until this point. The reconnectStopper is initialized, everything is set up but the code inside the event handler is never called.
Everything else going on in the client is at this point is put into queue and connection._methodsBlockingQuiescence
contains the invoked login method. This never changes. Doing any meteor calls using console results in the message being received by the client from the server, but it's put into queue so nothing happens with the result. This also only happens if user is not already logged in. If user is logged in, then the onReconnect
call is not called either, but there is nothing in _methodsBlockingQuiescence
.
Accounts._loggingIn
looks like this:
Where the dependency is this function: https://github.com/iron-meteor/iron-router/blob/205bb9eaf60e97d3fa3770f1f3740dc3f8157eda/lib/route_controller_client.js#L86
And here is how the Accounts.connection variable looks:
Actualy after testing it on functional server I'm not sure if the onReconnect
is part of the issue. Yes unlike when it's stuck it's called and I get to it in debugger, but looking at the app in that state it seems that it's called after other things happen that do not happen when it's stuck so it might be unrelated...
On server
On our backend I have not been able to use debugger to find a similar point, every time I tried to follow the login process I get to the point where Accounts._attemptLogin
successfuly finishes login and returns result but whathever happens after is in different fiber and I could not figure out how to continue the debugging.
We are running the meteor process behind load balancer on multiple kubernetes pods and it did not seem to be connected with heavy load. Each time I saw the issue the pods were at 60-70% cpu. Killing the pod causes the issue to reapear on different pod in half an hour or later. Restarting meteor process has the same effect.
I have also checked the Meteor.server.sessions
variable to see if there is anything weird. But there have been random numbers of sessions (from 37 to 80) and there was nothing waiting in queue in those sessions.
Hi @jancurn @gippy,
My issue seems to be due to DDP.onReconnect
being called because I am connecting to a second DDP server. Are you also trying to connect to a second DDP server? If not, your issue may be related to something else.
I had also discussed this with people on the Meteor forums. Some of their posts may be relevant to you: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248
I'm not sure how related is our issue, but we also noticed that there is a DDP.onReconnect
event is called when this sort of hang happens. We don't explicitly connect to another server using DDP.connect
...
Just to add my +1
I've reported this issue before, but never found a resolution. We've been experiencing it for 4 or 5 years now.
Just had it happen on a Meteor 1.6 app that had been running for some weeks, restarting the node app resolves the issue but it's not an ideal solution because we don't have a way to alert ourselves until a support request comes in. Running N servers with sticky session cookies means that if 1 server fails only ~Total Users / N
users will experience the issue and may not even report it.
We're not running on Galaxy, just running the bundle with PM2 in this case (Nomad / Docker in other cases). We only have 1 DDP connection per client.
Once an instance of the server 'breaks' in this way, all new connections to that server will be broken. Logged in users aren't logged out but they won't be able to log out. Unauthenticated users won't be able to log in.
No idea how to reliably repro this issue other than running an app for a long period of time (days or weeks).
From our experience, the only thing that reliably stops the issue from happening is disabling the oplog tailing. We're now working on getting rid of it from our app without reducing the quality of the user experience. Yep, this is a hack
Hey @jancurn, when you say disabling Oplog Tailing, do you mean removing MONGO_OPLOG_URL
? I was having the issue before I had even added Oplog to the app.
@jancurn and @mjmasn, I think the issue you are having may be a different cause. I'll recap what seems to have been causing my issue:
1. Page Opens
2. App connects to app server as well as second DDP server
3. The connection to the second DDP server is established at the same time as the login to the first server completes, halting the login process flow and preventing the login process from finishing
4. This locks loggingIn to true, and blocks any other DDP calls from going through.
@jancurn, you mentioned that like me, you are also seeing DDP.onReconnect being called for a second time, but since you're not connecting to a second server the reason for it firing may be different.
I should mention that DDP.onReconnect only blocks the login method when the new connection is different than the original connection that login was called for. You may want to inspect the connection argument of DDP.onReconnect and see if it is indeed different, and if so, the connection params may give you an idea about why the issue is happening.
My original issue is resolved, but since lots of people are having similar issues I won't close this issue.
Cross posting my latest response on the Meteor forums below as I was the original forum poster and have more details since we've been living with this for a while:
We are still experiencing this (rarely, but it happens) despite a lot of changes and optimizations to our setup. A few notes on our setup:
We are not using standard Meteor oplog. We’re using redis-oplog and have been for over 2 years now. While it may relate to multiplexer code, the original oplog driver code isn’t in use for our codebase.
We are not using a 2nd server side DDP connection like @TheRealNate mentioned here
We most frequently see it happen when compute times ramp up on the server. Think of something that might be blocking the event loop. For us, we found that certain rare operations would block the event loop long enough such that many other DDP invocations sent from clients connected to a server would get backed up. Eventually, the server “recovers” in that its CPU and Memory are normal, any new or existing DDP connections to that server get stuck in the dead state I originally described.
We also have seen this happen if our Mongo cluster gets backed up - similar to DDP invocations getting backed up when the event loop is blocked, servers can sometimes run into the same issue with Mongo slowness or large data sets being returned.
@filipenevola 's comment above about a Fiber running but not returning is close to the conclusion we came to as well, but the issue is very difficult to catch in real-time or replicate. At this point we’re just very cognizant and cautious about how we handle large datasets to avoid this.
I should mention that we’ve also seen this happen in the event of a Mongo failover or Mongo configuration change quite often. Sometimes our Meteor servers can handle this, but in the unfortunate event that a failover or configuration change happens while we’re under decently heavy load, we can as anywhere from 20-50% of our Meteor servers experience this.
Except on @TheRealNate case I don't think the others are running on Galaxy.
I say that because I believe the reconnection can be happening more often out of Galaxy and the reconnection could be causing this issue (maybe the reconnection is causing the Fiber to be running without a return).
I'm just guessing here based on the testimonials.
Does that make sense?
If so we could reproduce the issue I believe because @TheRealNate said he was able to cause this connecting to a different DDP server, with a reproduction it would be much easier to isolate what is hanging and try to capture the problem and somehow interrupt this for the server to become responsive again.
What do you think?
@TheRealNate could you create a repo with a reproduction as you know exactly how are you causing the issue?
One week ago we deployed a build with a bug, which caused 100% CPU for container. We started to investigate it via live debugging. https://medium.com/the-node-js-collection/live-debugging-node-js-apps-at-the-command-line-cd5b58f883e1 Is is a super cool stuff which helped us to solve the issue in 5 mins.
@etyp try to run it for your container/process.
One week ago we deployed a build with a bug, which caused 100% CPU for container. We started to investigate it via live debugging. https://medium.com/the-node-js-collection/live-debugging-node-js-apps-at-the-command-line-cd5b58f883e1 Is is a super cool stuff which helped us to solve the issue in 5 mins.
@etyp try to run it for your container/process.
Thanks @afrokick for sharing this absolute gem of an article.
@jancurn any update on this? Seems like @TheRealNate fixed the issue in his app: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248/26?u=storyteller
We stopped using oplog tailing and this type of issue is no longer happening.
I'm closing this issue because we haven't heard anything from the original poster for a while.
I am seeing an intermittent issue where some devices get stuck. The user is logged in (Meteor.user() returns user object) and Meteor.status() says that the server is connected, but Meteor.loggingIn() returns true, and no subsequent method calls or subscriptions work. Meteor.logout() also does not work.
After looking in the websocket message logs, I can see that the login method is called and the initial subscriptions (version updating, login service config, etc) are all made. The login method receives a result and the subscriptions receive their data and get a ready response. After that however, Meteor calls don't work, and are not posted to the websocket. The websocket also shows that the typical ping/pong messages continue to go back and forth (so it doesn't seem like the connection is being closed/interrupted)
Restarting the server (or in Galaxy, killing the container) fixes the issue temporarily. The issue does not affect all devices/users at the same time either. Reloading the page (even Ctrl/Cmnd + Shift + R) does not fix the issue.
No logs (that could be relevant) are appearing in the server logs.
This is a new deployment of an existing codebase. I have had the same app deployed on a separate Galaxy deployment and DigitalOcean for months and never seen this issue.
This issue is occurring on: Versions: 1.10.x-2.0 Environment: Production (Meteor Galaxy) (cannot be reproduced on dev)
Theres been some discussion over this on the forums, and it appears others may also be experiencing this: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248
Here are a list of my packages: