meteor / meteor

Meteor, the JavaScript App Platform
https://meteor.com
Other
44.37k stars 5.19k forks source link

Meteor logged in but Meteor.loggingIn() stuck on true / initial login method works but subsequent methods never complete #11323

Closed TheRealNate closed 3 years ago

TheRealNate commented 3 years ago

I am seeing an intermittent issue where some devices get stuck. The user is logged in (Meteor.user() returns user object) and Meteor.status() says that the server is connected, but Meteor.loggingIn() returns true, and no subsequent method calls or subscriptions work. Meteor.logout() also does not work.

After looking in the websocket message logs, I can see that the login method is called and the initial subscriptions (version updating, login service config, etc) are all made. The login method receives a result and the subscriptions receive their data and get a ready response. After that however, Meteor calls don't work, and are not posted to the websocket. The websocket also shows that the typical ping/pong messages continue to go back and forth (so it doesn't seem like the connection is being closed/interrupted)

Restarting the server (or in Galaxy, killing the container) fixes the issue temporarily. The issue does not affect all devices/users at the same time either. Reloading the page (even Ctrl/Cmnd + Shift + R) does not fix the issue.

No logs (that could be relevant) are appearing in the server logs.

This is a new deployment of an existing codebase. I have had the same app deployed on a separate Galaxy deployment and DigitalOcean for months and never seen this issue.

This issue is occurring on: Versions: 1.10.x-2.0 Environment: Production (Meteor Galaxy) (cannot be reproduced on dev)

Theres been some discussion over this on the forums, and it appears others may also be experiencing this: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248

Here are a list of my packages:

accounts-password                  1.7.0  Password support for accounts
ecmascript                         0.15.0  Compiler plugin that supports ES2015+ in all .js files
email                              2.0.0  Send email messages
es5-shim                           4.8.0  Shims and polyfills to improve ECMAScript 5 support
http                               1.4.3  Make HTTP calls to remote servers
insecure                           1.0.7  (For prototyping only) Allow all database writes from the client
konecty:multiple-instances-status  1.1.0  Keep a collection with active servers/instances
konecty:user-presence              2.6.3  Track user status
mdg:meteor-apm-agent               3.3.0  Performance Monitoring for Meteor
meteor-base                        1.4.0  Packages that every Meteor app needs
mobile-experience                  1.1.0  Packages for a great mobile user experience
mongo                              1.10.1  Adaptor for using MongoDB and Minimongo over DDP
msavin:sjobs                       3.1.1* The simple jobs queue that just works [synced, schedule, tasks, background, later, worker, cron]
random                             1.2.0  Random number generator and utilities
react-meteor-data                  2.2.2  React hook for reactively tracking Meteor data
reactive-var                       1.0.11  Reactive variable
session-metrics                    0.0.1+
shell-server                       0.5.0  Server-side component of the `meteor shell` command.
standard-minifier-css              1.7.2  Standard css minifier used with Meteor apps by default.
standard-minifier-js               2.6.0  Standard javascript minifiers used with Meteor apps by default.
static-html                        1.2.2  Define static page content in .html files
typescript                         4.1.2  Compiler plugin that compiles TypeScript and ECMAScript in .ts and .tsx files
filipenevola commented 3 years ago

Hi @therealnate I don't think this is an issue in Meteor core packages, it seems to me you have some code that is not finishing correctly so the Fiber is still running and not returning.

Can you reproduce the problem? If so you could get a CPU profile using this package so you could see what is using the CPU when the problem is happening https://github.com/quavedev/profile

Otherwise we would need a reproduction repository to be able to reproduce the problem.

TheRealNate commented 3 years ago

Hi @filipenevola,

I've been trying to create a reproduction but so far I have been unable to. I do have some more info. I pulled this from a device that was experiencing the issue:

screenshot

It looks like the first method (login) received a response, but the following methods aren't getting set. Coupled with the fact that loggingIn is stuck on true, could something be going wrong in between the login method getting a response and settings loggingIn to false?

TheRealNate commented 3 years ago

I think I may have figured out the issue.

I was connecting to a second Meteor server simultaneously over DDP. I was using DDP.connect("...") to establish a new connection, but it appears that triggers the DDP.onReconnect event, even for the primary connection. If the timing lines up badly, this can stop the login method from resolving.

@filipenevola is this considered an issue, or is the DDP API not designed for multiple simultaneous connections?

jancurn commented 3 years ago

We're experiencing the very same issue at https://my.apify.com. It started a couple of months ago, but got worse recently. Seems it has already been discussed at https://forums.meteor.com/t/meteor-login-method-never-invokes-callback-blocks-all-future-rpc-calls/36305/28

We'll post more details shortly.

jancurn commented 3 years ago

Might be related to https://github.com/meteor/meteor/issues/10432 ?

gippy commented 3 years ago

I have been trying to debug the issue @jancurn wrote about on our production servers this week both using chrome dev tools on client and node dev tools on backend and found this (to provide context, we are running on meteor 1.12.1):

On client The login method is properly handled until this point. The reconnectStopper is initialized, everything is set up but the code inside the event handler is never called. Everything else going on in the client is at this point is put into queue and connection._methodsBlockingQuiescence contains the invoked login method. This never changes. Doing any meteor calls using console results in the message being received by the client from the server, but it's put into queue so nothing happens with the result. This also only happens if user is not already logged in. If user is logged in, then the onReconnect call is not called either, but there is nothing in _methodsBlockingQuiescence. Accounts._loggingIn looks like this: image Where the dependency is this function: https://github.com/iron-meteor/iron-router/blob/205bb9eaf60e97d3fa3770f1f3740dc3f8157eda/lib/route_controller_client.js#L86 And here is how the Accounts.connection variable looks: image

Actualy after testing it on functional server I'm not sure if the onReconnect is part of the issue. Yes unlike when it's stuck it's called and I get to it in debugger, but looking at the app in that state it seems that it's called after other things happen that do not happen when it's stuck so it might be unrelated...

On server On our backend I have not been able to use debugger to find a similar point, every time I tried to follow the login process I get to the point where Accounts._attemptLogin successfuly finishes login and returns result but whathever happens after is in different fiber and I could not figure out how to continue the debugging. We are running the meteor process behind load balancer on multiple kubernetes pods and it did not seem to be connected with heavy load. Each time I saw the issue the pods were at 60-70% cpu. Killing the pod causes the issue to reapear on different pod in half an hour or later. Restarting meteor process has the same effect. I have also checked the Meteor.server.sessions variable to see if there is anything weird. But there have been random numbers of sessions (from 37 to 80) and there was nothing waiting in queue in those sessions.

TheRealNate commented 3 years ago

Hi @jancurn @gippy,

My issue seems to be due to DDP.onReconnect being called because I am connecting to a second DDP server. Are you also trying to connect to a second DDP server? If not, your issue may be related to something else.

I had also discussed this with people on the Meteor forums. Some of their posts may be relevant to you: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248

jancurn commented 3 years ago

I'm not sure how related is our issue, but we also noticed that there is a DDP.onReconnect event is called when this sort of hang happens. We don't explicitly connect to another server using DDP.connect...

mjmasn commented 3 years ago

Just to add my +1

I've reported this issue before, but never found a resolution. We've been experiencing it for 4 or 5 years now.

Just had it happen on a Meteor 1.6 app that had been running for some weeks, restarting the node app resolves the issue but it's not an ideal solution because we don't have a way to alert ourselves until a support request comes in. Running N servers with sticky session cookies means that if 1 server fails only ~Total Users / N users will experience the issue and may not even report it.

We're not running on Galaxy, just running the bundle with PM2 in this case (Nomad / Docker in other cases). We only have 1 DDP connection per client.

Once an instance of the server 'breaks' in this way, all new connections to that server will be broken. Logged in users aren't logged out but they won't be able to log out. Unauthenticated users won't be able to log in.

No idea how to reliably repro this issue other than running an app for a long period of time (days or weeks).

jancurn commented 3 years ago

From our experience, the only thing that reliably stops the issue from happening is disabling the oplog tailing. We're now working on getting rid of it from our app without reducing the quality of the user experience. Yep, this is a hack

TheRealNate commented 3 years ago

Hey @jancurn, when you say disabling Oplog Tailing, do you mean removing MONGO_OPLOG_URL? I was having the issue before I had even added Oplog to the app.

@jancurn and @mjmasn, I think the issue you are having may be a different cause. I'll recap what seems to have been causing my issue:

1. Page Opens
2. App connects to app server as well as second DDP server
3. The connection to the second DDP server is established at the same time as the login to the first server completes, halting the login process flow and preventing the login process from finishing
4. This locks loggingIn to true, and blocks any other DDP calls from going through.

@jancurn, you mentioned that like me, you are also seeing DDP.onReconnect being called for a second time, but since you're not connecting to a second server the reason for it firing may be different.

I should mention that DDP.onReconnect only blocks the login method when the new connection is different than the original connection that login was called for. You may want to inspect the connection argument of DDP.onReconnect and see if it is indeed different, and if so, the connection params may give you an idea about why the issue is happening.

My original issue is resolved, but since lots of people are having similar issues I won't close this issue.

etyp commented 3 years ago

Cross posting my latest response on the Meteor forums below as I was the original forum poster and have more details since we've been living with this for a while:

We are still experiencing this (rarely, but it happens) despite a lot of changes and optimizations to our setup. A few notes on our setup:

  1. We are not using standard Meteor oplog. We’re using redis-oplog and have been for over 2 years now. While it may relate to multiplexer code, the original oplog driver code isn’t in use for our codebase.

  2. We are not using a 2nd server side DDP connection like @TheRealNate mentioned here

  3. We most frequently see it happen when compute times ramp up on the server. Think of something that might be blocking the event loop. For us, we found that certain rare operations would block the event loop long enough such that many other DDP invocations sent from clients connected to a server would get backed up. Eventually, the server “recovers” in that its CPU and Memory are normal, any new or existing DDP connections to that server get stuck in the dead state I originally described.

We also have seen this happen if our Mongo cluster gets backed up - similar to DDP invocations getting backed up when the event loop is blocked, servers can sometimes run into the same issue with Mongo slowness or large data sets being returned.

@filipenevola 's comment above about a Fiber running but not returning is close to the conclusion we came to as well, but the issue is very difficult to catch in real-time or replicate. At this point we’re just very cognizant and cautious about how we handle large datasets to avoid this.

I should mention that we’ve also seen this happen in the event of a Mongo failover or Mongo configuration change quite often. Sometimes our Meteor servers can handle this, but in the unfortunate event that a failover or configuration change happens while we’re under decently heavy load, we can as anywhere from 20-50% of our Meteor servers experience this.

filipenevola commented 3 years ago

Except on @TheRealNate case I don't think the others are running on Galaxy.

I say that because I believe the reconnection can be happening more often out of Galaxy and the reconnection could be causing this issue (maybe the reconnection is causing the Fiber to be running without a return).

I'm just guessing here based on the testimonials.

Does that make sense?

If so we could reproduce the issue I believe because @TheRealNate said he was able to cause this connecting to a different DDP server, with a reproduction it would be much easier to isolate what is hanging and try to capture the problem and somehow interrupt this for the server to become responsive again.

What do you think?

@TheRealNate could you create a repo with a reproduction as you know exactly how are you causing the issue?

afrokick commented 3 years ago

One week ago we deployed a build with a bug, which caused 100% CPU for container. We started to investigate it via live debugging. https://medium.com/the-node-js-collection/live-debugging-node-js-apps-at-the-command-line-cd5b58f883e1 Is is a super cool stuff which helped us to solve the issue in 5 mins.

@etyp try to run it for your container/process.

johannel00 commented 3 years ago

One week ago we deployed a build with a bug, which caused 100% CPU for container. We started to investigate it via live debugging. https://medium.com/the-node-js-collection/live-debugging-node-js-apps-at-the-command-line-cd5b58f883e1 Is is a super cool stuff which helped us to solve the issue in 5 mins.

@etyp try to run it for your container/process.

Thanks @afrokick for sharing this absolute gem of an article.

StorytellerCZ commented 3 years ago

@jancurn any update on this? Seems like @TheRealNate fixed the issue in his app: https://forums.meteor.com/t/meteor-loggingin-stuck-on-true-even-though-user-is-logged-in/55248/26?u=storyteller

jancurn commented 3 years ago

We stopped using oplog tailing and this type of issue is no longer happening.

filipenevola commented 3 years ago

I'm closing this issue because we haven't heard anything from the original poster for a while.