tmolitor-stud-tu / mod_push_appserver

Simple and extendable appserver for XMPP pushes (aka. XEP-0357)
MIT License
25 stars 9 forks source link

Slow event 'unregister-push-token' #28

Closed ge0rg closed 3 years ago

ge0rg commented 4 years ago

The unregistering of push tokens is synchronously blocking the prosody process, causing the server to stall. This happens from time to time, and the delay depends on the server load on Google servers, apparently. This is just from today's log:

Jul 02 07:07:20 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 32.18s: no recognised data
Jul 02 07:07:27 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 1.42s: no recognised data
Jul 02 08:08:17 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 10.37s: no recognised data
Jul 02 08:39:09 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 2.69s: no recognised data
Jul 02 09:07:38 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 2.37s: no recognised data
Jul 02 10:43:55 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 2.79s: no recognised data
Jul 02 12:17:47 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 3.22s: no recognised data
Jul 02 14:59:06 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 36.41s: no recognised data

Here is a single event that failed, look at the time gap after "Unregistering failing FCM token":

Jul 02 14:58:29 push.yax.im:push_appserver_fcm  warn    FCM returned 1 failures:
Jul 02 14:58:29 push.yax.im:push_appserver_fcm  warn    Got FCM error: 'NotRegistered'
Jul 02 14:58:29 push.yax.im:push_appserver_fcm  debug   Adding unregister-push-token to prosody event queue...
Jul 02 14:58:29 push.yax.im:push_appserver      warn    Push handler for type 'fcm' not executed successfully: NotRegistered
Jul 02 14:58:29 yax.im:cloud_notify     debug   Got error of type 'wait' (internal-server-error) for identifier 'push.yax.im<*********': NOT increasing error count for this identifier
Jul 02 14:58:29 push.yax.im:push_appserver_fcm  warn    Unregistering failing FCM token *************
<!-- server stalls for 36 seconds -->
Jul 02 14:59:06 datamanager     debug   Removing empty push_appserver datastore for user *****@push.yax.im
Jul 02 14:59:06 push.yax.im:push_appserver      info    Unregistered push device, returning: 'OK', '*****', '******************************'
Jul 02 14:59:06 push.yax.im:push_appserver      debug   settings were: {
  last_successful_push = "2020-07-02T03:56:02Z",
  type = "fcm",
  secret = "************************",
  last_push_error = "2020-07-02T12:58:29Z",
  renewed = "2020-07-02T03:39:27Z",
  token = "*************************************************",
  registered = "2020-07-01T13:59:27Z",
  node = "********"
}
Jul 02 14:59:06 mod_log_slow_events     warn    Slow event 'unregister-push-token' took 36.41s: no recognised data
tmolitor-stud-tu commented 4 years ago

That has nothing to do with google's servers because the operation is entirely local (just removing the node and token from the datastore).

The stall seems to occur in line 73 where the prosody store opened by module:open_store(); is written: local ok, err = store:set(node, cache[node]);. The previous store:get should complete without loading the on disk store because the previous failed push should have already loaded this node into the cache. That means: it is not the store read but the store write that stalls (the store:set() call).

Conclusion: your disk seems to be congested and could not complete the store write in a timely manner. Last time I checked, your yax.im server sometimes had a delay of 3-4 seconds for a simple smacks ack request-response cycle. Maybe this is related.

tmolitor-stud-tu commented 4 years ago

Maybe the datastore implementation of prosody is not that performant.

Do you use mysql or files as backend? I guess it's mysql, because that will only return after flushing the value to disk while a file store would operate using the linux pagecache and return nearly immediately.

ge0rg commented 3 years ago

I'm sorry I didn't immediately come back to you. I'm using prosody's default storage for the push component, so it's a huge directory on disk with many files. Apparently it's still blocking though - it's a host machine with some more users on it, so that might cause disk contention?

tmolitor-stud-tu commented 3 years ago

I guess so, databases doing an fsync might slow down disk access for other processes, too.

You could try to use mysql datastore for the appserver, maybe that speeds things up.

ge0rg commented 3 years ago

From skimming the code in https://hg.prosody.im/trunk/file/tip/util/datamanager.lua#l210 it looks like it first writes the (empty) data to the disk, using atomic_store(), which will create a temp file and rename it to the actual relevant file afterwards, then os.delete() the newly written file.

ge0rg commented 3 years ago

Okay, so I've actually attached strace to the prosody process, and found this (you can't see the read calls because I've excluded them from strace):

22157 10:30:57.786546 write(3289, "Aug 31 10:30:57 push.yax.im:push"..., 237) = 237 <0.000093>
22157 10:30:57.786850 stat("/var/lib/prosody/push%2eyax%2eim/push_appserver", {st_mode=S_IFDIR|0750, st_size=6942, ...}) = 0 <0.000047>
22157 10:30:57.787053 open("/var/lib/prosody/push%2eyax%2eim/push_appserver", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 2565 <0.000028>
22157 10:30:57.787186 fstat(2565, {st_mode=S_IFDIR|0750, st_size=6942, ...}) = 0 <0.000019>
22157 10:30:57.787377 getdents(2565, /* 806 entries */, 32768) = 32752 <0.102758>
22157 10:30:57.890444 open("/var/lib/prosody/push%2eyax%2eim/push_appserver/aaaaaaaaaaaa.dat", O_RDONLY) = 3183 <0.000141>
22157 10:30:57.890776 fstat(3183, {st_mode=S_IFREG|0640, st_size=443, ...}) = 0 <0.000084>
22157 10:30:57.966310 close(3183)       = 0 <0.000066>
22157 10:30:57.966896 open("/var/lib/prosody/push%2eyax%2eim/push_appserver/bbbbbbbbbbbb.dat", O_RDONLY) = 3183 <0.000201>
22157 10:30:57.967358 fstat(3183, {st_mode=S_IFREG|0640, st_size=443, ...}) = 0 <0.000053>
22157 10:30:57.973762 close(3183)       = 0 <0.000042>
...
... goes on and on and on
...
22157 10:31:55.141422 getdents(2565, /* 0 entries */, 32768) = 0 <0.000434>
22157 10:31:55.142025 close(2565)       = 0 <0.000045>
22157 10:31:55.142375 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2335, ...}) = 0 <0.000065>
22157 10:31:55.142597 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2335, ...}) = 0 <0.000092>
22157 10:31:55.142847 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2335, ...}) = 0 <0.000046>
22157 10:31:55.143148 open("/var/lib/prosody/push%2eyax%2eim/push_appserver/kkkkkkkkkkkk.dat~", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 2565 <0.000443>
22157 10:31:55.143756 fstat(2565, {st_mode=S_IFREG|0640, st_size=0, ...}) = 0 <0.000034>
22157 10:31:55.143923 write(2565, "return {};\n", 11) = 11 <0.000082>
22157 10:31:55.144120 close(2565)       = 0 <0.000033>
22157 10:31:55.144269 rename("/var/lib/prosody/push%2eyax%2eim/push_appserver/kkkkkkkkkkkk.dat~", "/var/lib/prosody/push%2eyax%2eim/push_appserver/kkkkkkkkkkkk.dat") = 0 <0.000260>
22157 10:31:55.145151 write(3289, "Aug 31 10:31:55 datamanager\tdebu"..., 108) = 108 <0.000438>
22157 10:31:55.145744 unlink("/var/lib/prosody/push%2eyax%2eim/push_appserver/kkkkkkkkkkkk.dat") = 0 <0.000151>
22157 10:31:55.146271 write(3289, "Aug 31 10:31:55 push.yax.im:push"..., 174) = 174 <0.000067>

The respective log messages are:

Aug 31 10:30:57 push.yax.im:push_appserver_fcm  warn    Unregistering failing FCM token ***********
Aug 31 10:31:55 datamanager     debug   Removing empty push_appserver datastore for user kkkkkkkkkkkk@push.yax.im
Aug 31 10:31:55 push.yax.im:push_appserver      info    Unregistered push device, returning: 'OK', 'kkkkkkkkk', '******************'

It looks like the responsible code is in https://github.com/tmolitor-stud-tu/mod_push_appserver/blob/master/mod_push_appserver/mod_push_appserver.lua#L86 where it loads each of the existing ~6900 nodes, and looks for the one with the right token. It would be great to replace that with a separate look-up table that's stored to disk instead of only being re-created on-demand by looping the whole datastore.

I'm also not sure why that token wasn't in the token2node_cache table before, maybe it's getting invalidated in https://github.com/tmolitor-stud-tu/mod_push_appserver/blob/master/mod_push_appserver/mod_push_appserver.lua#L71 where it should be set instead?

Also what's the rationale behind 78883a82d5b8b522a3d216b64403f33e30f5588a? The token should be opaque for you, shouldn't it?

tmolitor-stud-tu commented 3 years ago

78883a82d5b8b522a3d216b64403f33e30f5588a was because of apple returning the token in the feedback service lowercased while the token coming from the device was uppercase. It should not be needed anymore because the new http/2 interface doesn't use the feedback service anymore.

tmolitor-stud-tu commented 3 years ago

I added the token2node_cache exactly because I wanted to loop over all files only once (the first time a token to node translation is used) and use the cache afterwards. Having the cache written to disk in a separate file has the risk of loosing synchronisation between the "main" files and the cache file.

Making the initial population of the cache async (and solving the bug that the cache seems to be populated more than once per prosody run) should solve your problem, right?

tmolitor-stud-tu commented 3 years ago

Thanks for your really goo analysis btw!!

ge0rg commented 3 years ago

I'm not sure how much moving this to async is going to help, or how it's supposed to synchronize the loaded data back into the main "thread".

Solving the bug (I suppose just putting the value into the cache table in line 71 should do it, right?) would be a significant improvement already, thanks.

tmolitor-stud-tu commented 3 years ago

Sure, I don't know why this line is even there.