superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] GtS hangs for minutes under unclear circumstances; keeps connections open #1223

Closed mirabilos closed 1 year ago

mirabilos commented 1 year ago

Describe the bug with a clear and concise description of what the bug is.

I was replying to the thread around https://infosec.exchange/@malwaretech/109473999199870857 and the “TOOT!” button in Pinafore became a loading indicator.

Subsequent requests seemed to fail, so I navigated to the / of my instance and it also showed timeouts.

There’s no clear log “around” this. Some log lines were still added, but not very many. I did see a very suspicious one:

@400000006390e778343b9ec4 timestamp="07/12/2022 19:20:14.876" func=concurrency.(*WorkerPool).Queue.func1 level=ERROR type=worker.Worker[messages.FromClientAPI] error="BatchDeliver: at least one failure: POST request to https://floss.social/inbox failed (401): 401 Unauthorized; POST request to https://gts.superseriousbusiness.org/users/dumpsterqueer/inbox failed (403): 403 Forbidden; POST request to https://k.iim.gay/users/kim/inbox failed (403): 403 Forbidden; POST request to https://infosec.exchange/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.vlaanderen/inbox failed (401): 401 Unauthorized; POST request to https://todon.eu/inbox failed (401): 401 Unauthorized; POST request to https://troet.cafe/inbox failed (401): 401 Unauthorized; POST request to https://slightly.social/inbox failed (401): 401 Unauthorized; POST request to https://masto.ai/inbox failed (401): 401 Unauthorized; POST request to https://chaos.social/inbox failed (401): 401 Unauthorized; POST request to https://nrw.social/inbox failed (401): 401 Unauthorized; POST request to https://pagan.plus/inbox failed (401): 401 Unauthorized; POST request to https://mas.to/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.world/inbox failed (401): 401 Unauthorized; POST request to https://qoto.org/inbox failed (401): 401 Unauthorized; POST request to https://c.im/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.nu/inbox failed (401): 401 Unauthorized; POST request to https://fosstodon.org/inbox failed (401): 401 Unauthorized; POST request to https://pony.social/inbox failed (401): 401 Unauthorized; POST request to https://med-mastodon.com/inbox failed (401): 401 Unauthorized; POST request to https://mastodonapp.uk/inbox failed (401): 401 Unauthorized; POST request to https://universeodon.com/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.online/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.social/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.art/inbox failed (401): 401 Unauthorized; POST request to https://social.treehouse.systems/inbox failed (401): 401 Unauthorized; POST request to https://techhub.social/inbox failed (401): 401 Unauthorized; POST request to https://witches.live/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.sdf.org/inbox failed (401): 401 Unauthorized; POST request to https://mastodon.nl/inbox failed (401): 401 Unauthorized; POST request to https://bsd.network/inbox failed (401): 401 Unauthorized" msg="message processing error"

The first time this happened, I restarted the service; it came back almost immediately and was usable again. The second time, I’m holding out, and by now the web browser shows its own timeout message instead of Apache’s. This is apparently because my MaxRequestWorkers limit is reached: GtS seems to be holding the Apache-2-as-proxy connections hostage. netstat -anp agrees: lots of ESTABLISHED and CLOSE_WAIT connections from Apache to GtS… but also lots of them from GtS to PostgreSQL, to other instances’ port 443… and to seemingly random local ports:

tcp      324      0 127.0.0.1:8980          127.0.0.1:36956         ESTABLISHED 29103/gotosocial    

Interestingly enough, lynx 127.0.0.1:8980 does show me my instance’s start page, in this state.

What's your GoToSocial Version?

0.6.0 git-f9e5ec9

GoToSocial Arch

amd64 binary

Browser version

not relevant

What happened?

see above

What you expected to happen?

Even if GtS hangs, it SHOULD not hold the connections to other components hostage; this includes both the requests from Apache 2 (acting as proxy) and the outgoing connections, which include those to PostgreSQL but apparently also others.
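
For illustration only, a minimal sketch of the kind of timeouts that bound how long a Go net/http server and client will hold connections; the values and wiring here are assumptions, not GtS’s actual configuration:

package main

import (
    "net/http"
    "time"
)

func main() {
    // Incoming side: cap how long a slow or stalled proxy connection
    // can stay open (illustrative values only).
    srv := &http.Server{
        Addr:              "127.0.0.1:8980",
        ReadHeaderTimeout: 10 * time.Second,
        ReadTimeout:       30 * time.Second,
        WriteTimeout:      60 * time.Second,
        IdleTimeout:       90 * time.Second,
        Handler:           http.DefaultServeMux,
    }

    // Outgoing side: cap requests to remote inboxes and other services
    // so a slow peer cannot pin a connection indefinitely.
    client := &http.Client{Timeout: 30 * time.Second}
    _ = client

    _ = srv.ListenAndServe()
}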

How to reproduce it?

I could reproduce it for a second time by commenting on the thread linked above.

Anything else we need to know?

This didn’t occur before the upgrade to 0.6.0, but then, said thread didn’t exist back then either.

tsmethurst commented 1 year ago

Oh wow, that's quite a bug :') Thanks!

mirabilos commented 1 year ago

Update: during the 5–10 minutes it took me to analyse and report this, the connections apparently timed out, and the number of Apache 2 processes went down.

The log shows a (very) few lines like this:

[Wed Dec 07 19:32:53.220565 2022] [proxy_http:error] [pid 29413] (20014)Internal error (specific information not available): [client xx.xx.xx.xx:xxxxx] AH01102: error reading status line from remote server 127.0.0.1:8980

While the >150 Apache processes were being held up, the system was pretty idle:

top - 19:28:24 up 7 days,  3:22,  1 user,  load average: 0.09, 0.08, 0.08
Tasks: 273 total,   1 running, 272 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.8 us, 11.1 sy,  0.0 ni, 61.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1000024 total,    81032 free,   329884 used,   589108 buff/cache
KiB Swap:  2097148 total,  2025104 free,    72044 used.   373792 avail Mem 

I guess I can raise the maximum number of child processes, at the risk of triggering the OOM killer (though that risk is low; as we can see, RAM wasn’t fully used either).

Fastidious commented 1 year ago

I posted to the thread without problems. It doesn’t mean the issue isn’t there, I just can’t replicate. Running the latest binary with systemd, proxied via Caddy.

mirabilos commented 1 year ago

Fastidious dixit:

I posted to the thread without problems. It doesn’t mean the issue isn’t there, I just can’t replicate.

Yeah, great, I cannot reproduce it with that thread any more either. My Apache process count grew but also shrank again quickly enough (11→13→14→14→12), so it may have been something else, and this was just what tipped it over.

I’ll have a look at logs around the time…

mirabilos commented 1 year ago

OK, it’s not something I did.

top - 21:02:14 up 7 days,  4:55,  3 users,  load average: 27.36, 14.51, 5.70
Tasks: 456 total,   3 running, 452 sleeping,   0 stopped,   1 zombie
%Cpu(s): 40.0 us, 54.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  6.0 si,  0.0 st
KiB Mem :  1000024 total,    63856 free,   791272 used,   144896 buff/cache
KiB Swap:  2097148 total,  1667992 free,   429156 used.    41560 avail Mem 

Now this I call load… this happened on its own, so it’s probably network-related.

mirabilos commented 1 year ago

As for logs…

GtS logs only go back to 21:02:04Z (I have a log rotation system set up to limit the size, similar to what kiim wrote but using DJB dæmontools). They basically show slow queries, which is to be expected under the aforementioned load.

syslog is boring during these minutes.

Apache access_log is interesting. Filtering on just the GtS domain, I do not have any requests between 20:59:28 and 21:02:01 (which is roughly when I noticed the slowness). Before, however…

… a huge flood. Aggregating requests by minute:

Broken down by IP, the 20:59 block is very spread out: 429 unique addresses (no relation to the status code), all but the last twelve making a single-digit number of requests during that time; the last ones made 10, 10, 12, 13, 16, 17, 17, 18, 22, 22, 26, 28 requests, so at least it’s not a non-distributed DoS.

These requests seem to be pretty much all from genuine instances: Mastodon, GtS, Friendica, Pleroma, Akkoma, Mitra, Plume; one each of Misskey and microblogpub, two each of Ruby and MastodonRubyGem, and two Firefox.

The IP with 28 requests in that minute seems to be used by 13 different instances, though.

I can provide a somewhat anonymised access_log for the minutes around the incident (minus IP addresses and TLS parameters, but with timestamps, request paths and user agents) privately if this helps with debugging.

I guess I’ll scale down Apache for now though…

igalic commented 1 year ago

unrelatedly, you might wanna consider switching to a threaded mpm, as 150 processes can end up being quite heavyweight, further contributing to the memory crisis

mirabilos commented 1 year ago

Mina Galić dixit:

unrelatedly, you might wanna consider switching to a threaded mpm, as

Can’t, all my publicly reachable Linux VMs also run libapache-mod-php.

I also have only very limited experience with the other MPMs, which boils down to some trouble in prod when also proxying (to Tomcat, IIRC) that switching to prefork fixed, so I never run anything except prefork in prod.

150 processes can end up being quite heavyweight, further contributing to the memory crisis

True that, despite all the sharing. 150 weren’t a problem (see the first top(1) snippet) but 350 were (second one).

mirabilos commented 1 year ago

Requests by endpoint:

      1 GET /
      4 GET /.well-known/host-meta
    148 GET /.well-known/webfinger?…
      1 GET /@$myname
      2 GET /api/v1/accounts/…
      1 GET /api/v1/streaming/public/local
     30 GET /fileserver/…
    184 GET /users/$myname
      3 GET /users/$myname/.well-known/host-meta
      2 GET /users/$myname/.well-known/webfinger?…
     56 GET /users/$myname/collections/featured
     76 GET /users/$myname/followers
    109 GET /users/$myname/following
      6 GET /users/$myname/main-key
    119 GET /users/$myname/outbox
    450 GET /users/$myname/statuses/1specificpost
    153 GET /users/$myname/statuses/1specificpost/replies
      4 GET /users/$myname/statuses/4otherposts
      2 GET /users/$myname/statuses/4otherposts/replies
      3 GET /users/.well-known/host-meta
      2 GET /users/.well-known/webfinger?…
     40 POST /users/$myname/inbox

In case this helps… wow, that one specific post with 603 hits (450 for the status plus 153 for its replies collection) must have been something…

… would it make sense to have a smallish cache for some often-visited endpoints and the last couple of posts, so these can be delivered really, really fast and especially without bothering the database? (Ideally updated on change/creation, not populated when first requested, considering Mastodon seems to have all instances request at the same time; see the sketch below.) Maybe. I’m just thinking around randomly, without any actual experience in caching and very little in parallel programming.

… hrm, it would probably only help single-user, or at best few-user, instances. But it could be a configuration item so many-user instances can turn it off. What’s the target audience of GtS anyway? It seems to me few-user instances would be the most common.
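
Purely as a sketch of the idea above, assuming a hypothetical in-process store that is written on status create/edit and dropped on delete (rather than populated on first request); none of these names exist in GtS:

package statuscache

import "sync"

// StatusCache is a hypothetical write-through cache for the rendered
// ActivityPub JSON of recent statuses: entries are written when a status
// is created or edited and removed when it is deleted, so reads for hot
// statuses never have to touch the database.
type StatusCache struct {
    mu      sync.RWMutex
    entries map[string][]byte // status ID -> serialized JSON
}

func New() *StatusCache {
    return &StatusCache{entries: make(map[string][]byte)}
}

// Put would be called from the create/edit path.
func (c *StatusCache) Put(id string, body []byte) {
    c.mu.Lock()
    c.entries[id] = body
    c.mu.Unlock()
}

// Drop would be called from the delete path.
func (c *StatusCache) Drop(id string) {
    c.mu.Lock()
    delete(c.entries, id)
    c.mu.Unlock()
}

// Get would be consulted by the /users/{name}/statuses/{id} handler
// before falling back to the database.
func (c *StatusCache) Get(id string) ([]byte, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    body, ok := c.entries[id]
    return body, ok
}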

mirabilos commented 1 year ago

Hrm, no, caching only recent toots won’t help much. (Well, in this case it would probably have been recent, were it not for the crossposter.) The one specific post is from 20:28 UTC, but was hit hard at 20:59 (which is precisely when @april@macaw.social replied to it… who has 4.1k followers…).

mperham commented 1 year ago

@mirabilos That's the "thundering herd" problem discussed in #1226. A solution is to put a caching proxy in front of GtS and have GtS create short-lived cache entries for each toot's JSON/HTML, so that when /user/foo/statuses/123456 is requested, the proxy serves the cached entry for the next 2–3 minutes and doesn't need to hit GtS at all. You have to think carefully about the included data, though: fav count, boost count, and deleting and re-editing a toot are examples of things that make the cache stale.
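
A rough illustration of that short-lived caching idea, written as Go HTTP middleware with an assumed two-minute TTL; a real setup would more likely do this in the reverse proxy, and would also need to cache response headers and handle the staleness cases named above:

package middleware

import (
    "bytes"
    "net/http"
    "sync"
    "time"
)

// cachedResponse holds one stored response body and its expiry time.
type cachedResponse struct {
    body    []byte
    status  int
    expires time.Time
}

// TTLCache wraps a handler and serves identical GETs from memory for a
// short window, so a burst of fetches for one status hits the backend once.
func TTLCache(next http.Handler, ttl time.Duration) http.Handler {
    var mu sync.Mutex
    entries := map[string]cachedResponse{}

    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodGet {
            next.ServeHTTP(w, r)
            return
        }
        // Vary on the Accept header so HTML and ActivityPub JSON
        // representations are cached separately.
        key := r.URL.RequestURI() + "|" + r.Header.Get("Accept")

        mu.Lock()
        e, ok := entries[key]
        mu.Unlock()
        if ok && time.Now().Before(e.expires) {
            w.WriteHeader(e.status)
            w.Write(e.body)
            return
        }

        // Miss: let the backend answer, tee the response, store it.
        rec := &recorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, r)

        mu.Lock()
        entries[key] = cachedResponse{rec.buf.Bytes(), rec.status, time.Now().Add(ttl)}
        mu.Unlock()
    })
}

// recorder copies the response body so it can be stored as well as sent.
type recorder struct {
    http.ResponseWriter
    buf    bytes.Buffer
    status int
}

func (r *recorder) WriteHeader(code int) { r.status = code; r.ResponseWriter.WriteHeader(code) }

func (r *recorder) Write(b []byte) (int, error) {
    r.buf.Write(b)
    return r.ResponseWriter.Write(b)
}

Wiring it up would be as simple as http.ListenAndServe(addr, TTLCache(mux, 2*time.Minute)), at the cost of counts being up to one TTL window stale.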

mirabilos commented 1 year ago

fav/boost count are normally always shown as 0 anyway, at least until you actually open the specific toot…

In this case, the webfinger requests (alone) were not the problem, ofc.

(I’ve tried looking at the data, but after taking 15 minutes to figure out that I need -H 'Accept: application/activity+json', and then only getting an error about the request not being signed, I gave up.)

NyaaaWhatsUpDoc commented 1 year ago

Soon I plan on making a lot of our queue sizes and cache dials actually configurable, which might help narrow down this issue, since people can twiddle things to try to improve their situation. But I definitely do have more optimization profiling planned when I get time :)

tsmethurst commented 1 year ago

This seems to be fixed in current main with a combination of our image processing improvements, request throttling, and other little tweaks, so I'm going to close it :) Will be in a release soon(tm)

[edit] as always feel free to reopen if this crops up again