glitch-soc / mastodon

A glitchy but lovable microblogging server
https://glitch-soc.github.io/docs/
GNU Affero General Public License v3.0

Federation issues on v4.3.0-alpha.1+glitch #2609

Closed by Crashdoom 7 months ago

Crashdoom commented 7 months ago

Steps to reproduce the problem

  1. Attempt to follow a remote user that does not have follow approvals required
  2. UI will initially show "Unfollow"
  3. Refreshing the page will show "Cancel Follow"

Expected behaviour

User should be followed

Actual behaviour

Follow appears pending forever

Detailed description

I'm honestly unsure how to troubleshoot this, as mastodon-web is the only service that actually logs anything about the follow attempt: it logs the follow request to my inbox with a 202 Accepted.

I've checked Sidekiq and there appear to be a lot of LinkCrawlWorker errors with "Aws::S3::Errors::InternalError: We encountered an internal error. Please try again." (we're using Cloudflare R2), but I'm unsure if that's related or a separate issue.

Happy to provide / search for additional debug info as needed!

(As an aside, we're on v4.3.0 alpha as I wasn't sure how to update to a given version with Glitch-SOC, so any info on that for future reference would also be really appreciated!)

Mastodon instance

furry.engineer, pawb.fun

Mastodon version

v4.3.0-alpha.1+glitch

Technical details

If this is happening on your own Mastodon server, please fill out those:

Sidekiq Setup:

Crashdoom commented 7 months ago

Additional troubleshooting seems to imply follows are going through, at least between my accounts on furry.engineer and meow.social.

When following from furry.engineer to meow.social:

ClearlyClaire commented 7 months ago

This means outgoing communication from furry.engineer to meow.social works fine, but for whatever reason, the Accept activity from meow.social to furry.engineer does not get processed appropriately.
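For reference, here is a minimal sketch of the Follow/Accept exchange described above, using ActivityStreams-style dictionaries. The actor URLs, the activity ID, and the helper and class names are placeholders made up for illustration; this is not how Mastodon implements it internally, only the general shape of the handshake.

```python
AS_CONTEXT = "https://www.w3.org/ns/activitystreams"


def make_follow(actor, target, activity_id):
    # Follow activity sent from the follower's server to the target's inbox
    return {"@context": AS_CONTEXT, "id": activity_id, "type": "Follow",
            "actor": actor, "object": target}


def make_accept(follow):
    # Accept activity sent back by the target's server, wrapping the Follow
    return {"@context": AS_CONTEXT, "type": "Accept",
            "actor": follow["object"], "object": follow}


class FollowerServer:
    """Hypothetical stand-in for the server that initiates the follow."""

    def __init__(self):
        self.pending = {}       # follow requests awaiting an Accept
        self.following = set()  # confirmed follows

    def send_follow(self, follow):
        # The follow stays pending until the matching Accept is processed
        self.pending[follow["id"]] = follow

    def receive(self, activity):
        if activity.get("type") != "Accept":
            return
        follow = self.pending.pop(activity["object"]["id"], None)
        if follow is not None:
            self.following.add(follow["object"])


# Placeholder actors on two example domains
alice = "https://local.example/users/alice"
bob = "https://remote.example/users/bob"

server = FollowerServer()
follow = make_follow(alice, bob, "https://local.example/follows/1")
server.send_follow(follow)
pending_before = len(server.pending)

# If this step is never processed, the follow appears pending forever
server.receive(make_accept(follow))
```

In this toy model, a follow that never leaves `pending` corresponds to the "Cancel Follow" state reported in the issue.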

LinkCrawlWorker errors are in themselves not an issue, but I guess they may point at a common underlying issue, although that seems unlikely. Are there other error logs in sidekiq? What's the state of the queues, especially the ingress queue?

What did you update from?

> (As an aside, we're on v4.3.0 alpha as I wasn't sure how to update to a given version with Glitch-SOC, so any info on that for future reference would also be really appreciated!)

There are no specific glitch-soc versions, it's a rolling release, as I do not have the ability to maintain multiple branches.

Crashdoom commented 7 months ago

> LinkCrawlWorker errors are in themselves not an issue, but I guess they may point at a common underlying issue, although that seems unlikely. Are there other error logs in sidekiq? What's the state of the queues, especially the ingress queue?

All of the queues are completely empty, but dead jobs have piled up with those LinkCrawlWorker errors, causing us to max out our dead jobs queue on both instances. I haven't seen any errors in the journalctl log for any of the workers, and the dead / retry jobs look to have the same typical issues (rate limits, instances being down or offline, etc.), other than the LinkCrawlWorker and now RedownloadMediaWorker errors.

> What did you update from?

We were previously on v4.2.0+glitch and updated directly. I checked the Mastodon change logs to see what I needed to do for each update to make sure there wasn't anything that stood out, and I'm beginning to wonder if I missed something...

> There are no specific glitch-soc versions, it's a rolling release, as I do not have the ability to maintain multiple branches.

Aah, yep, I figured something like that since I can imagine it would be a complete mess trying to organize that!

ClearlyClaire commented 7 months ago

Nothing immediately comes to mind. Does that occur with one account in particular, or can you reproduce this with multiple accounts? Does it occur with other kinds of activities?

Crashdoom commented 7 months ago

We're able to reproduce it with our admin accounts, and several users have also reported it. It doesn't seem to be limited to meow.social either: trying to follow Gargron on mastodon.social did the same thing just now.

I've verified that incoming and outgoing posts work fine, incoming follows also seem to work fine, as do outgoing follow request approvals.

ClearlyClaire commented 7 months ago

Hm, I'm not sure what could be happening there. Can you check whether the person-to-be-followed appears in your follows list? It could be that the relationship cache is not correctly updated but the follow has been correctly processed.

Crashdoom commented 7 months ago

Yep, they appear under the follows on my end, and I appear under followers on their end (e.g. https://techhub.social/@Raccoon/followers).


ClearlyClaire commented 7 months ago

Can you make sure you have properly restarted all Mastodon processes? That is, are you sure you're not running old sidekiq processes on some queue?

A possible explanation for what you're seeing is that we have changed how we cache relationship data so that cache invalidation is much more efficient. But if you are running two different versions of the code at once, one version will use one cache key, while another will use a different cache key.
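As a rough illustration of that failure mode, here is a toy model of two code versions sharing one cache but disagreeing on the key format. The key formats, function names, and IDs are all made up for the example and are not Mastodon's actual cache keys or code.

```python
# Toy model: one shared cache, two code versions with different key schemes.
# Every name and key format here is hypothetical.

cache = {}  # stands in for the shared cache store


def old_key(account_id, target_id):
    # Key scheme used by the old code (invented for illustration)
    return f"relationship:{account_id}:{target_id}"


def new_key(account_id, target_id):
    # Key scheme used by the new code (also invented)
    return f"relationships/{account_id}/{target_id}"


def web_read_relationship(account_id, target_id, db):
    # New web process: read through the cache under the NEW key
    key = new_key(account_id, target_id)
    if key not in cache:
        cache[key] = db[(account_id, target_id)]
    return cache[key]


def stale_worker_accept_follow(account_id, target_id, db):
    # Old worker process: updates the database, then invalidates the OLD key,
    # so the entry cached under the new key is never cleared
    db[(account_id, target_id)] = "following"
    cache.pop(old_key(account_id, target_id), None)


db = {(1, 2): "requested"}
before = web_read_relationship(1, 2, db)  # caches "requested" under the new key
stale_worker_accept_follow(1, 2, db)      # database updated, wrong key invalidated
after = web_read_relationship(1, 2, db)   # still served from the stale cache entry
print(before, after, db[(1, 2)])
```

Under these assumptions the database holds the completed follow while the web process keeps showing the pending state, which would match a follow that appears in the follows list but still displays as a pending request.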

Crashdoom commented 7 months ago

@ClearlyClaire I tried restarting and had the same issue, so I completely stopped all services and cleanly brought them back up, and that appears to have worked!

It doesn't back-fix the broken follows, but new outgoing follow attempts appear to be working again, and that's all that matters!

Thank you for helping troubleshoot -- Closing!