superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] Delivery to Mastodon servers gives 401 responses #1787

Closed bronikowski closed 1 year ago

bronikowski commented 1 year ago

Describe the bug with a clear and concise description of what the bug is.

It worked fine two days ago. Today I noticed that any action (posting, favouriting) gets rejected by a slew of Mastodon-running instances.

Deliver: at least one failure: POST request to https://chaos.social/inbox failed: 401 Unauthorized; POST request to https://101010.pl/inbox failed: 401 Unauthorized; POST request to https://mstdn.social/inbox failed: 401 Unauthorized; POST request to https://circumstances.run/inbox failed: 401 Unauthorized; POST request to https://c.im/inbox failed: 401 Unauthorized; POST request to https://lewacki.space/inbox failed: 401 Unauthorized; [cut for clarity, it goes for another dozen servers]

So either some upgrade to Mastodon broke something or my database (SQLite) got corrupted.

What's your GoToSocial Version?

0.8.1

GoToSocial Arch

amd64 binary

What happened?

No response

What you expected to happen?

Messages/actions should be delivered to inbox.

How to reproduce it?

No response

Anything else we need to know?

Delivery to my inbox works just fine for all affected instances. I can see them flow in, I just can't interact with them.

tsmethurst commented 1 year ago

Thanks for filing the bug! Does this always occur when delivering to these instances, or is it intermittent? I have an inkling that this might be related to the way we were delivering messages previous to the 0.9.0-rc, so it could be that it's already fixed on main, not sure.

bronikowski commented 1 year ago

Thank you for the quick response. I think it happened out of the blue two days ago. Just to test it, I favorited some mastodon.social post, and it went through without a hitch. For completeness' sake: mastodon.social runs 4.1.2. I checked one of the instances that rejects me: it runs v4.1.2+glitch.th. I checked another one that rejects me, and it's also 4.1.2, so we can exclude the Mastodon version as a vector.

I favorited a few more posts, and again:

timestamp="15/05/2023 11:54:18.522" func=processing.(*Processor).ProcessFromClientAPI level=INFO activityType=Create objectType=Like fromAccount=emil msg="processing from client"
timestamp="15/05/2023 11:54:18.522" func=federation.(*federatingActor).Send level=INFO msg="send activity Like via outbox https://helo.fuse.pl/users/emil/outbox"
timestamp="15/05/2023 11:54:18.527" func=middleware.Logger.func1.1 level=INFO latency=9.183239ms clientIP=192.168.1.36 userAgent="Mozilla/5.0 (X11; Linux aarch64; rv:109.0) Gecko/20100101 Firefox/112.0" method=POST statusCode=200 path=/api/v1/statuses/01H0CNCVSR7SCQ6D08Z9R5QVWY/favourite requestID=2r5xa7m804000ajwxmy0 msg="OK: wrote 1.56kiB"
timestamp="15/05/2023 11:54:18.550" func=transport.(*transport).do level=INFO pubKeyID=https://helo.fuse.pl/users/emil/main-key method=POST url=https://infosec.exchange/inbox msg="performing request"
timestamp="15/05/2023 11:54:18.982" func=processing.(*Processor).EnqueueClientAPI.func1 level=ERROR msg="error processing client API message: BatchDeliver: at least one failure: POST request to https://infosec.exchange/inbox failed: 401 Unauthorized"

I have a feeling it's something on my end. I changed my setup a little; maybe the forwarding proxy is mangling POST requests and stripping signatures?

bronikowski commented 1 year ago

Here's fav-undo action with log level set to debug:

timestamp="15/05/2023 12:10:28.541" func=federation.(*federatingActor).Send level=INFO msg="send activity Undo via outbox https://helo.fuse.pl/users/emil/outbox"
timestamp="15/05/2023 12:10:28.542" func=federatingdb.(*federatingDB).NewID level=DEBUG newID="{\"@context\":\"https://www.w3.org/ns/activitystreams\",\"actor\":\"https://helo.fuse.pl/users/emil\",\"object\":{\"actor\":\"https://helo.fuse.pl/users/emil\",\"id\":\"https://helo.fuse.pl/users/emil/liked/01H0FDA2RS8ETSEHPD1P26JWJ6\",\"object\":\"https://infosec.exchange/users/SecurityWriter/statuses/110366049959734415\",\"to\":\"https://infosec.exchange/users/SecurityWriter\",\"type\":\"Like\"},\"to\":\"https://infosec.exchange/users/SecurityWriter\",\"type\":\"Undo\"}" msg="entering NewID"
timestamp="15/05/2023 12:10:28.548" func=uris.IsUserPath level=DEBUG id=https://helo.fuse.pl/users/emil msg="entering Get"
timestamp="15/05/2023 12:10:28.553" func=middleware.Logger.func1.1 level=INFO latency=17.970767ms clientIP=192.168.1.36 userAgent="Mozilla/5.0 (X11; Linux aarch64; rv:109.0) Gecko/20100101 Firefox/112.0" method=POST statusCode=200 path=/api/v1/statuses/01H0CNCVSR7SCQ6D08Z9R5QVWY/unfavourite requestID=6zce67m804000zfsm330 msg="OK: wrote 1.56kiB"
timestamp="15/05/2023 12:10:28.577" func=transport.(*transport).do level=INFO pubKeyID=https://helo.fuse.pl/users/emil/main-key method=POST url=https://infosec.exchange/inbox msg="performing request"
timestamp="15/05/2023 12:10:28.840" func=processing.(*Processor).EnqueueClientAPI.func1 level=ERROR msg="error processing client API message: BatchDeliver: at least one failure: POST request to https://infosec.exchange/inbox failed: 401 Unauthorized"

bronikowski commented 1 year ago

Switched to trace. I notice that the servers that won't talk to me don't have the PASSED line in the logs:

timestamp="15/05/2023 12:19:40.458" func=federation.(*federator).AuthenticateFederatedRequest level=TRACE requestID=4x2er7m804001p028chg msg="authentication for https://oldbytes.space/actor PASSED with algorithm rsa-sha256"

So, have I lost their keys, and if so, is that fixable or am I in deep brown?

tsmethurst commented 1 year ago

Thanks for the logs!

have I lost their keys

I don't see how this would be the case...

I changed my setup a little

Could you explain what you changed?

bronikowski commented 1 year ago

I don't see how this would be the case...

I'm just guessing at this point, but I took a deeper look and maybe it's not it.

Could you explain what you changed?

Before, I had a VPS serving as a frontend (nginx) for the GTS instance on my home-bound RPi4, connected via SSH tunnel, so nginx would proxy localhost:8090. Now I've moved everything home: I built a Proxmox server and migrated off the RPi. The frontend and GTS now live on the same machine, with different IP addresses (.35 for GTS, .36 for httpd).

The config, bar the proxy_pass value, didn't change. The nginx version changed, since the old VPS was running a very old Ubuntu LTS (it's Debian 11 now).

Since I migrated nginx last, I ran the old VPS frontend against the GTS instance on the new hardware for a while, and it worked just fine. I wish I could retest that, but it would mean major downtime for me.

bronikowski commented 1 year ago

[attachment: screenshot of nginx configs]

left: old frontend, right: new

tsmethurst commented 1 year ago

Ahhh, I see... could you try changing your proxy_pass value to the actual IP address that bee corresponds to? Check the docs here: https://docs.gotosocial.org/en/latest/installation_guide/nginx/#set-up-nginx. We were seeing issues a while ago where setting a hostname rather than an IP address could cause problems with remotes that use IPv6...
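A sketch of what that location block might look like (the IP, port, and server_name here are assumptions based on this thread, not a verified config):

```nginx
server {
    server_name helo.fuse.pl;
    # ... TLS config etc. ...

    location / {
        # Point at the backend by IP rather than hostname, and pass the
        # original Host header through unchanged so signature checks on
        # both ends see the public domain, not an internal name.
        proxy_pass http://192.168.1.35:8090;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```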

bronikowski commented 1 year ago

Changed to corresponding IP, but no luck. I'm out of ideas.

tsmethurst commented 1 year ago

Hmm. Does nginx log anything interesting? (also, you reloaded it right?)

bronikowski commented 1 year ago

I'm looking at both logs with tail -f. Yeah, I reloaded. Restarted, even. The only "interesting" line is a 401 on a GET:

[15/May/2023:14:14:11 +0200] "GET /users/emil HTTP/1.1" 401 82 "-" "Pleroma 2.5.51-8627-g9aa446ef-neckbeard

Other than that, everything looks fine, except those unsigned/badly-signed POSTs (which I can only see on the GTS side, since it's GTS making the call as a client, right?).

bronikowski commented 1 year ago

mastodon.social works without any problem; most smaller instances fail. Very strange. Anyway, I'll leave it running, install the 0.9.0 RC, and if the problem persists I'm going to drop it. Thank you for your pointers and have a great day.

daenney commented 1 year ago

Mmm, so I noticed one thing:

 func=transport.(*transport).do level=INFO pubKeyID=https://helo.fuse.pl/users/emil/main-key method=POST url=https://infosec.exchange/inbox msg="performing request"
timestamp="15/05/2023 12:10:28.840" func=processing.(*Processor).EnqueueClientAPI.func1 level=ERROR msg="error processing client API message: BatchDeliver: at least one failure: POST request to https://infosec.exchange/inbox failed: 401 Unauthorized"

It's using batch delivery, via the shared inbox at infosec.exchange/inbox instead of each actor's individual inbox. That's great in itself, but I'm wondering if we might have a bug that causes us to use the shared inbox incorrectly, or when it's not available? It's weird that it suddenly started to fail, though.

For the other errored instances, do you see a BatchDeliver there too?

daenney commented 1 year ago

I'm actually seeing a bunch of BatchDeliver errors on my instance too: tech.lgbt, goblin.technology, k.iim.gay, ocamlot.xyz, and on it goes. The earliest I have in my logs is April 5th, and my logs go back further than that. It does seem to happen rather rarely; sometimes I go days without seeing it in the logs.

In most of my logs it seems to hit an individual user inbox, but for some Mastodon instances I'm seeing it on the global shared inbox too, social.pixie.town for example.

bronikowski commented 1 year ago

You know what, I don't remember seeing batch delivery before either. And no, I think when I interact with mastodon.social it's on a per-actor basis.

daenney commented 1 year ago

The last time I saw it in my logs was May 1st. Though I've been running main for a few weeks, so there might actually be a fix related to it already. Or it's just dumb luck.

tsmethurst commented 1 year ago

Can I just check another thing: are you letting nginx cache your /main-key endpoint, as described here: https://docs.gotosocial.org/en/latest/installation_guide/advanced/#caching-webfinger-webhost-metadata-and-public-key-responses? I'm assuming not, because I don't see it in your config, but I just wanted to check. The reason I'm asking is that in response to outgoing requests/message deliveries, I'd normally expect to see Mastodon instances requesting your main-key endpoint, but I don't see those requests in the logs you posted.
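For reference, the linked docs describe public-key caching roughly along these lines (a sketch adapted from that page; the zone name, sizes, and backend address are illustrative, not copied from this user's config):

```nginx
# In the http{} block: define a small cache for public AP responses.
proxy_cache_path /var/cache/nginx keys_zone=ap_public:10m inactive=1w;

# In the server{} block for the instance: serve cached main-key responses
# so remotes can verify signatures even when the backend is briefly down.
location ~ /main-key$ {
    proxy_cache ap_public;
    proxy_cache_key $scheme://$host$uri;
    proxy_cache_valid 200 604800s;          # cache good responses for a week
    proxy_cache_use_stale error timeout updating;
    proxy_pass http://192.168.1.35:8090;    # backend address assumed
    proxy_set_header Host $host;
}
```

Relevant here because if such a cache existed, the absence of main-key GETs in the backend logs would be expected; without it, their absence is a clue that remotes never reach the endpoint at all.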

bronikowski commented 1 year ago

No, I don't have any caching for that site. I'm spitballing here, but maybe whatever the failing servers use for DNS still hasn't refreshed since I moved? That would be crazy, but not unheard of. They resolve the old IP, kiss the pavement, get no keys from my endpoint, and so no auth?

Is that plausible?

tsmethurst commented 1 year ago

Mmmm possibly... Follow up question: when you moved, did you backup and restore your database? Or did you recreate your instance at the new location from scratch? If the latter, it could just be that you get rejected by remote instances because they don't recognize the key you're signing your requests with.

bronikowski commented 1 year ago

I moved IPs two days ago, but I was using the same database before the move (I just switched nginx frontends) and it worked fine, so this shouldn't be a problem, barring corruption.

EDIT: this is starting to make sense (the resolution problems, I mean). I noticed that a person who posts a lot, and who is on one of the problematic instances, hasn't landed in my TL for days. That's impossible. ;)

tsmethurst commented 1 year ago

I'm starting to suspect this might be an IPv6 vs IPv4 issue... Can you verify that your IPv6 address is set correctly in your DNS records? Just to rule that out.

bronikowski commented 1 year ago

I only had IPv6 on the VPS I switched from; there's no support for it at home… Jesus, dude, @tsmethurst, I still had the IPv6 entry in my zones. I think you've solved it. This is so, so stupid of me. Embarrassing, even.

It's always DNS in the end, right?

I'm so sorry for taking your valuable time.

tsmethurst commented 1 year ago

Hahaha :D Noooo problem, just happy it works!

bronikowski commented 1 year ago

Dang, sorry for re-opening this one, but it wasn't the IPv6 issue after all. In the meantime I tried everything. I even replicated the old setup on the new hardware: I SSH-forwarded GTS's port to the nginx host and proxy_passed a local address. I mucked about with the Host header, but curl -v shows it correctly, so that can't be it.

I don't think I can move forward without talking with one of the admins running instances that 401 me.

[some time passed]

I talked with one of the admins; direct messaging seems to be working. He gave me some logs, but nothing seemed suspicious. Maybe I should see what the spec says about batch deliveries.

bronikowski commented 1 year ago

OK, we (and by "we" I mean the other admin) figured it out. My NAT was mangling the connection (I don't know how or why); when I moved my nginx to the DMZ, everything worked.

tsmethurst commented 1 year ago

What a rollercoaster :D