snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal
702 stars 51 forks source link

Bluesky publish sometimes fails blob upload #1670

Closed JoelOtter closed 2 months ago

JoelOtter commented 4 months ago

Need to look into this properly but putting it here so it's noted.

D 2024-02-21 20:47:28.922451+00:00 [com.atproto.repo.uploadBlob](http://com.atproto.repo.uploadblob/): {} <[urllib3.response.HTTPResponse](http://urllib3.response.httpresponse/) object at 0x3e169542add0>
D 2024-02-21 20:47:28.923977+00:00 Running requests.<function post at 0x3e16aa2cafc0> https://bsky.social/xrpc/com.atproto.repo.uploadBlob <[urllib3.response.HTTPResponse](http://urllib3.response.httpresponse/) object at 0x3e169542add0>  {'User-Agent': 'Bridgy (https://brid.gy/about)', 'Content-Type': 'image/jpeg', 'Authorization': '...'}
D 2024-02-21 20:47:29.168423+00:00 Got: {"error":"InvalidRequest","message":"A request body is expected but none was provided"}
W 2024-02-21 20:47:29.168739+00:00 Error 400, response body: '{"error":"InvalidRequest","message":"A request body is expected but none was provided"}'
I 2024-02-21 20:47:29.236916+00:00 Returning 400: {'error': 'Error: {"error":"InvalidRequest","message":"A request body is expected but none was provided"} 400 Client Error: Bad Request for url: https://bsky.social/xrpc/com.atproto.repo.uploadBlob'} {'response': <Response 212 bytes [200 OK]>}
E 2024-02-21 20:47:29.237247+00:00 <class '[werkzeug.exceptions.BadRequest](http://werkzeug.exceptions.badrequest/)'>: 400 Bad Request: {'error': 'Error: {"error":"InvalidRequest","message":"A request body is expected but none was provided"} 400 Client Error: Bad Request for url: https://bsky.social/xrpc/com.atproto.repo.uploadBlob'}

Getting A request body is expected but none was provided back from Bluesky which is pretty odd. This seems quite random, a retry will pretty much always fix it.

snarfed commented 4 months ago

Ugh, transient failures are the worst. Thanks for reporting and starting the sleuthing!

JoelOtter commented 2 months ago

So the reason a retry always fixes it is because this only happens when it has to re-fetch the token using the app password. Haven't found the root-cause in the code but hopefully it won't be too hard to track down.

JoelOtter commented 2 months ago

Hmm. So looking at these logs and reading I'm wondering if there's actually a race condition between doing the reauth (as it's a callback...?) and trying to do the upload. Though, it does appear like there might be a retry happening too, I can't find where that happens in the code. Any pointers? My Python's got rusty again 😆

snarfed commented 2 months ago

Hmm! Thanks for the sleuthing. There shouldn't be any concurrency here, each HTTP request is single threaded and uses its own lexrpc client. You're right about the automatic refresh: when a call fails auth, the lexrpc client refreshes the token, then reruns it. There's probably a bug with that and calls like uploadBlob with binary payloads.

JoelOtter commented 2 months ago

Yeah so after some digging I think the issue lies within lexrpc, or at least the fact we're passing the binary blob as a stream. I don't know exactly how these streams are implemented in Bridgy (or Python, honestly) but if they're anything like how they work in other languages, the stream is being fully consumed on the first request, and then when it makes the retry it's doing so with an empty stream.

A quick fix here might be to change it so we don't use a streaming request when fetching the image's binary data. This would mean keeping the whole image in memory, which we may not want to do, though then again perhaps we're doing that anyway? I'm unsure.

snarfed commented 2 months ago

Yeah so after some digging I think the issue lies within lexrpc, or at least the fact we're passing the binary blob as a stream...the stream is being fully consumed on the first request, and then when it makes the retry it's doing so with an empty stream.

That sounds exactly right. If so, the fix should be straightforward. I'll look.

snarfed commented 2 months ago

I'm trying to reproduce this in https://github.com/snarfed/granary/pull/721, but I haven't managed to quite yet. That test passes. 😕

snarfed commented 2 months ago

If your hunch is right, I think https://github.com/snarfed/lexrpc/commit/407734c6631072bcc15d4c2d781dadc70b6f7ee8 might fix it, but I'm still struggling to come up with a test case that reproduces the failure, in either granary or lexrpc.

snarfed commented 2 months ago

Aha, figured it out, I needed to actually read from the input stream in the test. Did that in snarfed/granary@4cd86e6cf8a869b02d417d7bd8db6a9dd8a796c3 and now the test is failing as expected.

snarfed commented 2 months ago

OK @JoelOtter I've deployed the lexrpc fix, feel free to retry. 🤞 And thanks again for the sleuthing!

JoelOtter commented 2 months ago

Unfortunately I think this has broken normal upload: https://brid.gy/log?module=default&start_time=1714512353&key=agdicmlkLWd5cloLEg1QdWJsaXNoZWRQYWdlIjNodHRwczovL3d3dy5qb2Vsb3R0ZXIuY29tL25vdGVzLzIwMjQvMDQvMzAtZnJpZW5kcy8MCxIHUHVibGlzaBiAgISU0rqRCQw

Retry doesn't work - I can observe this behaviour using the preview form too!

snarfed commented 2 months ago

Ugh, I should have tested myself. Looking.

snarfed commented 2 months ago

For now, I've rolled back to the last version.

JoelOtter commented 2 months ago

Yep, working as before :)

snarfed commented 2 months ago

Weird though, I don't get the failure. It complains that something isn't seekable, but the only place we added a seek is after an auth failure and refreshSession call, which didn't happen here...?

https://github.com/snarfed/lexrpc/blob/d254aa0c263aab13c66c5d8c243a765cacd0f087/lexrpc/client.py#L216

JoelOtter commented 2 months ago

Huh, yeah, that is weird.

JoelOtter commented 2 months ago

Does the underlying implementation of the streaming query use seek and that doesn't work with BufferedRandom (which is always used regardless here)?

snarfed commented 2 months ago

Ah, no, BufferedRandom itself checks that the stream you give it is seekable itself. Argh, that defeats the purpose of the fix. OK so we need a different way to buffer it and reset the stream.

https://github.com/python/cpython/blob/7fabcc727dee52a3e0dfe4f903ad414e93cf2dc9/Lib/_pyio.py#L1402

snarfed commented 2 months ago

🤷

snarfed commented 2 months ago

Maybe we just cut out the fancy streaming and always read it into memory?

JoelOtter commented 2 months ago

I think if you’re buffering the stream and resetting it that is by necessity reading the whole thing into memory anyway right?

The only alternative to downloading the whole thing - which I would prefer not to do as it is a potential abuse vector - would be to refactor the lexrpc code such that it takes a function (or perhaps some special Python thing I’m not aware of similar to ‘__getattr’) which would allow this to effectively re-request the stream. That or we invert it such that the retry is user driven but that’s a big ol refactor and loses some neat functionality.

snarfed commented 2 months ago

I think if you’re buffering the stream and resetting it that is by necessity reading the whole thing into memory anyway right?

When we have to refresh a token, yes, but not otherwise. I don't have a good sense of how often a publish will need to refresh the token, and therefore buffer the whole file in memory.

Fortunately if images average 1-5MB, that's only ~1-2% of our memory budget, and Bluesky publishes with images are rare, and very obscure and unlikely as an abuse vector, so I'm not worried. Buffering in memory is fine, I'll just do that.

JoelOtter commented 2 months ago

Fair enough. :) I'd still suggest maybe setting some sensible limit like 25MB based on the Content-Length when it fetches the image from the source but before it streams in the rest of the chunks - it's trivial to generate a many-GB PNG, and this recent AWS stuff has me on edge, and I'd feel awful if you got hit with an enormous bill!

snarfed commented 2 months ago

One difficulty with using Content-Length is that it sometimes doesn't exist, eg Flickr doesn't serve it at all. No joke. 😠 #944

Fortunately the failure mode here in the too-big case is just that an instance runs out of memory, drains its in progress requests, and then restarts. Not a big deal.

snarfed commented 2 months ago

OK I've deployed the fix that just always reads into memory. Once mor time with feeling, feel free to try again!

JoelOtter commented 2 months ago

Not seen it fail since that change so I think we can call this resolved. :) Thank you!