loris-imageserver / loris

Loris IIIF Image Server

Inexplicable 503 and 504 errors #497

Open minusdavid opened 4 years ago

minusdavid commented 4 years ago

We've started running Loris in production, and we're starting to notice occasional 503 and 504 errors.

It's mostly just a feeling at this point, but I think that Loris (in this case running under mod_wsgi) is hanging, and the logs don't clearly say why. I'm mostly wondering if other people are having this issue. I think I saw @alexwlchan describe something similar on the WellcomeTrust GitHub, although I think he uses uwsgi and Nginx instead of mod_wsgi in Apache. The mod_wsgi author attributes this sort of scenario to the application.

In test environments, and for the majority of the time in production, the Loris servers (2 round-robin load-balanced servers, each with 5 single-threaded processes*) manage very well. But on occasion they seem to freeze up. At the moment, roughly every 12 hours the freezing gets bad enough that we need to kill the machine and bring up a new one.

The freezing doesn't happen during periods of high load either. The servers seem fine during their busiest times. It's actually often during the quietest times that we get the worst performance.

I'm going to try adding stack dump code to the WSGI file as per the mod_wsgi author's advice, but I'm wondering if others are having these same problems.

*I used to run 10 processes with 15 threads, as that was the default configuration, but that seemed even worse.
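
For anyone wanting to try the same stack-dump idea, here is a minimal sketch of what such code could look like. It is not the mod_wsgi author's snippet and not something taken from Loris; `dump_all_stacks` and `start_stack_dump_timer` are hypothetical names, and the sketch assumes Python 3 and that whatever is written to `sys.stderr` ends up in a log you can read (under mod_wsgi that is normally the Apache error log).

```python
import sys
import threading
import traceback


def dump_all_stacks():
    """Return a formatted stack trace for every live thread in this process."""
    # Map thread ids to readable names where the threading module knows them.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (thread_id, names.get(thread_id, "unknown")))
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def start_stack_dump_timer(interval_seconds=60.0, stream=sys.stderr):
    """Hypothetical helper: write all thread stacks to `stream` periodically.

    Returns a threading.Event; call .set() on it to stop the background thread.
    """
    stop = threading.Event()

    def _loop():
        # Event.wait returns False on timeout, so this runs once per interval
        # until stop.set() is called.
        while not stop.wait(interval_seconds):
            stream.write(dump_all_stacks() + "\n")
            stream.flush()

    threading.Thread(target=_loop, name="stack-dumper", daemon=True).start()
    return stop
```

Calling `start_stack_dump_timer()` once at the top of the WSGI file would log where every thread is each minute, so a hung request should show up as the same frame repeated across dumps. A periodic dump is noisy; triggering it from a signal or a marker file instead is a reasonable alternative.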

bcail commented 4 years ago

@minusdavid what version of Loris are you running? There's been at least one commit recently (https://github.com/loris-imageserver/loris/commit/8855cc958c8fbe7ac523e902a0424563b1da430e) that helped reduce our exceptions in production.

Please do post any stack traces you're able to get.

minusdavid commented 4 years ago

@bcail We're running 2.2.0, which is quite long in the tooth now. We're planning to switch over to Python 3 anyway, so we plan to upgrade very soon.

I notice that the latest release is 2.3.3 from June 2018 (https://github.com/loris-imageserver/loris/releases), but there has been a lot of work done since then.

Could we get a new release posted or at least tagged?

minusdavid commented 4 years ago

Happy to do packaging work on my end, but I would just like to know some boundaries for stability : ).

bcail commented 4 years ago

@minusdavid see #498.

minusdavid commented 4 years ago

You're a champion, @bcail

lsh-0 commented 4 years ago

How did you go with your crashes, @minusdavid? Did they improve after an upgrade?

We don't see freezes ourselves, but we do get random 5xx responses from time to time. We added a workaround in nginx to re-request the same image in a different format if the upstream server (uwsgi+iiif) returned an error. After an upgrade (to 2.3.3) the 5xx responses stopped happening, but a corrupted image was being produced instead, so we upgraded again (to 3.0) and it's much better behaved.

I went to remove the workaround but it turns out we're still getting the occasional random 5xx response behind the scenes.
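
The workaround itself lives in nginx config, which isn't reproduced here. As a rough illustration of the same retry logic only, here is a Python sketch. It assumes the image URL ends in the usual IIIF `quality.format` segment; `fetch_with_format_fallback`, the `fetch` callable, and the `png` fallback are all hypothetical choices for the example, not what the nginx rule actually does.

```python
def fetch_with_format_fallback(fetch, image_url, fallback_formats=("png",)):
    """Request image_url; on a 5xx status, retry with other format extensions.

    `fetch` is a hypothetical callable that takes a URL and returns a
    (status_code, body) tuple. Returns (url_used, status_code, body).
    """
    status, body = fetch(image_url)
    if status < 500:
        return image_url, status, body

    # Split ".../default.jpg" into ".../default" and "jpg".
    stem, _, original_format = image_url.rpartition(".")
    for fmt in fallback_formats:
        if fmt == original_format:
            continue  # no point retrying the format that just failed
        candidate = "%s.%s" % (stem, fmt)
        status, body = fetch(candidate)
        if status < 500:
            return candidate, status, body

    # Every attempt failed; report the last error against the original URL.
    return image_url, status, body
```

Note that a fallback like this hides the upstream error from clients, which is why the 5xx responses only became visible again when looking behind the workaround.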

minusdavid commented 4 years ago

@lsh-0 We never did the upgrade as the funding ran out, so that Loris server still has lots of issues, unfortunately. I hope that one day some money comes in and we can do the upgrade though.

bcail commented 4 years ago

@lsh-0 Glad things are looking better with the upgrade to 3.0. I would highly recommend that anyone getting 500 errors make sure they're running 3.0. Since you're still getting the occasional 5xx response, could you please post any error or stack trace from your logs? Thanks.