openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Improve www/api DNS round robin "load balancer" #747

Open · Firefishy opened 2 years ago

Firefishy commented 2 years ago

Currently we round-robin DNS load balance traffic over the frontend web servers.

We drain connections by removing DNS records and allowing the old TTL to expire.

We should improve the load balancer to be host status aware.
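
For context, the round robin is nothing more than multiple A records on the same name. A hypothetical zone fragment (documentation-range addresses, not our real ones), where the low TTL bounds how long a removed host keeps receiving traffic:

```
; hypothetical zone fragment: DNS round robin over three frontends
; removing a record drains that host once the 300s TTL expires
www  300  IN  A  203.0.113.1
www  300  IN  A  203.0.113.2
www  300  IN  A  203.0.113.3
```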

tomhughes commented 2 years ago

We can't improve the load balancer because there is no load balancer - we would have to introduce one.

Firefishy commented 2 years ago

Some options:

  1. Physical dedicated load balancer (think F5 or similar)
  2. Software-based load balancer running on dedicated hardware (haproxy, nginx, caddy, envoy...) - see the sketch after this list
  3. Software-based load balancer sharing hardware with the web servers
  4. Cloud-based load balancer (Fastly, CloudFront, etc.)
  5. Dynamic DNS (like nominatim.osm.org)
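
For illustration, a minimal haproxy sketch of options 2/3 - the hostnames, addresses, and health-check URL here are assumptions, not our actual setup. With active checks, a failed server drops out of rotation in seconds instead of waiting for a DNS TTL to expire:

```
# hypothetical haproxy.cfg: health-check-aware round robin over three frontends
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 300s

frontend www_in
    bind *:80
    default_backend web

backend web
    balance roundrobin
    option httpchk GET /api/0.6/capabilities   # assumed check endpoint
    server web1 192.0.2.11:80 check inter 2s fall 3 rise 2
    server web2 192.0.2.12:80 check inter 2s fall 3 rise 2
    server web3 192.0.2.13:80 check inter 2s fall 3 rise 2
```

For option 3 the same config could run on the web servers themselves behind a shared virtual IP (e.g. keepalived/VRRP), so the balancer itself isn't a new single point of failure.
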
tomhughes commented 2 years ago

Where do I start...

Obviously we can put a load balancer in front - that of course just moves the single point of failure, so you have to consider how to deal with a failed load balancer ;-)

Our current dynamic DNS would be too slow to react for a situation like the one which triggered this bout of hand wringing.

Reaction time in general is going to be an issue for the queue-full errors, I think, because that error page is quite slow to load, so detecting that failure mode from page loads is hard. We might be able to do it instead from Prometheus monitoring of the queues rather than by looking at the page loads.
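
If we went the Prometheus route, something along these lines might work - the metric name is an assumption, standing in for whatever the exporter actually exposes for request queue depth:

```yaml
# hypothetical alerting rule: fire when a frontend's request queue
# stays deep for a sustained period, rather than waiting for
# queue-full error pages to show up in page-load monitoring
groups:
  - name: frontend_queues
    rules:
      - alert: RequestQueueBackingUp
        expr: passenger_request_queue_length > 50   # assumed metric name
        for: 2m
        annotations:
          summary: "Request queue backing up on {{ $labels.instance }}"
```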

Load balancing is probably the last thing you want to do for those queue-full errors though. It would only help if the request traffic is reasonable but somehow unbalanced; much more likely is that a single client is sending a large number of very slow requests, in which case all load balancing will do is cascade the failure to the other servers by filling their queues as well.

The solution to the queue full errors is probably to find the underlying cause and add appropriate restrictions to stop people being able to flood the queue with slow requests.
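
As an example of the kind of restriction that helps here, a per-client concurrency cap at the proxy layer - an illustrative nginx sketch, assuming a reverse proxy in front of the Rails workers, which isn't necessarily how our frontends are actually arranged:

```
# hypothetical nginx fragment: cap concurrent in-flight requests per IP so a
# single client sending many very slow requests can't fill the whole queue
upstream rails_backend {
    server 127.0.0.1:3000;
}

limit_conn_zone $binary_remote_addr zone=peraddr:10m;

server {
    listen 80;

    location /api/ {
        limit_conn peraddr 4;      # at most 4 concurrent requests per client IP
        limit_conn_status 429;     # tell well-behaved clients to back off
        proxy_pass http://rails_backend;
    }
}
```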

Better load balancing would of course be useful in the case of something like a hardware failure taking a frontend out but that's probably a much rarer scenario.

tomhughes commented 2 years ago

In fact the incident on Sunday night filled the queues on all three machines so no amount of load balancing would have helped...

Something was causing a significant leap in CPU usage on the master database:

[graph: spike in CPU usage on the master database]

which in turn led to database access being slow and queues building up.

tomhughes commented 2 years ago

Big spikes (in the millions) in the number of index scans on gpx_files:

[graph: spikes in index scans on gpx_files]

so, as I would have guessed, somebody was doing something stupid with GPS traces.

tomhughes commented 2 years ago

The problem seems to have been somebody fetching the point cloud for a large, complex area with JOSM, where each query was timing out but JOSM just kept trying to fetch more and more pages of points.

The web requests were timing out after five minutes, but the actual database query wasn't cancelled and carried on running, putting load on the database and probably also keeping that Rails worker out of action, as the next request it took would block on the database.

I've added https://github.com/openstreetmap/openstreetmap-website/commit/09263bc4a1943a7e6bc8e50128b4667a66653cda to cancel database queries when a timeout occurs.
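
For reference, a cruder safety net with a similar effect can be set purely on the database side - a sketch, not what the commit above actually does, and the role name and limit are assumptions:

```sql
-- hypothetical: cap any statement from the web role at 5 minutes so
-- PostgreSQL itself cancels runaway queries instead of letting them
-- run on after the HTTP request has already timed out
ALTER ROLE web_frontend SET statement_timeout = '300s';
```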

matthewdarwin commented 2 years ago

https://dnscaster.com/ has a nice solution for managing DNS with monitors to take endpoints out of the cluster. (I have no affiliation with them, just one of their customers.)