openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/
Intermittent slow/error nominatim responses #1069

Closed Tuxrug closed 2 months ago

Tuxrug commented 2 months ago

Another user and I noticed errors from the nominatim API. @mtmail noticed that the nominatim.openstreetmap.org usage graphs show a lot of slow queries today and suggested reporting it here. It looks like this is clearing up, as I am now having trouble replicating it, but I am reporting it in case it needs further investigation.

Original issue: https://github.com/osm-search/Nominatim/issues/3405

tomhughes commented 2 months ago

It's just the US server I think - it looks like somebody is probably doing some scraping or something and it is overloaded.

I did try and investigate earlier when it was first reported but I couldn't find any sort of access log that would let me look for IPs to block so it probably needs @lonvia to deal with it.

lonvia commented 2 months ago

Looks like they've hit stormfly with 500 parallel connections. They are gone now, so feel free to close. I need to think about better monitoring for this kind of situation.

For historical reasons, the logs are in different locations on the different servers. I've now added a symlink in /var/log/nginx to make them easier to find next time.

tomhughes commented 2 months ago

I think the nftables rate limiting should have blocked that. More likely it was a small number of real connections with lots of multiplexed HTTP/2 streams, much like we saw on the main site some weeks ago.

That said, as you're using nginx, it has much better support for rate limiting that ought to be usable to block that sort of thing, I think.

Firefishy commented 2 months ago

Just a quick look at the logs...

Some other user-agents in the top 24 are: YourAppName, my_app, python-requests/x.y.z, Java/x.y.z_aaa, "Chome" and tutorial, which I would consider blocking as against the usage policy.

Here is my code:

# Creating the top IP list
grep -F '" 200 ' /var/log/nominatim/nominatim.openstreetmap.org-access.log \
  | cut -d ' ' -f 1 \
  | sort -S 25% --parallel=4 | uniq -c | sort -S 25% --parallel=4 -nr \
  | head -n 1000 | tee top-ips-nominatim-200response-20240501.txt

# Viewing sample queries from top IP list
for i in $(head -n 24 /home/grant/top-ips-nominatim-200response-20240501.txt | awk '{print $2}'); do
  echo "${i}"
  tac /var/log/nominatim/nominatim.openstreetmap.org-access.log | grep -F "${i}" | head -n 10
done | less

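A tally of user-agents like the one listed above could be produced along the same lines. This is only a minimal sketch, not necessarily how that list was made: the function name `top_user_agents` is hypothetical, and it assumes the access log uses nginx's "combined" format, where the user-agent is the sixth field when a line is split on double quotes.

```shell
# Hypothetical helper: tally user-agents from an access log read on stdin.
# Assumption: nginx "combined" log format, so splitting on double quotes
# puts the user-agent in field 6.
# Optional argument: how many entries to show (default 24).
top_user_agents() {
    awk -F'"' '{ print $6 }' | sort | uniq -c | sort -nr | head -n "${1:-24}"
}

# Example, using the log path from the commands above:
# top_user_agents 24 < /var/log/nominatim/nominatim.openstreetmap.org-access.log
```

Each output line is a request count followed by the user-agent string, most frequent first.
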
lonvia commented 2 months ago

I've checked the logs now. This was a mass geocoder using approx. 580 servers from Google Cloud, each sending requests at a rate of a bit less than 1 request/s.

We are pretty well set when it comes to limiting requests from single IPs. It's only when people start using botnets that things start failing. Thankfully it is rare to see it on this scale.
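
Because per-IP counts stay low in a distributed case like this, one way to make such traffic visible is to group requests by network prefix instead of by single address. The sketch below is an illustration under assumptions, not the analysis used in this thread: the function name `requests_by_prefix` is hypothetical, the choice of an IPv4 /16 prefix is arbitrary, and it assumes the client address is the first space-separated field of each log line.

```shell
# Hypothetical helper: per IPv4 /16 prefix, count total requests and
# distinct client IPs from an access log read on stdin.
# Assumption: the client address is the first space-separated field.
# Lines whose first field is not a dotted IPv4 address are skipped.
requests_by_prefix() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
        split($1, octet, ".")
        prefix = octet[1] "." octet[2] ".0.0/16"
        hits[prefix]++
        if (!((prefix, $1) in seen)) {
            seen[prefix, $1] = 1
            ips[prefix]++
        }
    }
    END {
        for (prefix in hits) print hits[prefix], ips[prefix], prefix
    }' | sort -k1,1nr
}

# Example, using the log path from earlier in the thread:
# requests_by_prefix < /var/log/nominatim/nominatim.openstreetmap.org-access.log | head
```

Each output line holds the request count, the number of distinct IPs, and the prefix. A prefix with many distinct IPs that each send a modest number of requests is the pattern that per-IP limits miss.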