openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Replication connection timeout #1097

Closed · BathoryPeter closed this issue 3 weeks ago

BathoryPeter commented 3 weeks ago

Osmosis fails to download minute diffs from the planet server. Not all, but most update attempts run into a connection timeout.

INFO: Reading current server state. [ReplicationState(timestamp=Mon Jun 10 09:20:03 CEST 2024, sequenceNumber=6127042)]
[2024-06-10 09:22:01] 117880 pid 117828 still running                                                    
[2024-06-10 09:23:01] 117958 pid 117828 still running                                                    
Jun 10, 2024 9:23:14 AM org.openstreetmap.osmosis.core.pipeline.common.ActiveTaskManager waitForCompletion
SEVERE: Thread for task 1-read-replication-interval failed                                               
org.openstreetmap.osmosis.core.OsmosisRuntimeException: Unable to read the state from the server.        
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:95)
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:60)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.download(BaseReplicationDownloader.java:218)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.runImpl(BaseReplicationDownloader.java:293)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.run(BaseReplicationDownloader.java:372)
        at java.base/java.lang.Thread.run(Thread.java:829)                                               
Caused by: java.net.ConnectException: Connection timed out (Connection timed out) 

Lowering the maxInterval in configuration.txt to 60 s helps a bit, but increasing it to 1 h always results in a timeout. I checked on my production server and on my local PC; the result was the same.

I first noticed the issue on 2024-06-09 at 0:15 UTC.
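
For context, the update job here is roughly the following kind of osmosis invocation; the working directory path and the output file are illustrative, and configuration.txt (holding baseUrl and maxInterval) lives inside that directory:

# Illustrative sketch only; the real paths and downstream pipeline differ.
WORKDIR=/var/lib/osmosis-replication   # contains configuration.txt and state.txt
osmosis --read-replication-interval workingDirectory=$WORKDIR \
        --simplify-change \
        --write-xml-change "$WORKDIR/changes.osc.gz"
# The "Unable to read the state from the server" failure above is raised while this task
# fetches the remote state.txt from baseUrl, before any diff files are downloaded.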

BathoryPeter commented 3 weeks ago

I can see a significant drop in S3 graphs.

tomhughes commented 3 weeks ago

Well you seem to have narrowed in on one tiny window - if you look at the last 24 hours it all looks normal, and all our replication feeds are running fine, so it seems to be an issue specific to your connection to AWS.

BathoryPeter commented 3 weeks ago

The issue is still present. With maxInterval=120, about half of the requests time out, and my replag is continuously increasing:

[replag graph, pinpoint=1717884000,1718005819]
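
(Replication lag of this kind is essentially the age of the timestamp recorded in the local working directory's state.txt; a rough way to check it from the shell, assuming GNU date and the standard osmosis state.txt layout - the path is illustrative:)

cd /var/lib/osmosis-replication                         # illustrative working directory
TS=$(grep '^timestamp=' state.txt | cut -d= -f2 | tr -d '\\')
echo "replication lag: $(( $(date +%s) - $(date -ud "$TS" +%s) )) seconds"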

My server connects from Frankfurt, but I'm experiencing the same problem here from Budapest, Hungary.

tomhughes commented 3 weeks ago

As I say, we have machines in at least six locations on five different networks that are pulling from the feed with no problem - they're using osmium rather than osmosis, of course, but I don't see why that would make a difference.

BathoryPeter commented 3 weeks ago

> Well you seem to have narrowed in on one tiny window

The drop on the graph coincides exactly with the first error in my logs.

Would a verbose osmosis log help?

tomhughes commented 3 weeks ago

> > Well you seem to have narrowed in on one tiny window
>
> The drop on the graph coincides exactly with the first error in my logs.

There are brief ups and downs all the time though - the long term average clearly doesn't show any significant decrease.

> Would a verbose osmosis log help?

No, it would not. None of us have used osmosis for years, and in any case it's a network timeout, so what exactly do you expect a verbose log to show? There is a problem with packets from your network getting to and/or from Amazon, and there is nothing much we can do to help with that.

BathoryPeter commented 3 weeks ago

Hmm, I tried replacing the baseUrl with the Amazon one, and that completely solved the problem:

#baseUrl=https://planet.openstreetmap.org/replication/minute/
baseUrl=https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/planet/replication/minute
maxInterval=3600
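
A quick way to compare the two endpoints is to time the state.txt fetch that osmosis performs first; the URLs below are just the two baseUrl values above with /state.txt appended (an illustrative check, assuming curl is available):

curl -sS -o /dev/null -w 'planet:  %{time_total}s\n' https://planet.openstreetmap.org/replication/minute/state.txt
curl -sS -o /dev/null -w 'aws s3:  %{time_total}s\n' https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/planet/replication/minute/state.txt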

tomhughes commented 3 weeks ago

So your problem is reaching he.net in Amsterdam, then, by the sounds of it.

Firefishy commented 3 weeks ago

@BathoryPeter Please could you run a traceroute planet.openstreetmap.org or an mtr --report-wide --report-cycles 10 planet.openstreetmap.org?

BathoryPeter commented 3 weeks ago

From Düsseldorf (not Frankfurt as I said earlier):

traceroute to planet.openstreetmap.org (184.104.179.145), 30 hops max, 60 byte packets
 1  ip-161-97-128-11.static.contabo.net (161.97.128.11)  1.512 ms  1.488 ms  1.510 ms
 2  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.566 ms 10.0.50.1 (10.0.50.1)  1.411 ms  1.279 ms
 3  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.351 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.772 ms  4.751 ms
 4  ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.723 ms e0-5.core2.fra1.he.net (216.66.87.197)  5.201 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.690 ms
 5  e0-5.core2.fra1.he.net (216.66.87.197)  5.677 ms  5.854 ms *
 6  * port-channel1.core3.fra1.he.net (184.104.198.26)  4.823 ms *
 7  port-channel2.core3.fra2.he.net (72.52.92.70)  5.476 ms * *
 8  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  8.809 ms  8.792 ms *
 9  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  13.858 ms * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

Start: 2024-06-10T10:51:17+0200
HOST: carto-map                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2a02:c206::a                                                0.0%    10    1.1   1.3   1.1   1.8   0.2
  2.|-- ge-7-0-6.bar1.Munich1.Level3.net                            0.0%    10    1.3   5.6   1.2  21.7   6.9
  3.|-- lo-0-0-v6.edge4.Frankfurt1.Level3.net                       0.0%    10    9.7   5.3   4.6   9.7   1.6
  4.|-- e0-6.core2.fra1.he.net                                     10.0%    10    5.8   6.1   5.3  10.4   1.6
  5.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  6.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  8.|-- openstreetmap-foundation.port-channel7.switch2.ams2.he.net  0.0%    10   13.3  11.4   8.8  13.8   1.9
  9.|-- norbert.openstreetmap.org                                   0.0%    10    7.5   7.6   7.5   8.0   0.1

mmd-osm commented 3 weeks ago

Similar reports from @pa5cal in https://community.openstreetmap.org/t/what-is-the-preferred-way-to-download-planet-diff-files/108854/10

pa5cal commented 3 weeks ago

Thanks for linking me here, @mmd-osm!

I also get a lot of connection timeouts when using Osmosis and other tools for minutely and changeset diffs.

[graph: other_diff_status-day]

Today I temporarily switched to https://download.openstreetmap.fr/replication/planet/minute
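
(For anyone wanting to do the same, that switch is just a baseUrl change in osmosis's configuration.txt; a one-line sketch, assuming the command is run from the replication working directory:)

# Illustrative: point the replication working directory at the openstreetmap.fr mirror.
sed -i 's|^baseUrl=.*|baseUrl=https://download.openstreetmap.fr/replication/planet/minute|' configuration.txt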

Firefishy commented 3 weeks ago

It appears there may have been an issue with Apache on the planet.openstreetmap.org webserver. The log was being flooded with "AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit." but Apache and the server otherwise appeared OK.

I have restarted Apache and the logged error has gone away for now.
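
(For reference, a rough way to spot and clear the same symptom on a stock Debian/Ubuntu Apache install; the log path and service name are assumptions about that layout, not the exact setup on the planet server:)

grep -c 'AH03490' /var/log/apache2/error.log   # count the scoreboard-full errors
sudo systemctl restart apache2                 # restarting cleared the error in this case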

BathoryPeter commented 3 weeks ago

I can confirm that the issue is gone.

pa5cal commented 3 weeks ago

Thank you very much, @Firefishy !

I have not had a single timeout in the last 15 minutes. In any case, my services are running normally again and the downloads are as fast as usual.

For your information: at least on my server, the timeouts described here occur about every three months and, as mentioned, disappear after about 24 hours. I don't know whether Apache was restarted each time or something similar happened.

Firefishy commented 3 weeks ago

I suspect this is due to a faulty version of Apache; we run a custom build to work around some other Apache bugs. We will move back to the distro release with Debian 12 and/or Ubuntu 24.04.

tomhughes commented 3 weeks ago

I don't think it's custom as such; it's just a backport of a later version.

Firefishy commented 3 weeks ago

Closing. If the issue returns, feel free to re-open the ticket.