openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Timeouts with Github Actions CI connecting to git.openstreetmap.org #381

Closed · Firefishy closed this issue 4 years ago

Firefishy commented 4 years ago

Currently all OpenStreetMap deployments use git.openstreetmap.org as the primary deployment source.

We have had ongoing timeout issues between the GitHub Actions CI tests and git.openstreetmap.org. See: https://github.com/openstreetmap/chef/pull/286

We should consider making GitHub the primary deployment source.

lonvia commented 4 years ago

Please, no. That just adds another point of failure (and one that we have zero control over, at that). And speaking of zero control: the GitHub openstreetmap organisation simply has too many admins to make this a reliable source.

If the CI is having problems, fix the CI or switch to one that works.

tomhughes commented 4 years ago

Well, unfortunately I don't administer the Azure network, so there's not much I can do to fix it. Do you have a concrete suggestion for something we can switch to that offers the same level of parallelism as GitHub Actions for free? Or even for a reasonable cost?

tomhughes commented 4 years ago

I'm also curious about where you see the problem with the five current admins of the openstreetmap organisation? There's only one I can see that might be an issue?

grischard commented 4 years ago

This does indeed add a potential point of failure, but it is an attempt by the sysadmins to work around a current, actual failure. Seems like a good compromise to me, and worth trying to see if it fixes the issue.

Will we still keep the two sides in sync? How much work would it be to switch back to git.openstreetmap.org if deploying from github fails?

tomhughes commented 4 years ago

Well it would make my life easier if I wasn't having to push everything to multiple places like I do now...

I've long thought the current situation is confusing and less than ideal and it's largely the result of political compromises at the time that we switched to git.

Switching back, or to some other git host, is trivial: it's just a matter of changing a few URLs in chef - see last week's PR for example.

The chef and dns repos will need to stay on git.osm.org because they are integrated with hooks that do deployment when we push - that could likely be fixed but would be a lot more work.

For chef it makes no difference anyway because the tests already use the GitHub-hosted version, and the DNS repo is small and I don't think causes many problems - it mostly seems to be the big repos that trigger it.

lonvia commented 4 years ago

@grischard The actual failure is in the CI, not on the operational side. "Compromising" on the production system side to fix a shortcoming in the CI system is... interesting. Now, if there are other reasons to change the current setup then that can certainly be discussed, but the original post gives the CI as the only reason, and I have very strong feelings about that.

As for fixing the current failure, I'd probably move the offending URLs into attributes and patch the repo paths when running chef under Test Kitchen or under the broken CI in question.

That all said, I'm not involved in this whole chef testing project and not going to be, so feel free to ignore me. Just something to be aware of: if you get to testing the Nominatim cookbook, git.osm.org and github have different versions of the software.

tomhughes commented 4 years ago

Repurposing this ticket slightly as a tracker for the issues with GitHub Actions connecting to git.openstreetmap.org...

To summarise the current position, we now have some minimal test cases, one using git and the other just using curl:

The multiple calls and the "sleep 2" in those are important - at two seconds they mostly work but occasionally fail. Reduce the delay and the second one will almost always fail.
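
For reference, a minimal sketch of the curl-based case (the shape below is an assumption; the real scripts live in the workflow runs linked above):

#!/bin/bash
# Two HTTPS requests to the same host in quick succession. With the 2 second
# gap the second request mostly works; shrink or remove the sleep and the
# second request almost always hangs until curl gives up.
set -x
curl -sS -o /dev/null --max-time 60 https://git.openstreetmap.org/
sleep 2
curl -sS -o /dev/null --max-time 60 https://git.openstreetmap.org/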

I have also now captured packet traces at our end when it fails. What I see is that we get a SYN for the second connection and answer it with a SYN+ACK, but then we just get retransmissions of the original packet, so the reply is obviously getting lost somewhere. That persists until the client times out the connection.

So it appears that if two connections in quick succession are made to the same port on the same server then the second one times out - it just so happens that a git resource in chef actually invokes git twice and the second one fails.
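
If anyone wants to repeat the capture, a sketch of the server-side command (the interface name is an assumption):

# Record only SYN / SYN+ACK segments for the HTTPS port; in the failing case
# the second connection's SYN+ACK never reaches the client and the client
# keeps retransmitting its SYN.
sudo tcpdump -ni eth0 -w git-timeout.pcap 'tcp port 443 and (tcp[tcpflags] & tcp-syn != 0)'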

Cases which I have tested and which don't trigger the problem:

Cases which do trigger it:

So it appears to be specific to making two connections to the same port on the same machine in that data centre, and at that point I would blame the firewall we are behind there - except that I can only reproduce it from an Actions job; making two connections from other places doesn't seem to trigger it!

grischard commented 4 years ago

I've done some further digging.

The problem can be reproduced on Ubuntu but can't be reproduced on macOS - this might very well be because these requests are coming from different data centres. The IP addresses returned by icanhazip.com in the Ubuntu runs aren't in the Azure IP list, but seem to be in East US. macOS runs at MacStadium.

The tcpdump on macOS shows exactly what you would expect. The tcpdump on Ubuntu seems strangely truncated?
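
For reference, a sketch of how a capture on the runner can be taken inside the failing job (file names are placeholders):

# Start a background capture, run the failing reproduction, then stop and dump it.
sudo tcpdump -ni any -w runner.pcap host git.openstreetmap.org &
TCPDUMP_PID=$!
sleep 1
# ... run the failing git/curl commands here ...
sleep 1
sudo kill "$TCPDUMP_PID"
sleep 1
sudo tcpdump -nr runner.pcap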

mmd-osm commented 4 years ago

In case fixing the firewall isn't possible for some reason, git can be automatically retried via some external wrapper script (assuming Test Kitchen supports this approach).
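
A rough sketch of what such a wrapper could look like (the path and retry policy are assumptions, and this would only go into the CI image, not production):

#!/bin/bash
# Hypothetical /usr/local/bin/git shim: retry the real git a few times with a
# short pause, to paper over the connection timeouts seen in CI.
for attempt in 1 2 3; do
    /usr/bin/git "$@" && exit 0
    echo "git failed (attempt $attempt), retrying..." >&2
    sleep 5
done
exit 1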

tomhughes commented 4 years ago

Oh I can set retries on the git resource in our chef recipes, but that will then apply in production as well, though that is maybe not a bad thing.

The problem though is that chef does two git calls within a single invocation of the resource and there is no way to insert a gap between those, other than monkey patching chef or doing crazy shit like installing a wrapper script for git into the image.

Firefishy commented 4 years ago

It is affecting a lot more than just git; let's treat the cause (the UCL firewall?) and not the symptoms.

grischard commented 4 years ago

Thanks to @chkimes and many other fantastic people at GitHub, we’ve been able to narrow it down to the host of git.openstreetmap.org (University College London) not allowing source ports 1025 and 1026. Very weird, but very reliably reproducible. 1024 works, 1027 works - probably just to confuse us :).

# hping3 --count 2 -S -s 1025 -p 443 git.openstreetmap.org
HPING git.openstreetmap.org (eth0 193.60.236.20): S set, 40 headers + 0 data bytes

--- git.openstreetmap.org hping statistic ---
2 packets transmitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms

This only works if you're running it from a non-NATed IP address, of course.

Next steps:

Firefishy commented 4 years ago

Wow! Thank you @grischard and @chkimes for the investigation!

Firefishy commented 4 years ago

This document specifically calls out blocking ports 1025 and 1026 (seemingly as destination ports): https://community.jisc.ac.uk/library/janet-services-documentation/blocking-lan-service-ports

It would be good to test whether some of the other ports on that page are also blocked.
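
One way to check them, reusing the hping3 approach from above (requires root and a non-NATed source address; extend the port list with whatever else that page mentions):

# Probe a few candidate source ports against git.openstreetmap.org:443.
for port in 1025 1026; do
    echo "=== source port $port ==="
    hping3 --count 3 -S -s "$port" --keep -p 443 git.openstreetmap.org
done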

mmd-osm commented 4 years ago

So the issue shouldn't appear anymore when manually setting the local port range to some different value (as root)?

echo "40000 60000" > /proc/sys/net/ipv4/ip_local_port_range

Also, ip_local_reserved_ports might be useful to specifically exclude source ports 1025 and 1026?
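
A minimal sketch of both knobs (run as root; note these only change which source ports the local machine picks, so they would help on hosts we control but not behind GitHub's NAT):

# Either move the ephemeral range well clear of the blocked ports...
echo "40000 60000" > /proc/sys/net/ipv4/ip_local_port_range
# ...or keep the default range but never hand out 1025 or 1026 as a source port.
echo "1025-1026" > /proc/sys/net/ipv4/ip_local_reserved_ports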

(fixing the firewall is still the best option)

tomhughes commented 4 years ago

That doesn't help with GitHub because it's the remapped post-NAT ports that are the issue and we can't control those.

mmd-osm commented 4 years ago

Ok, makes sense. For some reason, I'm only seeing the very first two packets being dropped (seq=0 and seq=1 are missing):

 sudo hping3 --count 10 -S -s 1025 -p 443 git.openstreetmap.org
HPING git.openstreetmap.org (eth0 193.60.236.20): S set, 40 headers + 0 data bytes
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=2 win=42340 rtt=57.9 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=3 win=42340 rtt=57.7 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=4 win=42340 rtt=57.5 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=5 win=42340 rtt=57.3 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=6 win=42340 rtt=57.1 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=7 win=42340 rtt=56.9 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=8 win=42340 rtt=56.7 ms
len=46 ip=193.60.236.20 ttl=51 DF id=0 sport=443 flags=SA seq=9 win=42340 rtt=56.5 ms

--- git.openstreetmap.org hping statistic ---
10 packets transmitted, 8 packets received, 20% packet loss
round-trip min/avg/max = 56.5/57.2/57.9 ms

Source port 1433 drops the first 2 packets, source port 1434 only the very first one.

grischard commented 4 years ago

hping3 auto-increments the source port. Your seq=2 will be source port 1027, etc.

Try --keep if you want to stick with one source port.
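
For example (same caveats as above about root and a non-NATed host):

# Pin every probe to source port 1025 instead of letting hping3 increment it.
sudo hping3 --count 5 -S -s 1025 --keep -p 443 git.openstreetmap.org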

mmd-osm commented 4 years ago

Reading manual pages really helps sometimes... 🤦‍♂️ So source ports 1433 and 1434 are also (permanently) blocked, as mentioned on that blocking-lan-service-ports page.

mmd-osm commented 4 years ago

7547 is also blocked and not mentioned on their page. I think those should then be all the blocked source ports in the 1000-60000 range. Strange stuff those folks are doing there.

 hping3 --count 10 -S -s 7547 --keep -p 443 git.openstreetmap.org
HPING git.openstreetmap.org (eth0 193.60.236.20): S set, 40 headers + 0 data bytes
^C
--- git.openstreetmap.org hping statistic ---
7 packets transmitted, 0 packets received, 100% packet loss

Port 7547 is used for the TR-069 protocol (aka CWMP, the CPE WAN Management Protocol). Again, they seem to have mixed up source and destination TCP ports here.

HolgerJeromin commented 4 years ago

Was someone confused about source and destination ports while setting up the firewall? Blocking traffic to 137 (NetBIOS) seems a very good idea, for example, but I am puzzled about what could be wrong with connections from 137 to the HTTPS port.

chkimes commented 4 years ago

Was someone confused about source and destination ports while setting up the firewall?

I've investigated a very similar issue before, and that's exactly what happened in that case.

Firefishy commented 4 years ago

"There is an ACL on our JANET connection which we've now amended so you should be able to connect from those source ports now, please test and confirm."

grischard commented 4 years ago

I can confirm that these source ports are now working. Thanks again to everyone who helped us pinpoint and fix this simple but tricky issue!

mmd-osm commented 4 years ago

They forgot about unblocking source port 7547.

grischard commented 4 years ago

They forgot about unblocking source port 7547.

Egads! @Firefishy, can you email your friend?

Firefishy commented 4 years ago

They forgot about unblocking source port 7547.

I cannot replicate the port 7547 issue; I suspect that it is local to you.

mmd-osm commented 4 years ago

Confirmed, it's a local thing, it works from another server:


sudo hping3 --count 10 -S -s 7547 --keep -p 443 git.openstreetmap.org
HPING git.openstreetmap.org (enp0s31f6 193.60.236.20): S set, 40 headers + 0 data bytes
len=46 ip=193.60.236.20 ttl=48 DF id=0 sport=443 flags=SA seq=0 win=42340 rtt=1039.9 ms
DUP! len=46 ip=193.60.236.20 ttl=48 DF id=0 sport=443 flags=SA seq=0 win=42340 rtt=2040.0 ms
DUP! len=46 ip=193.60.236.20 ttl=48 DF id=0 sport=443 flags=SA seq=0 win=42340 rtt=3035.9 ms
DUP! len=46 ip=193.60.236.20 ttl=48 DF id=0 sport=443 flags=SA seq=0 win=42340 rtt=4039.9 ms
DUP! len=46 ip=193.60.236.20 ttl=48 DF id=0 sport=443 flags=SA seq=0 win=42340 rtt=5040.0 ms

Firefishy commented 4 years ago

Thank you all for the help in getting this fixed. ❤️ https://twitter.com/OSM_Tech/status/1262686483288338433