NyanKiyoshi opened this issue 1 year ago
(pasting an older reply from the PSRT list):
The main thing I think we need to do from a CPython perspective:
Picking a default timeout value when unspecified would make sense only if this were deemed a serious issue. That could break existing code. A slow network DoS usually isn't that level of serious.
Even the bug fix to spread our timeout across the entire list could be disruptive to existing code if it assumes the existing behavior and expects the timeout to be per-host, which is why back-porting that kind of API change to old releases may be questionable for us.
Consider whether there is a definition of a "reasonable limit" for the number of addresses returned from a gethostbyname call that we could enforce. 500 addresses is clearly on the silly side of huge.
To add to this point: I have been looking into whether there is a limit on the number of A records a DNS query can return. The maximum number of IPs I was able to return was 4095 A records (using TCP plus the message compression feature from RFC 1035).
This would be a big problem for `timeout=None` (the default value), as it is then up to the system to decide when to give up on each IP address (60 s per IP when testing on Ubuntu 22.04, so the process would hang for roughly 68 hours).
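The 68-hour figure follows directly from those numbers:

```python
# Back-of-the-envelope worst case for timeout=None, assuming the
# OS-level connect timeout of 60 s per address observed on Ubuntu 22.04.
MAX_A_RECORDS = 4095       # TCP response + RFC 1035 message compression
OS_CONNECT_TIMEOUT_S = 60  # seconds spent per un-routable address

total_seconds = MAX_A_RECORDS * OS_CONNECT_TIMEOUT_S
print(total_seconds / 3600)  # → 68.25 (hours)
```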
> Picking a default timeout value when unspecified would make sense only if this were deemed a serious issue. That could break existing code. A slow network DoS usually isn't that level of serious.
We could also consider adding `sys.set_default_connect_timeout()` to allow changing the `None` default to whatever the application deems an acceptable value (somewhat similar to CVE-2020-10735, which added `sys.set_int_max_str_digits()`).
This would provide a safeguard for applications that accept arbitrary domains but may forget to set a non-null timeout.
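For comparison, `socket.setdefaulttimeout()` already exists today and lets an application replace the `None` default for all sockets created afterwards, including those made by `socket.create_connection()` when no timeout argument is passed; the proposed `sys.set_default_connect_timeout()` would be in a similar spirit, scoped to connects only:

```python
import socket

# socket.setdefaulttimeout() replaces the implicit None default for
# every socket created afterwards. It is a blunt instrument (it
# affects all socket operations, not just connect), but it already
# gives a process-wide safety net against hanging forever.
socket.setdefaulttimeout(5.0)

s = socket.socket()
print(s.gettimeout())  # → 5.0
s.close()
```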
A crazier idea:
We would still need to split the addresses into batches of limited size, because the number of open file descriptors is limited and because we do not want to make the Python application the source of a DDoS attack. But this would improve responsiveness in the more common cases: you would get a result faster when the first candidate is unavailable, and succeed on a slow network with a smaller timeout.
A "try to create a few connections at once" concept reminds me of the Happy Eyeballs algorithm (https://en.wikipedia.org/wiki/Happy_Eyeballs), which was basically put in place for user-facing, latency-sensitive areas to pick the connection that works without waiting for a timeout. Specifically, to choose between IPv6 and IPv4 so that v6 can be preferred, without consequences, when it appears to be working.
I'm generally wary of doing that kind of thing automatically, by default, in a low-level socket API. It is common for services to list multiple addresses in DNS, but that doesn't mean they want the load of multiple short-lived connection attempts from everybody who's trying to use them.
I'm not aware of people doing this for addresses of the same type (though I'm sure someone does). It is generally assumed that all addresses of one protocol are routeable if any are, and that they all go to high-availability frontends, so trying multiple at once would be optimizing for most services' least likely situation: the first address a client chose not responding.
## Bug report
This issue was initially reported to Python Security Team, and they decided to make the issue public.
The `socket.create_connection(address, timeout[, ...])` function can be abused to carry out DoS attacks if the `address` hostname resolves to many un-routable IP addresses.

The issue comes from the `timeout` parameter being wrongly applied in full to each IP address the function tries to connect to. As a result, the function can fail to respect the provided timeout value (see the CPython code). For example, if a hostname resolves to 20 un-routable IPs with `timeout=2`, the call would take 40 seconds instead of two.

This was introduced in Python 2.6 and affects all versions released since then.
## PoC
Attached is a minimal reproducible example with extra logging to help understand what Python is doing.

In the example:

- `socket.getaddrinfo()` is monkey-patched to make it return 10 non-routable IP addresses (this should work unless a firewall refuses connections instead of dropping them);
- the DoS is then triggered through a plain call to `socket.create_connection()`.

### Python Code
```python
import socket, sys, logging

# The number of IP addresses to return when resolving the hostname.
IP_COUNT = 10

# Extra logging.
def audit_hook(event, args):
    if event == "socket.connect":
        logging.warning(f"{event}: {args[-1]}")

logging.basicConfig(format="%(asctime)s %(levelname)s: %(message)s")
sys.addaudithook(audit_hook)

# Fake DNS resolving to make it easier to reproduce.
def getaddrinfo(*_args, **_kwargs):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.0", 80))] * IP_COUNT

socket.getaddrinfo = getaddrinfo

# Vulnerable code:
socket.create_connection(("dos.test", 80), timeout=2)
```

This results in Python trying each of the 10 IP addresses with a connect timeout of two seconds for each of them. Instead of `socket.create_connection()` taking only two seconds to give up, it takes 20 seconds, as shown below:

### Output (Python 3.10.6)
```
# time python3 poc.py
2023-06-26 08:40:20,971 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:22,976 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:24,980 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:26,986 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:28,990 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:30,993 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:32,996 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:35,001 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:37,004 WARNING: socket.connect: ('10.0.0.0', 80)
2023-06-26 08:40:39,011 WARNING: socket.connect: ('10.0.0.0', 80)
Traceback (most recent call last):
  File "//poc.py", line 18, in <module>
```

This was also successfully tested against 500 IP records on a real production DNS server; with the two-second timeout from the example above, it takes about 16 minutes to give up instead of two seconds, and the number of records could be set much higher.
If needed, a "real" DNS server can be used through another PoC, which uses bind9: https://github.com/NyanKiyoshi/poc-cpython-gh-106283#usage.
## Affected

Python >= 2.6 (including all 3.x versions)
Impact: any component that uses `socket.create_connection()` and accepts arbitrary hostnames could hang "forever" (such as `urllib`, `smtplib`, etc.).

## Proposed Fix
Go is a good example of how the issue could be resolved: following RFC 8305, it divides the connect timeout equally among all the IP addresses it has to try.
Key points (see the `net.sysDialer.dialSerial` and `net.partialDeadline` source code):

- the overall connect timeout is divided equally among all the addresses that remain to be tried;
- each attempt is given a sane minimum of two seconds when the overall budget allows it;
- that minimum never exceeds the time actually remaining: if given `timeout=0.5`, Go will only try for 0.5 seconds in total and ignore the RFC recommendation for the 2 s value.

For example, when giving Go v1.20.5 a timeout of 10 seconds, the following behavior happens:
Reproducible via https://github.com/NyanKiyoshi/poc-cpython-gh-106283#example-non-vulnerable-implementation.
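To make the division concrete, here is a minimal Python sketch of the per-attempt budget described above. This is an illustrative re-expression of Go's `net.partialDeadline` logic, not actual CPython or Go code:

```python
def partial_timeout(remaining: float, addrs_left: int,
                    sane_minimum: float = 2.0) -> float:
    """Time budget for the next connect attempt, mirroring the
    logic of Go's net.partialDeadline (illustrative sketch)."""
    timeout = remaining / addrs_left
    if timeout < sane_minimum:
        # Give the attempt at least ~2 s, but never more than the
        # total time actually remaining.
        timeout = min(sane_minimum, remaining)
    return timeout

print(partial_timeout(10.0, 5))  # → 2.0: 10 s split evenly over 5 addresses
print(partial_timeout(0.5, 5))   # → 0.5: total budget below the 2 s floor
print(partial_timeout(10.0, 2))  # → 5.0
```

With this scheme, a caller's total timeout is an upper bound on the whole call, which is exactly the property `socket.create_connection()` currently lacks.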
## Linked PRs