simonrob / email-oauth2-proxy

An IMAP/POP/SMTP proxy that transparently adds OAuth 2.0 authentication for email clients that don't support this method.
Apache License 2.0
786 stars 84 forks source link

sometimes doesn't recover from network errors on Linux #215

Closed jlevon closed 3 weeks ago

jlevon commented 8 months ago

I'm using the proxy as a systemd service, and if I, for example, unplug my router, then I get:

Dec 15 23:13:17 rent python3[481687]: Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type 
<class 'socket.gaierror'> with message: [Errno -2] Name or service not known

That's fine, of course, but the proxy never recovers from this state when networking is up again. I have to restart it.

jlevon commented 8 months ago

This was pilot error, sorry, I was misled by something indicating the network was resolving DNS when in fact it wasn't.

simonrob commented 8 months ago

No worries - thanks for following up.

jlevon commented 8 months ago

I might have spoken too soon, as I can currently see this:

jlevon@rent:~$ Dec 19 20:29:11 rent python3[117063]: Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known

jlevon@rent:~$ cat /etc/resolv.conf 
nameserver 1.1.1.1
jlevon@rent:~$ dig outlook.office365.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> outlook.office365.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35290
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;outlook.office365.com.     IN  A

;; ANSWER SECTION:
outlook.office365.com.  58  IN  CNAME   ooc-g2.tm-4.office.com.
ooc-g2.tm-4.office.com. 58  IN  CNAME   outlook.ms-acdc.office.com.
outlook.ms-acdc.office.com. 58  IN  CNAME   LHR-efz.ms-acdc.office.com.
LHR-efz.ms-acdc.office.com. 8   IN  A   52.97.212.66
LHR-efz.ms-acdc.office.com. 8   IN  A   40.100.174.34
LHR-efz.ms-acdc.office.com. 8   IN  A   52.97.133.162
LHR-efz.ms-acdc.office.com. 8   IN  A   40.99.201.146

;; Query time: 19 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Tue Dec 19 20:29:49 GMT 2023
;; MSG SIZE  rcvd: 199

is it possible negative lookup has somehow got cached by the process?

Note that in other cases where I lose connectivity, it does recover. This might have something to do with openconnect connection loss.

simonrob commented 8 months ago

Thanks for following up. This reminds me of issues I had myself when using the proxy on a macOS laptop – it wouldn't always regain connectivity after sleep, and it seemed to be a Python issue rather than anything that the proxy itself was doing.

I never managed to find the root cause, so as a workaround I added support for macOS's SCNetworkReachability interface. When network loss is detected, all of the proxy's active connections are closed.

I haven't seen the issue again since adding this functionality. Do you know of a similar mechanism for your operating system?

jlevon commented 8 months ago

I don't know offhand of any such API on Linux. Python itself shouldn't be caching DNS, I don't think.

Since this only happens sometimes, I'm going to try strace() to see if I can winkle out the circumstances. It's definitely related to VPN teardown, but happens (sometimes) even when DNS is pointing at the right server again.

simonrob commented 7 months ago

Coincidentally I just ran into something similar to this after using a patchy internet connection for a few weeks. The proxy got into a state where it worked fine when tokens hadn't yet expired, but those requiring renewal were unable to resolve the token URL – every request would time out, even when there was a reliable internet connection.

I wondered whether this was because the urlopen call did not set an explicit timeout, and potentially a large set of previous attempts had hung forever. I've added an explicit timeout in df9927f as a result.

This is just a theory, and it may well be incorrect (for example, it's a little strange that backed-up connections that don't themselves time out might cause future ones to time out). I don't really have a reliable way of testing it, but it did remind me of this issue, so I thought I'd mention it.

jlevon commented 7 months ago

I've been watching like a hawk for problems since filing this ticket and predictably it's behaved flawlessly. Nonetheless, I've updated to include that change. If I can reproduce any issue again I'll update.

simonrob commented 6 months ago

Just wondering whether this has happened again since the last update?

jlevon commented 6 months ago

Nope! Feel free to close. It's either your theory above, or I've just been lucky. Either way I can file a new one if needed. Thanks!

simonrob commented 6 months ago

Good to know - let's hope it doesn't happen again, but if it does, feel free to reopen this issue so we have the context.

jlevon commented 6 months ago

Caught it again

jlevon@rent:~$ dig outlook.office365.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> outlook.office365.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10304
;; flags: qr rd ra; QUERY: 1, ANSWER: 9, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;outlook.office365.com.     IN  A

;; ANSWER SECTION:
outlook.office365.com.  59  IN  CNAME   ooc-g2.tm-4.office.com.
ooc-g2.tm-4.office.com. 9   IN  A   52.97.133.226
ooc-g2.tm-4.office.com. 9   IN  A   52.97.211.210
ooc-g2.tm-4.office.com. 9   IN  A   52.97.146.194
ooc-g2.tm-4.office.com. 9   IN  A   40.99.213.34
ooc-g2.tm-4.office.com. 9   IN  A   52.98.207.18
ooc-g2.tm-4.office.com. 9   IN  A   40.100.174.210
ooc-g2.tm-4.office.com. 9   IN  A   40.99.150.146
ooc-g2.tm-4.office.com. 9   IN  A   52.97.133.194

;; Query time: 19 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Sat Feb 24 10:59:02 GMT 2024
;; MSG SIZE  rcvd: 211

jlevon@rent:~$ dig outlook.office365.com^C
jlevon@rent:~$ Feb 24 10:59:51 rent python3[233304]: Email OAuth 2.0 Proxy: Accepting new connection from [::ffff:127.0.0.1]:41930 to IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS)
Feb 24 10:59:51 rent python3[233304]: Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known

jlevon@rent:~$ curl -sL https://movementarian.org | head
<!DOCTYPE html>
<html lang="en-us">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="generator" content="Hugo 0.92.2" />

    <title>

I have a corefile if there's any chance post-mortem debugging could help, or I'm also happy to add any/all debug you might suggest to catch it again next time. (Unfortunately I rely on it so can't leave it broken for long!)

simonrob commented 6 months ago

Thanks for following up – yes, you might as well share the corefile just in case (though there's probably only a slim chance it'll be useful). I don't really have any suggestions about debugging you could add, and am still really puzzled as to what could be causing this, but will reopen the issue in case anyone else happens to encounter it.

One obvious solution is to hardcode the outlook.office365.com IP address, which I presume (other potential disadvantages aside) would stop this error occurring.

I would still like to stop this happening though, so I did a bit of searching – none of the leads were particularly useful, but there are a few suggestions:

The most useful thing would be to find a way to reliably replicate this, but perhaps one of these leads will get you there. Let's hope so!

jlevon commented 6 months ago

No unusual settings in gai.conf.

I've added much more debug to create_socket() - thanks. Will report back if and when it happens again. I'm a bit unclear on this code:

    def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
        # connect to whichever resolved IPv4 or IPv6 address is returned first by the system
        for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
            super().create_socket(a[0], socket.SOCK_STREAM)
            return

Why do you have a loop that will only ever iterate once here? Is this trying to account for len(a) == 0 or something?

jlevon commented 6 months ago

BTW I have a theory here, this particular host has no ipv6 connectivity. I suspect that in some situations when taking down openconnect, it ends up re-ordering such that ipv6 are first in the GAI results. When ipv4 connectivity is restored they end up at the end, but we never get to those working addresses.

simonrob commented 6 months ago

I think you're on to something here – initially this was a feature submitted in #140 by an external contributor, but it happened in parallel to my own development, so I merged the two versions. But it looks like I didn't fully capture the intent of that loop.

It might be worth trying the version below, then attempting to replicate the network issues you've outlined to see what debug output you get. If you see the Unable to create socket message at all, the current version of the proxy would have given a network error; this version will continue and try other resolved addresses:

    def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
        # look up address and create a socket for the first resolved IPv4 or IPv6 address that is successful
        for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
            try:
                super().create_socket(a[0], socket.SOCK_STREAM)
                return
            except OSError as e:
                Log.debug(self.info_string(), 'Unable to create socket', a, ':', e, '- trying next result')
        raise socket.gaierror(8, 'getaddrinfo failed: unable to resolve host')
jlevon commented 6 months ago

I'm running a chattier, fatal version of the above (since I want to catch it to prove the theory). It may take a long time to see it though.

I think you need to still re-raise the exception if you've run out of "a" to try BTW

jlevon commented 5 months ago

It was not that. With this diff:

diff --git a/emailproxy.py b/emailproxy.py
index 213735d..ff13c3d 100644
--- a/emailproxy.py
+++ b/emailproxy.py
@@ -1727,9 +1727,23 @@ class OAuth2ServerConnection(SSLAsyncoreDispatcher):

     def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
         # connect to whichever resolved IPv4 or IPv6 address is returned first by the system
-        for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
-            super().create_socket(a[0], socket.SOCK_STREAM)
-            return
+        Log.info("jlevon in create_socket()")
+        Log.info(f"{self.server_address=} {socket_family=}")
+        gai = socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM)
+        Log.info(f"got gai of {gai=}")
+
+        # look up address and connect, selecting the first resolved IPv4 or IPv6 address that is successful
+        for i, a in enumerate(gai):
+            try:
+                Log.info(f"create_socket for {a=}")
+                super().create_socket(a[0], socket.SOCK_STREAM)
+                Log.info(f"create_socket for {a=} succeeded at {i=}")
+                if i != 0:
+                    raise Exception("jlevon only succeeded at {i=}")
+                return
+            except OSError as e:
+                Log.debug(self.info_string(), 'Unable to create socket', a, ':', e, '- trying next result')
+

     def info_string(self):
         debug_string = self.debug_address_string if Log.get_level() == logging.DEBUG else \

Happy case:

Mar 10 11:22:23 lent python3[1310]: 2024-03-10 11:20:19: jlevon in create_socket()
Mar 10 11:22:23 lent python3[1310]: 2024-03-10 11:20:19: self.server_address=('outlook.office365.com', 993) socket_family=<AddressFamily.AF_UNSPEC: 0>
Mar 10 11:22:23 lent python3[1310]: 2024-03-10 11:20:19: got gai of gai=[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.99.204.194', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.101.81.130', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.101.80.178', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.99.204.98', 993)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c03:7067::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c03:7065::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c03:707f::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c03:7076::2', 993, 0, 0))]
Mar 10 11:22:23 lent python3[1310]: 2024-03-10 11:20:19: create_socket for a=(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.99.204.194', 993))
Mar 10 11:22:23 lent python3[1310]: 2024-03-10 11:20:19: create_socket for a=(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.99.204.194', 993)) succeeded at i=0

Unhappy case that I reproduced today:

Mar 10 16:53:08 lent python3[1310]: Email OAuth 2.0 Proxy: Accepting new connection from [::ffff:127.0.0.1]:37436 to IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS)
Mar 10 16:53:08 lent python3[1310]: Email OAuth 2.0 Proxy: jlevon in create_socket()
Mar 10 16:53:08 lent python3[1310]: Email OAuth 2.0 Proxy: self.server_address=('outlook.office365.com', 993) socket_family=<AddressFamily.AF_UNSPEC: 0>
Mar 10 16:53:09 lent python3[1310]: Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known
Mar 10 16:54:09 lent python3[1310]: Email OAuth 2.0 Proxy: Accepting new connection from [::ffff:127.0.0.1]:53824 to IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS)
...

That is the getaddrinfo() itself is raising the exception not any attempt to connect to a socket.

jlevon commented 5 months ago

Another python3 invocation at the same time:

python3 -c "import socket; print(socket.getaddrinfo('outlook.office365.com', 993, socket.AF_UNSPEC, socket.SOCK_STREAM))"
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('52.97.179.226', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.100.174.194', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('40.99.151.162', 993)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('52.98.207.130', 993)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c06:1408::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c06:182a::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c06:149a::2', 993, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2603:1026:c06:182e::2', 993, 0, 0))]

So this doesn't appear to be a problem generic to python in this case

jlevon commented 5 months ago

strace from the proxy process:

[pid  1552] sendto(3, "<14>Email OAuth 2.0 Proxy: jlevo"..., 53, 0, NULL, 0) = 53
[pid  1552] getpid()                    = 1310
[pid  1552] sendto(3, "<14>Email OAuth 2.0 Proxy: self."..., 121, 0, NULL, 0) = 121
[pid  1552] newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=593, ...}, 0) = 0
[pid  1552] openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 7
[pid  1552] newfstatat(7, "", {st_mode=S_IFREG|0644, st_size=661, ...}, AT_EMPTY_PATH) = 0
[pid  1552] lseek(7, 0, SEEK_SET)       = 0
[pid  1552] read(7, "127.0.0.1 localhost\n127.0.1.1 le"..., 4096) = 661
[pid  1552] read(7, "", 4096)           = 0
[pid  1552] close(7)                    = 0
[pid  1552] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid  1552] setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid  1552] connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("ELIDED")}, 16) = 0
[pid  1552] poll([{fd=7, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid  1552] sendto(7, "\250\227\1 \0\1\0\0\0\0\0\1\7outlook\toffice365\3c"..., 50, MSG_NOSIGNAL, NULL, 0) = 50
[pid  1552] poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLERR}])
[pid  1552] close(7)                    = 0
[pid  1552] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid  1552] setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid  1552] connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("ELIDED2")}, 16) = 0
[pid  1552] poll([{fd=7, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid  1552] sendto(7, "\250\227\1 \0\1\0\0\0\0\0\1\7outlook\toffice365\3c"..., 50, MSG_NOSIGNAL, NULL, 0) = 50
[pid  1552] poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLERR}])
[pid  1552] close(7)                    = 0

ELIDED and ELIDED2 are in RFC1918 10.0.0.0 and I suspect are the correct nameservers - but from when the VPN connection was up! They are no longer reachable but the python process is stuck on them.

jlevon commented 5 months ago
sudo resolvectl flush-caches

did not fix it, though I did not expect it to. Somehow this particular process has apparently managed to cache the previous resolvectl setting (since I expect the vpnc-script to modify things that way)

jlevon commented 5 months ago

Symptoms are a match for e.g. https://sourceware.org/bugzilla/show_bug.cgi?id=25420 but I am on 2.35 and that's fixed in 2.32

Nonetheless, I think I've proven to my satisfaction that this issue has nothing to do with emailproxy itself at all, so feel free to close the ticket if you like.

simonrob commented 5 months ago

Thanks for the detailed follow-ups. I agree with your conclusion that this is unrelated to the proxy, and is likely to be a problem whose root cause is somewhere else on the network stack.

I do think the proposed edit is worthwhile because it makes other potential network issues a bit more transparent. Before merging that, though, I wondered whether a small tweak to that approach might let us work around these issues from the proxy's side.

The current create_socket method had to be added to support IPv6. The previous version just called the underlying create_socket, which doesn't call getaddrinfo. Since I don't think you're using IPv6, it might be worth checking whether falling back to this works for you?

    def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
        # look up address and create a socket for the first resolved IPv4 or IPv6 address that is successful
        try:
            gai = socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM)
        except OSError as e:
            # see: https://github.com/simonrob/email-oauth2-proxy/issues/215 - getaddrinfo can fail
            Log.debug(self.info_string(), 'Falling back to default socket; getaddrinfo failed:', e)
            super().create_socket(socket.AF_INET, socket.SOCK_STREAM)
            return

        for address in gai:
            try:
                super().create_socket(address[0], socket.SOCK_STREAM)
                return
            except OSError as e:
                Log.debug(self.info_string(), 'Unable to create socket', address, ':', e, '- trying next result')

        raise socket.gaierror(8, 'All socket creation attempts failed: unable to resolve host')
jlevon commented 5 months ago

Yes, the fix for the exception loop should be made regardless; it's not related to this ticket, but it does look like a bug.

Re the proposed change, this would probably not help as all name resolution is broken at this point in glibc. I spent some time in its bowels and got this far:

https://sourceware.org/pipermail/libc-alpha/2024-March/155234.html

To my eyes this is a plain glibc bug but we'll see.

jlevon commented 5 months ago

Now filed here: https://sourceware.org/bugzilla/show_bug.cgi?id=31476

simonrob commented 1 month ago

This comment is just a reminder to myself that this issue can be closed once glibc 2.40 is released (scheduled for August 2024).

Thanks @jlevon for the fascinating blog post that documents the underlying issue in much more detail, and is well worth a read for anyone interested here.

simonrob commented 3 weeks ago

I've just seen that glibc 2.40 has been released, so will close this issue now that the underlying bug has been resolved. Thanks again @jlevon for the dedication to fixing this!