status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

SSL request failing within containers #34

Closed jakubgs closed 4 years ago

jakubgs commented 4 years ago

We're having an issue with SSL requests failing from within containers like this:

admin@mail-01.do-ams3.eth.test:~ % d exec -it statusd-mail sh
/ # curl -sv -XPOST https://google.com/
*   Trying 108.177.127.102:443...
* TCP_NODELAY set
* Connected to google.com (108.177.127.102) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to google.com:443 
* Closing connection 0

This was originally identified while working on Push Notifications via https://gorush.status.im/api/push.

jakubgs commented 4 years ago

I've tested using the same image locally on my machine and the requests work just fine:

 > d run --rm -it --entrypoint=/bin/bash statusteam/status-go:deploy-test
bash-5.0# apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
(1/3) Installing nghttp2-libs (1.40.0-r1)
(2/3) Installing libcurl (7.67.0-r0)
(3/3) Installing curl (7.67.0-r0)
Executing busybox-1.31.1-r9.trigger
OK: 9 MiB in 22 packages
bash-5.0# curl -s -XPOST https://gorush.status.im/api/push
{"code":400,"message":"Missing notifications field."}

jakubgs commented 4 years ago

I've tried upgrading the Docker images used when building statusteam/status-go to golang:1.14-alpine and tried downgrading to alpine:3.11.6, but neither seems to have any effect.

jakubgs commented 4 years ago

It appears the issue affects a clean Alpine image too:

admin@mail-01.do-ams3.eth.test:~ % d run --rm -it alpine:3.11.6
/ # apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r2)
(2/4) Installing nghttp2-libs (1.40.0-r1)
(3/4) Installing libcurl (7.67.0-r0)
(4/4) Installing curl (7.67.0-r0)
Executing busybox-1.31.1-r9.trigger
Executing ca-certificates-20191127-r2.trigger
OK: 7 MiB in 18 packages
/ # curl -sv -XPOST https://gorush.status.im/api/push
*   Trying 172.67.10.161:443...
* TCP_NODELAY set
* Connected to gorush.status.im (172.67.10.161) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to gorush.status.im:443 
* Closing connection 0

Same goes for alpine:latest.

jakubgs commented 4 years ago

Some related issues:

jakubgs commented 4 years ago

This is not specific to the Alpine image; Ubuntu does the same:

root@06e526a2afaf:/# cat /etc/issue
Ubuntu 20.04.1 LTS \n \l

root@06e526a2afaf:/# curl -vs https://google.com
*   Trying 74.125.128.138:443...
* TCP_NODELAY set
* Connected to google.com (74.125.128.138) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to google.com:443 
* Closing connection 0

jakubgs commented 4 years ago

It appears eth.staging has the same issue:

admin@mail-01.do-ams3.eth.staging:~ % d exec -it statusd-mail bash
...
bash-4.4# curl -sv https://google.com
* Expire in 0 ms for 6 (transfer 0x56190b976500)
...
* Expire in 2 ms for 1 (transfer 0x56190b976500)
*   Trying 108.177.127.139...
* TCP_NODELAY set
* Expire in 149997 ms for 3 (transfer 0x56190b976500)
* Expire in 200 ms for 4 (transfer 0x56190b976500)
* Connected to google.com (108.177.127.139) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to google.com:443 
* Closing connection 0

jakubgs commented 4 years ago

I've re-created the mail-01.do-ams3.eth.test host using the new Digital Ocean Ubuntu 20.04 image, ubuntu-20-04-x64, and now it appears to work on that host:

admin@mail-01.do-ams3.eth.test:~ % d run --rm -it alpine:3.11.6
/ # apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r2)
(2/4) Installing nghttp2-libs (1.40.0-r1)
(3/4) Installing libcurl (7.67.0-r0)
(4/4) Installing curl (7.67.0-r0)
Executing busybox-1.31.1-r9.trigger
Executing ca-certificates-20191127-r2.trigger
OK: 7 MiB in 18 packages
/ # curl -s -XPOST https://gorush.status.im/api/push
{"code":400,"message":"Missing notifications field."}

Which shows that the issue must be with the host system.

jakubgs commented 4 years ago

I've verified this issue is also present on boot-01.do-ams3.eth.test. I will replace the other waku/history nodes to fix this. I'll use the other ones to debug this further.

jakubgs commented 4 years ago

I tried purging and reinstalling ca-certificates but that did nothing.

I also tried setting:

DOCKER_OPTS="--mtu 1400"

In /etc/default/docker, as recommended in https://github.com/moby/moby/issues/2011, but that did nothing either.

jakubgs commented 4 years ago

I've replaced all the waku/history nodes in eth.test and it was fine, but now it's not fine again...

jakubgs commented 4 years ago

Here's the section of an strace of curl -s https://google.com:

getpid()                                = 26
getpid()                                = 26
write(3, "\26\3\1\2\0\1\0\1\374\3\3\274q\252\213\266\345{\257\341!\346\224\220\245\2550\243y[&u"..., 517) = 517
read(3, 0x55e4a05e1b03, 5)              = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN}], 1, 0)     = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN}], 1, 111)   = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6f716dd27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN}], 1, 1)     = 0 (Timeout)
...

And it repeats that until it hits:

rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f460350d27d}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f460350d27d}, NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM}])
read(3, "", 5)                          = 0
madvise(0x55d53ac04000, 1118208, MADV_DONTNEED) = 0
close(3)                                = 0

jakubgs commented 4 years ago

In comparison on the host system it looks like this:

getpid()                                = 2143
getpid()                                = 2143
write(5, "\26\3\1\2\0\1\0\1\374\3\3\353l\23\355\273?\32w\0032;\3623\3639g\10\336T\262."..., 517) = 517
read(5, 0x55c387b27183, 5)              = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f17458573c0}, NULL, 8) = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, 179) = 1 ([{fd=5, revents=POLLIN}])
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f17458573c0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f17458573c0}, NULL, 8) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=5, revents=POLLIN|POLLRDNORM}])
read(5, "\26\3\3\0z", 5)                = 5
read(5, "\2\0\0v\3\3\312\24V\335\361\322BP\21\257\364\335\t3%\307\213\257#\3415{Gsu\344"..., 122) = 122
read(5, "\24\3\3\0\1", 5)               = 5
read(5, "\1", 1)                        = 1
read(5, "\27\3\3\16]", 5)               = 5
read(5, "\247\316I \vE\257P\371S-\237\2646\251m\223\366u<\324\306\336\267\r|\241k\336\202\203\32"..., 3677) = 3677
stat("/etc/ssl/certs/99bdd351.0", 0x7fff40cdda30) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
read(6, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 118

And then continues on to read the response from the server.
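The octal-escaped strings in these strace dumps are raw TLS records, so they can be compared byte for byte with hex dumps from other tools. Below is a minimal sketch of a converter in POSIX sh. The function name `strace_bytes_to_hex` is made up for illustration, and it assumes the string contains no `%` characters, because the string is handed to printf as its format.

```sh
#!/bin/sh
# Hypothetical helper: turn a strace-style escaped string such as '\27\3\3\16]'
# into space-separated hex bytes.
# Assumption: the string has no '%' in it, since it is used as the printf format.
strace_bytes_to_hex() {
    # Unquoted on purpose: word splitting collapses od's padding and newlines.
    echo $(printf "$1" | od -An -tx1)
}

# Header of the third record the host received (copied from the trace above).
strace_bytes_to_hex '\27\3\3\16]'
# prints: 17 03 03 0e 5d
```

The last two bytes, 0e 5d, are the record length (0x0e5d = 3677), which matches the size of the read that follows in the host trace.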

jakubgs commented 4 years ago

I built the reproduction program from https://github.com/openssl/openssl/issues/10880 but it appears to not be that issue.

/repo # ./tls13test
read: 4080 (should be 4080)

jakubgs commented 4 years ago

Even just calling openssl fails:

/ # openssl s_client -cipher ALL -servername google.com -connect google.com:443
CONNECTED(00000003)
write:errno=0
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 400 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

jakubgs commented 4 years ago

What's interesting is that when I remove this line from /etc/docker/daemon.json:

  "userns-remap": "dockremap",

And disable user remapping and restart docker, then I get:

/ # curl -sv https://status.im
*   Trying 172.67.10.161:443...
* connect to 172.67.10.161 port 443 failed: Connection refused
*   Trying 104.22.24.181:443...
* connect to 104.22.24.181 port 443 failed: Connection refused
*   Trying 104.22.25.181:443...
* connect to 104.22.25.181 port 443 failed: Connection refused
*   Trying 2606:4700:10::6816:18b5:443...
* Immediate connect fail for 2606:4700:10::6816:18b5: Address not available
*   Trying 2606:4700:10::6816:19b5:443...
* Immediate connect fail for 2606:4700:10::6816:19b5: Address not available
*   Trying 2606:4700:10::ac43:aa1:443...
* Immediate connect fail for 2606:4700:10::ac43:aa1: Address not available
*   Trying 2606:4700:10::6816:18b5:443...
* Immediate connect fail for 2606:4700:10::6816:18b5: Address not available
*   Trying 2606:4700:10::6816:19b5:443...
* Immediate connect fail for 2606:4700:10::6816:19b5: Address not available
*   Trying 2606:4700:10::ac43:aa1:443...
* Immediate connect fail for 2606:4700:10::ac43:aa1: Address not available
* Failed to connect to status.im port 443: Connection refused
* Closing connection 0

And for a single IP:

/ # curl -sv https://172.67.10.161
*   Trying 172.67.10.161:443...
* connect to 172.67.10.161 port 443 failed: Connection refused
* Failed to connect to 172.67.10.161 port 443: Connection refused
* Closing connection 0

But I can ping it:

/ # ping -c4 172.67.10.161
PING 172.67.10.161 (172.67.10.161): 56 data bytes
64 bytes from 172.67.10.161: seq=0 ttl=59 time=2.613 ms
64 bytes from 172.67.10.161: seq=1 ttl=59 time=2.100 ms
64 bytes from 172.67.10.161: seq=2 ttl=59 time=1.129 ms
64 bytes from 172.67.10.161: seq=3 ttl=59 time=1.181 ms

--- 172.67.10.161 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1.129/1.755/2.613 ms

Which indicates some more severe issue.

jakubgs commented 4 years ago

I tried clearing all iptables rules with:

iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -F

But it did nothing, so it doesn't seem to be related to the firewall rules.
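One caveat with the commands above: without a `-t` option, iptables operates on the filter table only, so `iptables -F` leaves the nat, mangle, raw and security tables untouched. A small sketch that prints a rule-listing command for every table, so that none is overlooked, could look like this. The function name is hypothetical, and the commands are only printed, not run, since running them needs root.

```sh
#!/bin/sh
# Hypothetical helper: print one rule-listing command per netfilter table.
# Nothing is executed here; the security table may be absent on some kernels.
print_list_rules_cmds() {
    for table in filter nat mangle raw security; do
        echo "iptables -t $table -S"
    done
}

print_list_rules_cmds
```

Listing is deliberately used here instead of flushing, because flushing the nat table would also remove the rules Docker itself installs.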

jakubgs commented 4 years ago

This is interesting. The SSL port appears to be closed when checked with nmap from within the container:

/ # nmap -Pn -p80,443 108.177.127.139
Starting Nmap 7.80 ( https://nmap.org ) at 2020-08-25 14:27 UTC
Nmap scan report for 108.177.127.139
Host is up (0.00082s latency).

PORT    STATE  SERVICE
80/tcp  open   http
443/tcp closed https

Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds

But works fine from the host:

admin@mail-01.do-ams3.eth.test:~ % nmap -Pn -p80,443 108.177.127.139
Starting Nmap 7.80 ( https://nmap.org ) at 2020-08-25 14:27 UTC
Nmap scan report for 108.177.127.139
Host is up (0.0042s latency).

PORT    STATE SERVICE
80/tcp  open  http
443/tcp open  https

Nmap done: 1 IP address (1 host up) scanned in 0.06 seconds

And what's more, it appears as closed and not filtered, so it's not the firewall.
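To compare the two scans without eyeballing them, the port table can be pulled out of nmap's normal output. This is a rough sketch with a made-up function name, and it only handles `NNN/tcp` rows like the ones shown here.

```sh
#!/bin/sh
# Hypothetical helper: read nmap's normal output on stdin and print
# "port state" for every tcp row of the port table.
nmap_port_states() {
    awk '$1 ~ /^[0-9]+\/tcp$/ { print $1, $2 }'
}

# Sample rows copied from the container scan above.
printf '%s\n' \
    'PORT    STATE  SERVICE' \
    '80/tcp  open   http' \
    '443/tcp closed https' | nmap_port_states
```

Saving the result from the container and from the host to two files and running diff on them would show only the ports whose state differs.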

jakubgs commented 4 years ago

I tried connecting with netcat and watching with tcpdump and this is what I saw:

/ # nc 108.177.127.139 443
/ # 
14:41:02.960200 IP cbe9cb48413c.32773 > 108.177.127.139.443: Flags [S], seq 3358328996, win 64240, options [mss 1460,sackOK,TS val 1037680411 ecr 0,nop,wscale 7], length 0
14:41:02.960247 IP 108.177.127.139.443 > cbe9cb48413c.32773: Flags [R.], seq 0, ack 3358328997, win 0, length 0

Which clearly shows that we get back a RST packet, which makes no sense.
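For anyone not used to reading tcpdump output: the letters inside `Flags [...]` are TCP flags, with S for SYN, F for FIN, P for PSH, R for RST and a dot for ACK. A minimal sketch that spells them out follows. The function name is hypothetical and only these five flags are handled.

```sh
#!/bin/sh
# Hypothetical helper: read tcpdump lines on stdin and print the TCP flags
# of each line as names instead of single letters.
decode_tcpdump_flags() {
    sed -n 's/.*Flags \[\([^]]*\)\].*/\1/p' | while IFS= read -r flags; do
        out=""
        case "$flags" in *S*) out="$out SYN" ;; esac
        case "$flags" in *F*) out="$out FIN" ;; esac
        case "$flags" in *P*) out="$out PSH" ;; esac
        case "$flags" in *R*) out="$out RST" ;; esac
        case "$flags" in *.*) out="$out ACK" ;; esac
        echo "${out# }"
    done
}

# The two packets captured above.
printf '%s\n' \
    '14:41:02.960200 IP cbe9cb48413c.32773 > 108.177.127.139.443: Flags [S], seq 3358328996, win 64240, length 0' \
    '14:41:02.960247 IP 108.177.127.139.443 > cbe9cb48413c.32773: Flags [R.], seq 0, ack 3358328997, win 0, length 0' \
    | decode_tcpdump_flags
# prints: SYN
#         RST ACK
```

A SYN answered directly by RST plus ACK is what a connection attempt to a port with no listener looks like.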

cammellos commented 4 years ago

> Which clearly shows that we get back a RST packet, which makes no sense.

I think that's expected: if you send malformed data or the connection hangs, you will get back a RST. Or am I mistaken? (on 443)

jakubgs commented 4 years ago

But we haven't sent any data yet, this is just the TCP handshake being started with SYN and immediately aborted with RST.

Why does the remote abort it on first step? No clue yet.

jakubgs commented 4 years ago

Unless you mean the SYN packet is in some way malformed, which I guess is possible.

cammellos commented 4 years ago

Ah no, I thought you would get a full handshake and then receive a RST after a period of inactivity, which looked legit. But if this is straight after the SYN, it means that no one is listening on that port, so the OS is sending back a RST (as you pointed out, not filtered).

jakubgs commented 4 years ago

I can't tell if the issue with userns-remap disabled is caused by the same thing as when it's enabled, but my instinct tells me it is.

jakubgs commented 4 years ago

I turned userns-remap back on and tried to do some additional debugging with the openssl command.

When I run it in the container I see this:

/ # openssl s_client -connect www.test.com:443 -prexit -debug
CONNECTED(00000003)
write to 0x5595c5706d60 [0x5595c57f7a60] (314 bytes => 314 (0x13A))
0000 - 16 03 01 01 35 01 00 01-31 03 03 35 19 66 ea 22   ....5...1..5.f."
0010 - 55 af d3 f6 d5 07 8b d1-17 1e 5f 70 e5 e2 00 bf   U........._p....
0020 - c1 db 76 f1 05 d5 d1 b5-43 03 28 20 20 d1 1b 55   ..v.....C.(  ..U
0030 - 79 7f 4b 0b 29 0c 91 11-d0 0b 66 ad 9e 15 97 8b   y.K.).....f.....
0040 - 25 a6 08 7d 3e d6 5e a3-9f 35 f5 36 00 3e 13 02   %..}>.^..5.6.>..
0050 - 13 03 13 01 c0 2c c0 30-00 9f cc a9 cc a8 cc aa   .....,.0........
0060 - c0 2b c0 2f 00 9e c0 24-c0 28 00 6b c0 23 c0 27   .+./...$.(.k.#.'
0070 - 00 67 c0 0a c0 14 00 39-c0 09 c0 13 00 33 00 9d   .g.....9.....3..
0080 - 00 9c 00 3d 00 3c 00 35-00 2f 00 ff 01 00 00 aa   ...=.<.5./......
0090 - 00 00 00 11 00 0f 00 00-0c 77 77 77 2e 74 65 73   .........www.tes
00a0 - 74 2e 63 6f 6d 00 0b 00-04 03 00 01 02 00 0a 00   t.com...........
00b0 - 0c 00 0a 00 1d 00 17 00-1e 00 19 00 18 00 23 00   ..............#.
00c0 - 00 00 16 00 00 00 17 00-00 00 0d 00 30 00 2e 04   ............0...
00d0 - 03 05 03 06 03 08 07 08-08 08 09 08 0a 08 0b 08   ................
00e0 - 04 08 05 08 06 04 01 05-01 06 01 03 03 02 03 03   ................
00f0 - 01 02 01 03 02 02 02 04-02 05 02 06 02 00 2b 00   ..............+.
0100 - 09 08 03 04 03 03 03 02-03 01 00 2d 00 02 01 01   ...........-....
0110 - 00 33 00 26 00 24 00 1d-00 20 f1 1f 76 d9 fd 2f   .3.&.$... ..v../
0120 - e6 1d ee fb 52 01 a6 6d-ec 2d 0e 48 d5 8d 2f 96   ....R..m.-.H../.
0130 - d9 63 0f 0d a8 ff 10 8f-29 6c                     .c......)l
read from 0x5595c5706d60 [0x5595c57ee823] (5 bytes => 0 (0x0))
write:errno=0
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 314 bytes
Verification: OK
---

We can see that openssl is expecting to get back 5 bytes after it sends its first 314 bytes but gets nothing.

In comparison this is what happens from the host:

CONNECTED(00000003)
write to 0x5607cfd710d0 [0x5607cfd81510] (304 bytes => 304 (0x130))
0000 - 16 03 01 01 2b 01 00 01-27 03 03 bc 22 21 47 06   ....+...'..."!G.
0010 - f0 16 e5 42 52 de 10 9b-1e 25 09 0a a7 44 ff b3   ...BR....%...D..
0020 - 86 3e da 0f 8b cb c8 ef-ab d4 3a 20 af 4e aa 29   .>........: .N.)
0030 - 27 98 2e 3d 4f d0 28 53-15 6d 4c 99 bc a8 c2 e2   '..=O.(S.mL.....
0040 - bb dc 58 30 49 df da b3-ba 33 32 d2 00 3e 13 02   ..X0I....32..>..
0050 - 13 03 13 01 c0 2c c0 30-00 9f cc a9 cc a8 cc aa   .....,.0........
0060 - c0 2b c0 2f 00 9e c0 24-c0 28 00 6b c0 23 c0 27   .+./...$.(.k.#.'
0070 - 00 67 c0 0a c0 14 00 39-c0 09 c0 13 00 33 00 9d   .g.....9.....3..
0080 - 00 9c 00 3d 00 3c 00 35-00 2f 00 ff 01 00 00 a0   ...=.<.5./......
0090 - 00 00 00 11 00 0f 00 00-0c 77 77 77 2e 74 65 73   .........www.tes
00a0 - 74 2e 63 6f 6d 00 0b 00-04 03 00 01 02 00 0a 00   t.com...........
00b0 - 0c 00 0a 00 1d 00 17 00-1e 00 19 00 18 00 23 00   ..............#.
00c0 - 00 00 16 00 00 00 17 00-00 00 0d 00 2a 00 28 04   ............*.(.
00d0 - 03 05 03 06 03 08 07 08-08 08 09 08 0a 08 0b 08   ................
00e0 - 04 08 05 08 06 04 01 05-01 06 01 03 03 03 01 03   ................
00f0 - 02 04 02 05 02 06 02 00-2b 00 05 04 03 04 03 03   ........+.......
0100 - 00 2d 00 02 01 01 00 33-00 26 00 24 00 1d 00 20   .-.....3.&.$... 
0110 - af 6c 19 76 95 96 c5 1d-d2 12 78 d8 dc 80 35 ec   .l.v......x...5.
0120 - cb 4e 32 d2 ca bf ec 78-3c ff cf cc 14 79 be 3f   .N2....x<....y.?
read from 0x5607cfd710d0 [0x5607cfd78203] (5 bytes => 5 (0x5))
0000 - 16 03 03 00 45                                    ....E
read from 0x5607cfd710d0 [0x5607cfd78208] (69 bytes => 69 (0x45))
0000 - 02 00 00 41 03 03 04 23-72 89 5b ff e9 89 cd 80   ...A...#r.[.....
...

We can see that it receives the 5 bytes as expected and then continues to read. It seems the issue is that it times out after 5 seconds, which I assume is the default openssl s_client timeout.
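The 5 bytes openssl waits for are the TLS record header: one byte of content type, two bytes of (legacy) protocol version and two bytes of payload length. Here is a minimal sketch of a decoder in POSIX sh. The function name is hypothetical and only the four standard content types are mapped.

```sh
#!/bin/sh
# Hypothetical helper: decode a TLS record header given as five hex bytes.
# Usage: tls_record_header 16 03 03 00 45
tls_record_header() {
    case "$1" in
        14) type="change_cipher_spec" ;;
        15) type="alert" ;;
        16) type="handshake" ;;
        17) type="application_data" ;;
        *)  type="unknown" ;;
    esac
    # The length is a big-endian 16-bit integer in bytes four and five.
    length=$(( 0x$4 * 256 + 0x$5 ))
    echo "type=$type version=0x$2$3 length=$length"
}

# Header bytes the host received in the dump above.
tls_record_header 16 03 03 00 45
# prints: type=handshake version=0x0303 length=69
```

The decoded length of 69 matches the 69-byte read that follows the header in the host output.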

jakubgs commented 4 years ago

For reference, when userns-remap is disabled the error from openssl is:

/ # openssl s_client -connect www.test.com:443 -prexit -debug
140351122832712:error:0200206F:system library:connect:Connection refused:crypto/bio/b_sock2.c:110:
140351122832712:error:2008A067:BIO routines:BIO_connect:connect error:crypto/bio/b_sock2.c:111:
connect:errno=111
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 0 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

jakubgs commented 4 years ago

And if I run the container with --network=host the issue disappears.

jakubgs commented 4 years ago

Ohhh fuck, I know what it is...

It's the NAT rule we added to make status-go also receive connections on 443, because it can't listen on it via settings: https://github.com/status-im/infra-eth-cluster/blob/10cf72e080795b3e90d310738fc0d41606207cb9/ansible/roles/statusd-mailsrv/tasks/firewall.yml#L17-L28

And here is how it looks:

admin@mail-01.do-ams3.eth.test:~ % sudo iptables -L PREROUTING -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
REDIRECT   tcp  --  anywhere             anywhere             tcp dpt:https /* Redirect 443 to 30504 */ redir ports 30504
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

And as a command:

admin@mail-01.do-ams3.eth.test:~ % sudo grep Redirect /etc/iptables/rules.v4
-A PREROUTING -p tcp -m tcp --dport 443 -m comment --comment "Redirect 443 to 30504" -j REDIRECT --to-ports 30504

And it's all because I couldn't be bothered to specify the interface when I created this almost two years ago... this guy is a fucking idiot
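Without an `-i` match, a PREROUTING rule applies to packets arriving on any interface. That presumably includes the Docker bridge, so outbound container connections to port 443 were being redirected to local port 30504 as well. A rough sketch of a check that flags such rules in iptables-save style output follows. The function name is made up, and the `-i` test is a plain text match rather than a real rule parser.

```sh
#!/bin/sh
# Hypothetical helper: read iptables-save style rules on stdin and print
# PREROUTING REDIRECT rules that have no input-interface (-i) match.
find_unscoped_redirects() {
    grep -e '^-A PREROUTING ' | grep -e '-j REDIRECT' | grep -v -e ' -i '
}

# The rule as it looks in rules.v4 above.
printf '%s\n' \
    '-A PREROUTING -p tcp -m tcp --dport 443 -m comment --comment "Redirect 443 to 30504" -j REDIRECT --to-ports 30504' \
    | find_unscoped_redirects
```

On a host this could be fed from the rules file, for example `sudo cat /etc/iptables/rules.v4 | find_unscoped_redirects`. Any line it prints is a redirect that also catches traffic not arriving on the public interface.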

jakubgs commented 4 years ago

With 5cf97a3f7b6498bc7abf9b4bd667c11f36fe6472 I changed the rule to look like this:

admin@mail-01.do-ams3.eth.test:~ % sudo grep Redirect /etc/iptables/rules.v4
-A PREROUTING -i eth0 -p tcp -m tcp --dport 443 -m comment --comment "Redirect 443 to 30504" -j REDIRECT --to-ports 30504

The port still looks correctly open:

 > sudo nmap -Pn -p443,30504 mail-01.do-ams3.eth.test  
Starting Nmap 7.80 ( https://nmap.org ) at 2020-08-25 19:15 CEST
Nmap scan report for mail-01.do-ams3.eth.test (206.189.243.161)
Host is up (0.031s latency).

PORT      STATE SERVICE
443/tcp   open  https
30504/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.18 seconds

And now it works:

/ # curl -s -XPOST https://gorush.status.im/api/push
{"code":400,"message":"Missing notifications field."}

Holy fuck that was dumb.

jakubgs commented 4 years ago

I've deployed the fixed rules to all three fleets.