Exposing and Circumventing China's Censorship of ESNI

Authors: Kevin Bock, iyouport, Anonymous, Louis-Henri Merino, David Fifield, Amir Houmansadr, Dave Levin

Date: Friday, August 7, 2020

This report first appeared on censorship.ai. We also maintain an up-to-date copy of the report on iyouport, gfw.report, net4people and ntc.party.

On 2020-07-30, iyouport reported (archive) the apparent blocking of TLS connections with the encrypted SNI (ESNI) extension in China. iyouport says that the first occurrence of blocking was one day earlier, on 2020-07-29.

We confirm that the Great Firewall (GFW) of China has recently begun blocking ESNI—one of the foundational features of TLS 1.3 and HTTPS. We empirically demonstrate what triggers this censorship and how long residual censorship lasts. We also present several evasion strategies discovered by Geneva that can be run either client-side or server-side to evade blocking.

What is Encrypted Server Name Indication (ESNI)?

TLS is the foundation of secure communication on the web (HTTPS). It provides authenticated encryption so that users can know with whom they are communicating, and that their information cannot be read or tampered with by an intermediary. Although TLS hides the content of a user's communication, it does not always hide with whom the user is communicating; the TLS handshake optionally contains a Server Name Indication (SNI) field that allows the user's client to inform the server which website it wishes to communicate with. Nation-state censors have used the SNI field to block users from being able to communicate with certain destinations. China, for one, has long been censoring HTTPS in this manner.

TLS 1.3 introduced Encrypted SNI (ESNI) that, put simply, encrypts the SNI so that intermediaries cannot view it. (To learn more about ESNI and its benefits, see Cloudflare's article). ESNI has the potential to complicate nation-states' abilities to censor HTTPS content; rather than be able to block only connections to specific websites, ESNI would require censors to block all TLS connections to specific servers. We do confirm that this is now happening in China!

Our Main Findings

The GFW blocks ESNI connections by dropping packets from client to server.
The blocking can be triggered bidirectionally.
The 0xffce extension is necessary to trigger the blocking.
The blocking can happen on all ports from 1 to 65535.
Once the GFW blocks a connection, it will continue blocking all traffic associated with the 3-tuple of (srcIP, dstIP, dstPort) for 120 or 180 seconds.
We have discovered 6 client-side and 4 server-side evasion strategies.

How Do We Know These?

We have made a simple Python program that performs the following:

completes a TCP handshake with a specified server;
and then sends a TLS ClientHello message with an ESNI extension; the fingerprint of the ClientHello is as normal as what Firefox 79.0 would send.

The program sends ClientHellos with ESNI both inside-out and outside-in, while capturing traffic on both sides for analysis. The servers to which we send ClientHellos complete the TCP handshake, but they do not send any data packets back to the client, nor are they the first to close the connection. All experiments were conducted between July 30th and August 6th.

Details About the Blocking

Blocking by dropping packets, not injecting RSTs

Comparing the traffic captured on both endpoints, we find the GFW blocks ESNI connections by dropping packets from clients to servers.

This has two differences from how the GFW censors other commonly-used protocols. First, the GFW censors (non-encrypted) SNI and HTTP by injecting forged TCP RSTs to both server and client; conversely, we have observed no injected packets from the GFW to censor ESNI traffic. Second, the GFW drops traffic from server to client to block Tor and Shadowsocks servers; however, it drops only client-to-server packets when censoring ESNI.

We further note the GFW does not distinguish the flags of TCP packets when dropping them. (This is different from some censorship systems in Iran which do not drop packets with RST or FIN flags.)

The blocking can be triggered bidirectionally

We find the blocking can be triggered bidirectionally. In other words, sending an ESNI handshake from outside the firewall to inside can get blocked in the same way as sending it inside-out.

Thanks to this bidirectional feature, one can test this ESNI-based censorship remotely from the outside of the GFW without having control of any Chinese server. The GFW's censorship on DNS, HTTP, SNI, FTP, SMTP, and Shadowsocks can also be measured outside-in.

The GFW censors ESNI, but not omit-SNI

We confirm a TLS ClientHello without ESNI/SNI extensions cannot trigger the blocking. In other words, the 0xffce extension type (encrypted_server_name) is necessary to trigger the blocking.

We tested this by replacing the 0xffce in a triggering ClientHello with 0x7777. After the replacement, sending such a ClientHello could not trigger the blocking anymore.

This confirmation is important because some censors have been observed blocking any ClientHello message without the SNI extension, which would result in the blocking of both ESNI and omit-SNI connections.

New extension values are not blocked

As noted by an anonymous reviewer on the riseup pad, the currently deployed ESNI uses extension value 0xffce (see Section 8.1). However, the newer ECH uses extension values 0xff02, 0xff03, and 0xff04 (Section 11.1). We confirm no censorship has been observed on these extension values yet.

Specifically, we replace the 0xffce in a triggering ClientHello with the values of 0xff02, 0xff03, and 0xff04 respectively. And no blocking is observed after sending such modified ClientHellos.
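The extension-type replacement used in this and the previous section can be sketched in a few lines of Python. This is only an illustration, not the script we ran: it starts from the 64-byte minimal trigger payload quoted in the reproduction steps later in this report, and the helper names `first_extension_offset` and `swap_extension_type` are hypothetical. Note that it is a plain substitution of the 2-byte type, so the extension body keeps the ESNI layout.

```python
import struct

# Minimal trigger payload from the reproduction steps in this report (64 bytes).
TRIGGER_HEX = (
    "160303003b0100003703035b72616e646f6d72616e646f6d72616e646f6d7261"
    "6e646f6d72616e646f6d5d0000000100000effce000a53754772000000000000"
)


def first_extension_offset(hello: bytes) -> int:
    """Return the byte offset of the first extension's 2-byte type field."""
    pos = 5 + 4                      # TLS record header + handshake header
    pos += 2 + 32                    # legacy_version + random
    pos += 1 + hello[pos]            # session_id (1-byte length prefix)
    (suites_len,) = struct.unpack_from(">H", hello, pos)
    pos += 2 + suites_len            # cipher_suites (2-byte length prefix)
    pos += 1 + hello[pos]            # compression_methods (1-byte length prefix)
    pos += 2                         # total extensions length
    return pos


def swap_extension_type(hello: bytes, new_type: int) -> bytes:
    """Overwrite the type of the first extension, leaving its body untouched."""
    off = first_extension_offset(hello)
    return hello[:off] + struct.pack(">H", new_type) + hello[off + 2:]


hello = bytes.fromhex(TRIGGER_HEX)
# One modified ClientHello per tested value: 0x7777 plus the three ECH values.
variants = {t: swap_extension_type(hello, t) for t in (0x7777, 0xFF02, 0xFF03, 0xFF04)}
```

Each entry of `variants` is then sent in place of the original ClientHello after a normal TCP handshake.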

A complete TCP handshake is required before triggering the blocking

We find a complete TCP handshake is necessary in order to trigger the ESNI blocking.

We conducted two experiments from the outside to a server in China. In the first experiment, without sending any SYN packet, our client sent one naked ClientHello message with ESNI extension every 2 seconds. In the second experiment, our client sent a SYN packet and a ClientHello message with ESNI extension; but the server would not respond with any packet (not even to complete the TCP three-way handshake).

In total, we sent 10 ClientHello messages in each experiment. The result shows no blocking or residual censorship was ever triggered; all ClientHello messages reached the server. This means a TCP handshake is necessary before triggering ESNI-based censorship. It also indicates, similar to the SNI-based censorship by the GFW, the censorship machine for ESNI is stateful.

The blocking can happen on all ports

We find the ESNI blocking can happen not only on port 443, but on all ports from 1 to 65535.

Specifically, we sent two ESNI handshakes in a row to each port from 1 to 65535 of a Chinese server from the outside. For each port, we first sent an ESNI handshake; then, after the connection timed out (after 20 seconds), we tried to complete a TCP handshake with the server again. If we did not receive a SYN+ACK from the server the second time, we considered the censorship to have occurred on that port. As a result, the ESNI blocking was observed on all ports from 1 to 65535.

This feature allows us to test ESNI censorship efficiently, as we can run tests on multiple ports of the same IP address simultaneously.

Residual Censorship

We find that the GFW employs "residual censorship" of ESNI connections. This means that, for some amount of time after triggering censorship for a given connection, it will continue blocking any connections with the same 3-tuple of source IP, destination IP, and destination port.

The precise duration of residual censorship appears to vary by vantage point. We observed residual censorship for 120 seconds at two of our vantage points, and 180 seconds at another vantage point.

Sending additional ESNI handshakes during residual censorship time does not reset the timer of the censoring machine. This is similar to the previously observed residual censorship on SNI-based blocking of the GFW. (Conversely, each additional packet sent while residual censorship is in effect in Iran resets the timer.)
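The timer behavior described above can be captured in a toy model. This is only a sketch of the observable behavior, not a claim about how the GFW is implemented; the class name `ResidualCensorModel` and the `reset_on_retrigger` switch (which mimics the behavior reported for Iran) are hypothetical, and the IP addresses are documentation placeholders.

```python
from typing import Dict, Tuple

FlowKey = Tuple[str, str, int]  # (source IP, destination IP, destination port)


class ResidualCensorModel:
    """Toy model of per-3-tuple residual censorship (illustrative only)."""

    def __init__(self, duration_s: float, reset_on_retrigger: bool = False) -> None:
        self.duration_s = duration_s
        self.reset_on_retrigger = reset_on_retrigger
        self._blocked_until: Dict[FlowKey, float] = {}

    def is_blocked(self, key: FlowKey, now: float) -> bool:
        return now < self._blocked_until.get(key, float("-inf"))

    def trigger(self, key: FlowKey, now: float) -> None:
        # A trigger seen during an active block only restarts the timer
        # when reset_on_retrigger is set; otherwise it is ignored.
        if self.reset_on_retrigger or not self.is_blocked(key, now):
            self._blocked_until[key] = now + self.duration_s


key = ("192.0.2.1", "198.51.100.7", 443)

gfw_like = ResidualCensorModel(duration_s=120)
gfw_like.trigger(key, now=0)
gfw_like.trigger(key, now=60)    # ignored: the timer is already running

resetting = ResidualCensorModel(duration_s=120, reset_on_retrigger=True)
resetting.trigger(key, now=0)
resetting.trigger(key, now=60)   # restarts the timer, block now ends at t=180
```

In the first model the block on `key` ends at t=120 regardless of the second trigger, and a different destination port on the same hosts is never blocked.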

These findings are partially based on the following experiment. From the outside, we sent one ClientHello message per second to port 443 of a Chinese server. The 1st, 2nd, and 121st TCP handshakes were accepted. All other handshake attempts were unsuccessful because the SYNs did not reach the server.

This result shows, similar to previously discovered SNI-based residual censorship, the GFW also employs residual censorship for ESNI. In addition, the fact that the second handshake could complete means that it takes at least 1 second for the GFW to react and enable the blocking rules.
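The arithmetic behind these two conclusions can be written out as a short helper. This is a sketch under the assumptions of the experiment above (probes are numbered from 1, sent at a fixed interval, and the first probe is the one that triggers blocking); `summarize_probe_run` is a hypothetical name, and the resolution of both numbers is limited by the probe interval.

```python
from typing import List, Tuple


def summarize_probe_run(accepted: List[int], interval_s: float = 1.0) -> Tuple[float, float]:
    """Derive timing bounds from the 1-based indices of accepted handshakes.

    Returns (reaction_lower_bound_s, block_lifted_by_s), both measured from
    the first probe.
    """
    if not accepted or accepted[0] != 1:
        raise ValueError("the first probe must be accepted; it triggers the blocking")
    # Leading run of consecutive accepted probes: blocking was not yet active.
    last_before_block = 1
    for idx in accepted[1:]:
        if idx == last_before_block + 1:
            last_before_block = idx
        else:
            break
    resumed = [i for i in accepted if i > last_before_block]
    if not resumed:
        raise ValueError("blocking never lifted during the run")
    reaction_lower_bound = (last_before_block - 1) * interval_s
    block_lifted_by = (resumed[0] - 1) * interval_s
    return reaction_lower_bound, block_lifted_by


# Observed run: the 1st, 2nd, and 121st handshakes were accepted.
print(summarize_probe_run([1, 2, 121]))  # (1.0, 120.0)
```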

How Can We Circumvent the Blocking?

Geneva (Genetic Evasion) is a genetic algorithm developed by those of us at the University of Maryland that automatically discovers new censorship evasion strategies. Geneva manipulates packet streams—injecting, altering, fragmenting, and dropping packets—in a manner that bypasses censorship without impacting the original underlying connection. Unlike most other anti-censorship systems, Geneva does not require deployment at both sides of the connection: it runs exclusively at one side (client or server).

Geneva trains its genetic algorithm against live censors, and to date has found dozens of censorship evasion strategies in various countries. Geneva's strategies are expressed in a domain-specific language. Details of the language, along with the entire Geneva codebase, are available at the Geneva GitHub repository.

To learn more about how Geneva (or the Geneva strategy engine) works under the hood, see our papers or about page.

To allow Geneva to train directly against the GFW's ESNI censorship, we wrote a custom plugin that performs the following steps:

1. Geneva starts a TCP server on a random open port on a vantage point located outside of China. By randomizing our ports, we do not need to worry about residual censorship.
2. Geneva drives a TCP client located inside of China to connect to the server.
3. The client sends a TLS 1.3 ClientHello with the Encrypted SNI extension.
4. The client sleeps for 2 seconds to allow the GFW censorship to kick in.
5. The client sends a short test message "test" to test if it has been censored.
6. Steps 4 & 5 are repeated.
7. The server confirms that it receives both the full TLS ClientHello from the client and the test messages. If it does, the strategy is rewarded with a positive fitness; if not (or if the client timed out while sending its test messages), the strategy is punished.
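The reward rule in the last step can be sketched as a small function. This is a simplified illustration, not the plugin's actual code: the `TrialResult` fields and the reward and penalty values are hypothetical, and Geneva's real fitness function takes more into account.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    server_got_client_hello: bool   # full TLS ClientHello arrived at the server
    test_messages_sent: int         # "test" messages the client tried to send
    test_messages_received: int     # "test" messages the server actually saw
    client_timed_out: bool          # client gave up while sending test messages


def fitness(result: TrialResult, reward: int = 100, penalty: int = -100) -> int:
    """Reward a strategy only if the whole exchange survived the censor."""
    evaded = (
        result.server_got_client_hello
        and not result.client_timed_out
        and result.test_messages_sent > 0
        and result.test_messages_received == result.test_messages_sent
    )
    return reward if evaded else penalty


print(fitness(TrialResult(True, 2, 2, False)))   # 100
print(fitness(TrialResult(True, 2, 0, True)))    # -100
```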

With this, Geneva discovered multiple evasion strategies in just a few hours. We describe them in detail below.

The Geneva strategy engine is open source on our Github.

All of these strategies can be run with our open-source Geneva strategy engine (repository). Since they operate at the TCP layer, they can be applied to any application that needs to use ESNI: with Geneva running, even an unmodified web browser can become a simple censorship evasion tool.

Note that Geneva is not designed as a general purpose evasion tool, and does not provide any additional encryption, privacy, or protection. It is a research prototype and it is not optimized for speed. Use these strategies at your own risk.

Evasion strategies

We trained Geneva over the span of 48 hours, both client- and server-side. In total, we discovered 6 strategies to defeat the ESNI censorship: all 6 work from the client, and 4 of them also work from the server.

The following are TCP-layer strategies that can defeat the ESNI censorship when applied exclusively at the client-side.

Strategy 1: Triple SYN

The first client strategy works by initiating the TCP 3-way handshake with three SYN packets, such that the sequence number of the third SYN is corrupted.

In Geneva's syntax, this strategy looks like this: [TCP:flags:S]-duplicate(duplicate,tamper{TCP:seq:corrupt})-| \/

This strategy performs a desynchronization attack against the Great Firewall. The GFW synchronizes on the corrupt sequence number, so it misses the ESNI request.

This strategy can also be applied from the server-side:

[TCP:flags:SA]-tamper{TCP:flags:replace:S}(duplicate(duplicate,tamper{TCP:seq:corrupt}),)-| \/

Although this strategy makes it so the server never sends a SYN+ACK packet, this does not break the three-way handshake. During the three-way handshake, instead of the server sending a SYN+ACK packet as usual, the server instead sends three SYN packets (the third with a corrupt sequence number).

The first SYN packet serves to initiate a TCP Simultaneous Open, an archaic feature of TCP supported by all major operating systems to handle the case in which two TCP stacks send a SYN packet at the same time. When the client receives a SYN from the server, the client sends a SYN+ACK packet, and the server responds with an ACK to complete the handshake. This effectively changes the traditional three-way handshake to a four-way handshake. The SYN with the corrupt sequence number causes the GFW to desynchronize (but is ignored by the client), successfully defeating censorship without harming the connection.
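To make the client-side action tree concrete, here is a plain-Python model of what `duplicate(duplicate,tamper{TCP:seq:corrupt})` emits for one outbound SYN. It is a sketch, not Geneva's engine: `TcpPacket`, `corrupt_seq`, and `triple_syn` are hypothetical names, and modeling `corrupt` as "replace with a different random 32-bit value" is an assumption made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class TcpPacket:
    flags: str
    seq: int


def corrupt_seq(pkt: TcpPacket, rng: random.Random) -> TcpPacket:
    """Replace the sequence number with a different random 32-bit value."""
    new_seq = pkt.seq
    while new_seq == pkt.seq:
        new_seq = rng.getrandbits(32)
    return replace(pkt, seq=new_seq)


def triple_syn(pkt: TcpPacket, rng: random.Random) -> List[TcpPacket]:
    """Model of [TCP:flags:S]-duplicate(duplicate,tamper{TCP:seq:corrupt})-|"""
    if pkt.flags != "S":
        return [pkt]                 # trigger does not match: send unchanged
    left, right = pkt, pkt           # outer duplicate: two copies
    # Left copy is duplicated again (two identical SYNs);
    # right copy has its sequence number corrupted.
    return [left, left, corrupt_seq(right, rng)]


out = triple_syn(TcpPacket(flags="S", seq=1000), random.Random(0))
```

The list `out` holds three SYN packets: two with the original sequence number followed by one with a corrupted sequence number.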

Strategy 2: Four Byte Segmentation

The next strategy we discover can also be used from client or server. In this strategy, the client sends the ESNI request across two TCP segments, such that the first TCP segment is less than or equal to 4 bytes long.

From the client-side, in Geneva's syntax this strategy looks like this: [TCP:flags:PA]-fragment{tcp:4:True}-| \/

This is not the first time Geneva has discovered segmentation strategies, but it is surprising that this strategy works in China. The Great Firewall has been famous for its ability to reassemble TCP segments for almost a decade now (see brdgrd). The TLS header is 5 bytes long, so by segmenting specifically the TLS header across multiple packets, we hypothesize this breaks the GFW's ability to fingerprint ESNI packets as TLS. This has interesting implications for how the GFW fingerprints connections: it suggests the component of the GFW that performs connection fingerprinting cannot reassemble TCP segments for all protocols. This theory is supported by other segmentation-based strategies identified by Geneva in the past (see this paper).

This strategy can also be triggered from the server-side. By reducing the TCP window size during the 3-way handshake, a server can force the client to segment their request. In Geneva's syntax, this can be accomplished with: [TCP:flags:SA]-tamper{TCP:window:replace:4}-| \/.
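The effect of the client-side segmentation on the byte stream can be shown with a short sketch. It only models how one TCP payload is split into two in-order segments; it does not send packets, and `segment_at` is a hypothetical helper rather than part of Geneva.

```python
from typing import List, Tuple


def segment_at(payload: bytes, seq: int, offset: int = 4) -> List[Tuple[int, bytes]]:
    """Split one TCP payload into two in-order segments at `offset` bytes.

    Returns (sequence number, data) pairs; the payload is returned whole
    when the offset does not fall strictly inside it.
    """
    if not 0 < offset < len(payload):
        return [(seq, payload)]
    return [
        (seq, payload[:offset]),
        ((seq + offset) % 2**32, payload[offset:]),
    ]


# A TLS record: 5-byte header (type, version, length) followed by 59 bytes of body.
record = bytes.fromhex("160303003b") + b"\x00" * 59
segments = segment_at(record, seq=1000)
```

With an offset of 4, the first segment carries only `16 03 03 00`, so the 5-byte TLS record header is never present in a single packet, while the receiver still reassembles the original record.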

Strategy 3: TCB Teardown

The next strategy is a classic TCB (TCP Control Block) Teardown: the client injects a RST packet with a broken checksum into the connection. This tricks the GFW into thinking the connection has been torn down.

In Geneva's syntax, this strategy looks like: [TCP:flags:A]-duplicate(,tamper{TCP:flags:replace:RA}(tamper{TCP:chksum:corrupt},))-| \/

TCB Teardowns are not new: they were demonstrated almost a decade ago by Khattak et al., and Geneva has discovered Teardown attacks repeatedly in the past against the GFW.

Surprisingly, this strategy can also be induced from the server-side. During the three-way handshake, the server can send a SYN+ACK packet with a corrupt acknowledgement number, thereby inducing the client to send a RST. The induced RST has an incorrect sequence number (and an acknowledgement number of 0), but it is still sufficient to cause a TCB Teardown.

Strategy 4: FIN+SYN

The next strategy appears to be another desynchronization attack, but via a different attack vector. In this strategy, the client (or the server) sends a packet with the FIN and SYN flags both set during the three-way handshake. For the client, in Geneva's syntax: [TCP:flags:A]-duplicate(tamper{TCP:flags:replace:FS},)-| \/ For the server, in Geneva's syntax: [TCP:flags:SA]-duplicate(tamper{TCP:flags:replace:FS},)-| \/

In the past, we have found that, for other protocols, the GFW handles FIN packets specially when it resynchronizes. In this case, it looks like the presence of the FIN causes the GFW to immediately resynchronize, but the presence of the SYN causes it to think the sequence number is 1 higher than the actual value, leaving the GFW off by 1 from the real connection.

We tested this hypothesis by incrementing the sequence number of the actual request by 1 while this strategy was running, and saw that the client got censored.

From the server-side, the FIN flag is not required for this strategy to work.

Strategy 5: TCB Turnaround

The TCB Turnaround strategy is simple: before the client initiates the three-way handshake, it first sends a SYN+ACK packet to the server. The SYN+ACK causes the GFW to confuse the roles of the client and server, thereby allowing the client to communicate unimpeded. TCB Turnaround attacks still work in Kazakhstan, but turnaround attacks do not work against the GFW for any other protocols.

In Geneva's syntax: [TCP:flags:S]-duplicate(tamper{TCP:flags:replace:SA},)-| \/

This strategy is client-only, since by the time the SYN packet arrives at the server, the censor already knows which side is the client.

Strategy 6: TCB Desynchronization

Finally, Geneva identified simple payload-based TCB desynchronization. From the client, injecting a packet with a payload and a broken checksum is sufficient to desynchronize the GFW from the connection. Geneva has identified these in the past against the GFW's censorship of other protocols as well.

In Geneva's syntax: [TCP:flags:A]-duplicate(tamper{TCP:load:replace:AAAAAAAAAA}(tamper{TCP:chksum:corrupt},),)-|

This strategy cannot be used from the server-side.

Summary on Circumvention Strategies

In total, we have discovered 6 strategies that work from the client-side, and 4 that work from the server-side. Each of these works with near 100% reliability, and can be used to evade the ESNI censorship. Unfortunately, these specific strategies may not be a long-term solution: as the cat and mouse game progresses, the Great Firewall will likely continue to improve its censorship capabilities.

Unresolved Questions

It is not yet clear why we observe different durations of residual censorship from different vantage points. As with all such research, it is also possible that there are some regions of China that are affected in different ways than our vantage points. If you observe different behavior or that some of our evasion strategies do not work, please feel free to contact us!

Thanks

We want to thank all anonymous reviewers who offered us valuable and immediate questions, feedback and suggestions on the riseup pad. These comments guided us to prioritize the questions that interest the community the most; and thus greatly accelerated our research.

We are also thankful to the OONI and OTF community for all of their support.

Contacts

Geneva team:

Kevin Bock (PGP key)
Dave Levin (PGP key)

GFW Report:

Anonymous (PGP key)
Amir Houmansadr (PGP key)

EDIT: Triggering an ESNI block from the outside no longer works, since 2020-08-13 06:32. See below.

Because the GFW's ESNI detection is bidirectional, you can easily experiment with the blocking yourself, even if you are located outside of China.

Here is a short payload that triggers blocking. (ffce is the ESNI extension that the GFW is looking for.)

160303003b0100003703035b72616e646f6d72616e646f6d72616e646f6d7261
6e646f6d72616e646f6d5d0000000100000effce000a53754772000000000000

1. Choose any responsive TCP host in China. It doesn't have to be port 443. For example, www.tsinghua.edu.cn:80.

2. Start TCP-pinging the port using, for example, hping or Nping. You will see responses to your pings.

$ sudo hping3 -S www.tsinghua.edu.cn -p 80
HPING www.tsinghua.edu.cn (eth0 166.111.4.100): S set, 40 headers + 0 data bytes
len=44 ip=166.111.4.100 ttl=102 id=56332 sport=80 flags=SA seq=0 win=2105 rtt=279.8 ms
len=44 ip=166.111.4.100 ttl=102 id=63224 sport=80 flags=SA seq=1 win=2105 rtt=283.7 ms
len=44 ip=166.111.4.100 ttl=102 id=19078 sport=80 flags=SA seq=2 win=2105 rtt=287.5 ms
len=44 ip=166.111.4.100 ttl=102 id=1977 sport=80 flags=SA seq=3 win=2105 rtt=279.3 ms
len=44 ip=166.111.4.100 ttl=102 id=41003 sport=80 flags=SA seq=4 win=2105 rtt=283.1 ms


$ nping -4 -c 0 --tcp-connect www.tsinghua.edu.cn -p 80

Starting Nping 0.7.70 ( https://nmap.org/nping )
SENT (0.0752s) Starting TCP Handshake > www.tsinghua.edu.cn:80 (166.111.4.100:80)
RCVD (0.3682s) Handshake with www.tsinghua.edu.cn:80 (166.111.4.100:80) completed
SENT (1.0778s) Starting TCP Handshake > www.tsinghua.edu.cn:80 (166.111.4.100:80)
RCVD (1.3598s) Handshake with www.tsinghua.edu.cn:80 (166.111.4.100:80) completed
SENT (2.0798s) Starting TCP Handshake > www.tsinghua.edu.cn:80 (166.111.4.100:80)
RCVD (2.3584s) Handshake with www.tsinghua.edu.cn:80 (166.111.4.100:80) completed
SENT (3.0818s) Starting TCP Handshake > www.tsinghua.edu.cn:80 (166.111.4.100:80)
RCVD (3.3580s) Handshake with www.tsinghua.edu.cn:80 (166.111.4.100:80) completed
SENT (4.0840s) Starting TCP Handshake > www.tsinghua.edu.cn:80 (166.111.4.100:80)
RCVD (4.3603s) Handshake with www.tsinghua.edu.cn:80 (166.111.4.100:80) completed

3. Send the trigger payload.

printf '\x16\x03\x03\x00\x3b\x01\x00\x00\x37\x03\x03[randomrandomrandomrandomrandom]\x00\x00\x00\x01\x00\x00\x0e\xff\xce\x00\nSuGr\x00\x00\x00\x00\x00\x00' | nc -4 -v www.tsinghua.edu.cn 80


4. Now you will stop receiving replies to your pings for 120 or 180 seconds.

<details>
<summary>Python 3 program to generate trigger payload</summary>

<pre>
#!/usr/bin/env python3

# Generates a small TLS ClientHello that triggers the GFW's ESNI detector.
# Writes output to the file minimal.bin.
#
# You can send the ClientHello with, for example,
#     nc -v www.tsinghua.edu.cn 443 < minimal.bin

import struct

from scapy.all import *
load_layer("tls")
from scapy.layers.tls.all import *

# https://tools.ietf.org/html/rfc8446#section-3.4
def var(ceiling, data):
    if ceiling < 256:
        fmt = ">B"
    elif ceiling < 65536:
        fmt = ">H"
    else:
        raise ValueError(ceiling)
    return struct.pack(fmt, len(data)) + data

# https://datatracker.ietf.org/doc/html/draft-ietf-tls-esni-01#section-5
def encrypted_server_name(suite, group, key_exchange, record_digest, encrypted_sni):
    return struct.pack(">HH", suite, group) \
        + var(65535, key_exchange) \
        + var(65535, record_digest) \
        + var(65535, encrypted_sni)

clienthello = TLS(
    msg = TLSClientHello(
        gmt_unix_time = 0x5b72616e, # "[ran"
        random_bytes = b"domrandomrandomrandomrandom]",
        ciphers = [],
        ext = [
            # The GFW detector requires a syntactically valid
            # server_name_extension, but the actual values it contains may be
            # nonsense. Here we use a CipherSuite of 0x5375 ("Su"), a NamedGroup
            # of 0x4772 ("Gr"), and zero-length key_exchange, record_digest, and
            # encrypted_sni.
            TLS_Ext_Unknown(type=0xffce, val=encrypted_server_name(0x5375, 0x4772, b"", b"", b"")),
        ],
    )
)

TLS(bytes(clienthello)).show()
print(bytes(clienthello))

FILENAME = "minimal.bin"
open(FILENAME, "wb").write(bytes(clienthello))
print("output written to {}".format(FILENAME))
</pre>
</details>

Specifically, we replace the 0xffce in a triggering ClientHello with the values of 0xff02, 0xff03, and 0xff04 respectively. And no blocking is observed after sending such modified ClientHellos.

The GFW requires the 0xffce extension to be syntactically correct. Did you also try to produce syntactically correct 0xff02, 0xff03, and 0xff04 extensions, or did you simply replace 0xffce with each of those values? If it was simple replacement, then it leaves open the possibility that the GFW would block the newer ECH extensions if they were syntactically valid.

To show you what I mean, the syntax of the encrypted_server_name extension is:

 2 bytes  suite
 2 bytes  group
 2 bytes  key_exchange length
variable  key_exchange
 2 bytes  record_digest length
variable  record_digest
 2 bytes  encrypted_sni length
variable  encrypted_sni

Typical observed values when using Firefox and a Cloudflare TLS server are suite=0x1301 (TLS_AES_128_GCM_SHA256), group=0x001d (X25519), key_exchange length=32, record_digest length=32, and encrypted_sni length=292 (260 padded bytes plus 32 bytes of AEAD tag). The short trigger payload has suite=0x5375 ("Su"), group=0x4772 ("Gr"), and key_exchange length, record_digest length, and encrypted_sni length all zero, so it's syntactically correct although meaningless. But in my testing, if you increase encrypted_sni length, for example, without also appending the same amount of data to the extension, it will not trigger blocking.
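The layout above translates directly into a small parser. This is a sketch of what "syntactically correct" means for the extension body, with hypothetical function names; whether the GFW applies exactly these checks is unknown, and only the encrypted_sni length observation comes from my testing.

```python
import struct


def parse_encrypted_server_name(body: bytes) -> dict:
    """Parse the encrypted_server_name extension body; raise ValueError if malformed."""
    if len(body) < 4:
        raise ValueError("too short for suite and group")
    suite, group = struct.unpack_from(">HH", body, 0)
    pos = 4
    fields = {}
    for name in ("key_exchange", "record_digest", "encrypted_sni"):
        if pos + 2 > len(body):
            raise ValueError(f"missing length of {name}")
        (length,) = struct.unpack_from(">H", body, pos)
        pos += 2
        if pos + length > len(body):
            raise ValueError(f"{name} length overruns the extension")
        fields[name] = body[pos:pos + length]
        pos += length
    if pos != len(body):
        raise ValueError("trailing bytes after encrypted_sni")
    return {"suite": suite, "group": group, **fields}


def is_syntactically_valid(body: bytes) -> bool:
    try:
        parse_encrypted_server_name(body)
        return True
    except ValueError:
        return False


# Body of the short trigger payload: "Su", "Gr", and three zero-length fields.
minimal = bytes.fromhex("53754772000000000000")
# Claim 5 bytes of encrypted_sni without appending them: malformed.
overlong = minimal[:-2] + b"\x00\x05"
# Append the 5 bytes that the length field promises: well-formed again.
repaired = overlong + b"\x00" * 5
```

Here `minimal` and `repaired` pass the check while `overlong` does not, which mirrors the case where increasing encrypted_sni length without appending data did not trigger blocking.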

Thank you @wkrp for sharing this step by step tutorial on reproducing the ESNI blocking.

Did you also try to produce syntactically correct 0xff02, 0xff03, and 0xff04 extensions, or did you simply replace 0xffce with each of those values?

We simply replaced the 0xffce with 0xff02, 0xff03 and 0xff04.

If it was simple replacement, then it leaves open the possibility that the GFW would block the newer ECH extensions if they were syntactically valid.

We agree that this is a very good point and a valid concern. We will test the GFW against valid ECHs and get back to this thread.

But in my testing, if you increase encrypted_sni length without also appending the same amount of data to the extension, it will not trigger blocking.

This is a very interesting finding. It reminds us of a similar finding we had when trying to determine the GFW's minimal triggering condition for SNI-based censorship.

It is a bit off-topic but let us share our findings along with the (dirty) code here:

```python3
import struct

"""Extension - Server Name

The client has provided the name of the server it is contacting, also known as
SNI (Server Name Indication).

Without this extension a HTTPS server would not be able to provide service for
multiple hostnames on a single IP address (virtual hosts) because it couldn't
know which hostname's certificate to send until after the TLS session was
negotiated and the HTTP request was made.

00 00 - assigned value for extension "server name"
00 18 - 0x18 (24) bytes of "server name" extension data follows
00 16 - 0x16 (22) bytes of first (and only) list entry follows
00 - list entry is type 0x00 "DNS hostname"
00 13 - 0x13 (19) bytes of hostname follows
65 78 61 ... 6e 65 74 - "example.ulfheim.net"
"""
def construct_sni(server_name):
    extension_value = bytearray.fromhex("0000")
    bytes_followed_data = struct.pack('>H', len(server_name) + 5)
    bytes_followed_entry = struct.pack('>H', len(server_name) + 3)
    list_entry_type = bytearray.fromhex("00")
    bytes_followed_hostname = struct.pack('>H', len(server_name))
    server_name_hex = bytearray(server_name, 'utf-8')
    return extension_value + bytes_followed_data + bytes_followed_entry + list_entry_type + bytes_followed_hostname + server_name_hex

# source: https://tls.ulfheim.net/
# Minimal ClientHello to trigger the payload.
def contruct_clienthello(server_name):
    record_header = ""
    handshake_header = ""

    # The GFW is not checking if the extension length value is correct or not.
    # But they must have the length.
    CLIENT_VERSION = bytearray.fromhex("0303")
    CLIENT_RANDOM = bytearray.fromhex("000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f")
    SESSION_ID = bytearray.fromhex("00")
    CIPHER_SUITE = bytearray.fromhex("0020cca8cca9c02fc030c02bc02cc013c009c014c00a009c009d002f0035c012000a")
    COMPRESSION_MEHTODS = bytearray.fromhex("0100")

    extensions_length = struct.pack('>H', len(construct_sni(server_name)))

    # The GFW checks the Handshake Header (second section): num of bytes of ClientHello follows.
    # If correct_length - 3 <= x <= correct_length, trigger RST.
    handshake_header = bytearray.fromhex("0100") + struct.pack('>H', 80 + len(server_name))

    # The GFW checks the Record Header (first section): total length.
    # If >= correct_length, trigger RST,
    # < correct_length does not trigger RST. Why?
    record_header = bytearray.fromhex("160301") + struct.pack('>H', 4 + 80 + len(server_name))

    return record_header + handshake_header \
        + CLIENT_VERSION + CLIENT_RANDOM \
        + SESSION_ID \
        + CIPHER_SUITE \
        + COMPRESSION_MEHTODS \
        + extensions_length \
        + construct_sni(server_name)

legit_payload_hex = contruct_clienthello("example.youtube.com")
```

I'd like to start some speculations:

Why block ESNI? This seems similar to the way it used to block HTTPS yet allow unencrypted HTTP traffic, i.e. a stopgap measure before a better traffic inspection system is developed. But various findings above also suggest that a correct TLS 1.3 parser is in production in the GFW. So the GFW is probably somewhere between being comfortable with SNI and experimenting with ESNI traffic.
Why now? According to Wikipedia ESNI is only in production in Firefox since March 2020 and it is not supported in Chrome. Lack of Chrome support indicates poor utility in deploying this policy. But ESNI has indeed been some time in the making. My bet would put this more on the experimental side.
Why dropping traffic vs sending TCP resets? I was not aware of this new development, but it makes sense. Sending TCP resets as the way of blocking traffic was shown to be unreliable from time to time as it generates quite some more traffic which can be again lost, for which the resets are often sent in duplets or triplets. Dropping packets would be indeed more reliable. I knew from related Chinese literature that there are route blackholing infrastructures (gateway routers capable of maintaining and updating millions of blackhole rules in real time) linked with traffic detection frontends to handle this type of dynamic blocking, but taking only 1 second to enact such a rule is quite a result. I would suggest trying millisecond-level probes to test the latency from detection events to blocking events, which could be useful in mapping its internal topology. Because it takes some time for a rule to propagate from detection frontend to blocking backend, and longer if the propagation crosses geographical regions. If there are some rare cases that a (source IP, destination IP, and destination port) tuple is routed via both Beijing and Guangzhou, I imagine it would take longer for the Guangzhou data center to enact a rule detected in Beijing.
Why 120 seconds vs 180 seconds? This could be related to its geographical features. Different data centers have different parameters and unifying it is more trouble than useful.
Why additional detection event does not renew blocking timers and why it does in Iran? Renewing the timers would require update operations from detection frontends to the blackholing routers. These updates cost traffic, CPUs, resources. Not doing it would be a form of optimization that is more needed for handling China's national traffic than Iran's.

Thank you @klzgrad for your informative and inspiring comments.

I would suggest trying millisecond-level probes to test the latency from detection events to blocking events, which could be useful in mapping its internal topology.

Why 120 seconds vs 180 seconds? This could be related to its geographical features.

We will get back to this thread with experiment results.

Why now? According to Wikipedia ESNI is only in production in Firefox since March 2020 and it is not supported in Chrome. Lack of Chrome support indicates poor utility in deploying this policy. But ESNI has indeed been some time in the making.

That's a good question. My take is that the GFW can afford to block ESNI now only because it is not yet widely used. If they waited until ESNI/ECH were essential to a large fraction of connections, then it would be more expensive. This is like a game where the first to move has an advantage.

RFC 8744, "Issues and Requirements for Server Name Identification (SNI) Encryption in TLS," has a requirement "Do Not Stick Out". There are two ways to meet this requirement. One way is to make connections whose SNI is encrypted indistinguishable from connections whose SNI is unencrypted. The other way is to do a sudden, massive deployment, so that even if encrypted-SNI connections are tagged and easily distinguishable, those tags become a feature of normal TLS traffic. If you want to blend in with a crowd, you can change yourself to match the surroundings; or you can change the surroundings to match yourself. I think the IETF was banking on the latter strategy being more likely to succeed, and I don't necessarily disagree. Deployment of encrypted SNI was always precarious, with a risk of failure.

The good news is that there will be a second chance with ECH (Encrypted Client Hello), which is the name for the latest revision of what was called ESNI. ECH uses different extension values which are not blocked yet, as far as we know.

My bet would put this more on the experimental side.

You may be right about that. We should not assume the ESNI block is permanent. The GFW sometimes institutes a new rule and later walks it back. I am thinking of this case in 2016 when the GFW blocked an Azure CDN edge server for about four days, but did not re-block it when the server changed its IP address.

Starting from Thursday, August 13, 2020 6:32 AM UTC, we could not trigger ESNI blocking from the outside of China to the inside of China anymore from different vantage points. The last observed ESNI blocking triggered from outside-in was Thursday, August 13, 2020 6:27 AM UTC.

We confirm the ESNI blocking can still be triggered inside-out as of Thursday, August 13, 2020 7:50 AM UTC.

Could anyone corroborate our observation?

I knew from related Chinese literature that there are route blackholing infrastructures (gateway routers capable of maintaining and updating millions of blackhole rules in real time) linked with traffic detection frontends to handle this type of dynamic blocking, but taking only 1 second to enact such a rule is quite a result. I would suggest trying millisecond-level probes to test the latency from detection events to blocking events, which could be useful in mapping its internal topology.

During the time when ESNI blocking could still be triggered from the outside-in direction, we did the following experiment to test the latency from detection events to blocking events. This was a snapshot (one-time measurement) study, and was only tested from one point to another.

Specifically, we used the following script:

#!/bin/bash

sudo -v

IP="REDACTED"
PORT="80"

sudo tcpdump host "$IP" and port "$PORT" -Uw "delay.pcap" &

sleep 2

# wait 100000 microseconds between each SYN
sudo hping3 -S "$IP" -p "$PORT" -i u100000 | tee output_hping.txt &

# send ESNI handshake
sudo python3 esni.py "$IP" "$PORT" | tee esni_output.txt &

sleep 10

sudo pkill python3
sudo pkill hping3

sleep 10

sudo pkill tcpdump

In each experiment, we changed the sending rate of the SYN ping. There is a trade-off in choosing the SYN ping rate: a faster rate gives more precision in measuring the blocking delay, but it may also introduce congestion and/or overwhelm the server with a SYN flood.

We analyzed the captured pcap files. Specifically, we looked for the timestamps of the following three events: the time when the ESNI ClientHello was sent from the client; the last SYN that got a SYN/ACK; and the first SYN that did not get a SYN/ACK. Blocking should take effect between the last SYN that got a SYN/ACK and the first SYN that did not. The blocking delay should therefore be at most the interval between the ESNI being sent from the client and the first SYN that did not get a SYN/ACK.
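The three events described above can be located mechanically once each SYN probe has been labeled as answered or unanswered. The following is a minimal Python sketch, not the script we actually used: the function name `blocking_window` and the input format (a list of `(send_time, answered)` pairs already extracted from the pcap) are assumptions made for illustration. It also assumes that any unanswered SYN after the last answered one is caused by blocking rather than ordinary packet loss.

```python
from typing import List, Optional, Tuple


def blocking_window(
    esni_time: float,
    probes: List[Tuple[float, bool]],
) -> Optional[Tuple[float, float, float]]:
    """Locate the window in which blocking took effect.

    esni_time: relative timestamp at which the ESNI ClientHello was sent.
    probes:    (send_time, answered) pairs, one per SYN probe, where
               answered is True if a SYN/ACK came back for that SYN.

    Returns (last_answered, first_unanswered, upper_bound), where
    upper_bound = first_unanswered - esni_time, or None if the capture
    contains no answered SYN or no unanswered SYN after the last answered one.
    """
    ordered = sorted(probes)

    answered_times = [t for t, ok in ordered if ok]
    if not answered_times:
        return None
    last_answered = answered_times[-1]

    # Only unanswered SYNs sent after the last answered one count as blocked.
    later_unanswered = [t for t, ok in ordered if not ok and t > last_answered]
    if not later_unanswered:
        return None
    first_unanswered = later_unanswered[0]

    return last_answered, first_unanswered, first_unanswered - esni_time
```

Taking the last answered SYN as the reference point (rather than the first unanswered SYN in the capture) keeps an isolated lost probe before the block from being mistaken for the start of blocking.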

The results, shown in relative timestamp, are as follows:

SYN ping rate u100000 (wait 100000 microseconds between each SYN):
1.243874s ESNI sent from client
1.305223s the last SYN that got the SYN/ACK
1.405656s the first SYN that did not get the SYN/ACK

u10000:
0.259244s ESNI sent from client
0.824456s the last SYN that got the SYN/ACK
0.834909s the first SYN that did not get the SYN/ACK

u1000:
0.251383s ESNI sent from client
2.068717s the last SYN that got the SYN/ACK
2.069755s the first SYN that did not get the SYN/ACK

u100, first experiment:
0.283411s ESNI sent from client
2.088801s the last SYN that got the SYN/ACK
2.088914s the first SYN that did not get the SYN/ACK

u100, second experiment:
0.290077s ESNI sent from client
2.114338s the last SYN that got the SYN/ACK
2.114449s the first SYN that did not get the SYN/ACK

In summary, the shortest observed blocking delay, measured from the time the GFW sees a ClientHello with ESNI to the time all packets are dropped, occurred in the experiment with the u100000 sending rate. That delay was at most 1.405656s - 1.243874s = 0.161782 seconds.
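The same subtraction can be applied to every experiment. The sketch below is only an illustration of that arithmetic: the timestamps are copied from the results reported in this post, while the names `RESULTS` and `upper_bounds` are made up for the example. `Decimal` is used so that the six-digit timestamps subtract exactly.

```python
from decimal import Decimal

# label -> (ESNI sent, last SYN that got a SYN/ACK, first SYN that did not),
# all as relative timestamps in seconds, copied from the results in this post.
RESULTS = {
    "u100000": ("1.243874", "1.305223", "1.405656"),
    "u10000": ("0.259244", "0.824456", "0.834909"),
    "u1000": ("0.251383", "2.068717", "2.069755"),
    "u100 (first)": ("0.283411", "2.088801", "2.088914"),
    "u100 (second)": ("0.290077", "2.114338", "2.114449"),
}


def upper_bounds(results):
    """Upper bound on the blocking delay for each experiment:
    time of the first unanswered SYN minus time the ESNI was sent."""
    bounds = {}
    for label, (esni, _last_answered, first_unanswered) in results.items():
        bounds[label] = Decimal(first_unanswered) - Decimal(esni)
    return bounds


if __name__ == "__main__":
    bounds = upper_bounds(RESULTS)
    for label, bound in bounds.items():
        print(f"{label}: at most {bound} s")
    shortest = min(bounds, key=bounds.get)
    print(f"shortest upper bound: {shortest} ({bounds[shortest]} s)")
```

For the u100000 run this reproduces the 0.161782 s figure. Note that these are upper bounds on a single snapshot measurement each, so differences between runs should be read with that in mind.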

Starting from Thursday, August 13, 2020 6:32 AM UTC, we could not trigger ESNI blocking from the outside of China to the inside of China anymore from different vantage points. The last observed ESNI blocking triggered from outside-in was Thursday, August 13, 2020 6:27 AM UTC.

At about 2020-08-13 14:30 UTC, I see the same. I put the short trigger payload in a file minimal.bin, then ran the following commands. The TCP ping kept receiving responses even after sending the trigger payload.

nping -c 0 --tcp-connect www.tsinghua.edu.cn -p 80
ncat -v www.tsinghua.edu.cn 80 < minimal.bin
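For anyone without nping installed, a rough Python stand-in for the TCP connect ping is sketched below. It is an assumption-laden illustration rather than what was run above: the function names `tcp_connect_ok` and `ping_loop` are hypothetical, and it only reports whether a full TCP handshake completes within a timeout, which is the signal that matters for this test.

```python
import socket
import time
from typing import List


def tcp_connect_ok(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, resets, refused connections, and DNS failures.
        return False


def ping_loop(host: str, port: int, count: int, interval: float = 1.0) -> List[bool]:
    """Attempt `count` connections, `interval` seconds apart, and
    return the success/failure result of each attempt in order."""
    results = []
    for _ in range(count):
        results.append(tcp_connect_ok(host, port))
        time.sleep(interval)
    return results
```

Running `ping_loop` in one terminal while sending the trigger payload from another would show blocking as a run of `False` results starting shortly after the payload is sent.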

I had the same results with various destinations: www.tsinghua.edu.cn:80 (166.111.4.100:80), www.tsinghua.edu.cn:443 (166.111.4.100:443), www.china-railway.com.cn:80 (183.131.168.120:80), www.12306.cn:80 (61.147.210.242:80), www.miit.gov.cn:80 (202.106.121.6:80). I am not able to test from inside.

SNI-based TCP RST injection can still be triggered from outside. I tried the following commands. The first command returns a certificate and a working connection. The second command causes three immediate RSTs.

openssl s_client -connect www.tsinghua.edu.cn:443 -servername www.example.com
openssl s_client -connect www.tsinghua.edu.cn:443 -servername www.facebook.com

As a side note, today I am able to visit China-based web sites in Tor Browser (https://www.tsinghua.edu.cn/, http://www.china-railway.com.cn/, https://www.12306.cn/, http://www.miit.gov.cn/). I do not know if it is related. The GFW has, of course, blocked connections to Tor relays for a long time, but for the past few years it has also blocked connections from Tor relays, including exits. I cannot remember exactly when the outside-in blocking of Tor began, but I think it was in 2016 or 2017.
