net4people / bbs

Forum for discussing Internet censorship circumvention

Looking for: A good paper on how TLS-in-TLS detection works? #281

Open mmmray opened 1 year ago

mmmray commented 1 year ago

Related to #280, which got me thinking about TLS-in-TLS: I wonder if ECH and/or GREASE in the inner layer would (temporarily) confuse those heuristics.

But then I realized I have no idea how TLS-in-TLS is detected today. Is there an (English) summary of it?

I only see some vague allusions to it at https://github.com/XTLS/Xray-core/discussions/1295 (Google Translate), which talks about packet sizes and timings being detectable via machine learning. But it's not very concrete about how this detection works. (I think it's a good read regardless.)

diwenx commented 1 year ago

> TLS is also the most common type of traffic; regardless of the form of the proxy itself, the traffic being proxied is basically TLS, maximizing the impact of the problem.

This is spot on. You cannot avoid generating this signature simply by not using TLS. And for the exact same reason, the prevalence and fingerprintability of TCP handshakes (the 3-way open and the 4-way close) present a vulnerability for layer-2 tunnels like obfuscated VPNs, regardless of whether the VPN itself is TCP- or UDP-based.

If you don't do something to disguise the timing and directionality pattern of TLS, then that evidence of tunneling will show through.

It would be great if browsers could share some responsibility with proxies. Imagine a GREASE extension that could fit into any TLS packet type, serving only to inflate packet sizes. But I guess as long as TLS-in-TLS remains a "niche" security concern affecting only users from select regions, such an initiative may remain unlikely.

wkrp commented 1 year ago

> It would be great if browsers could share some responsibility with proxies. Imagine a GREASE extension that could fit into any TLS packet type, serving only to inflate packet sizes. But I guess as long as TLS-in-TLS remains a "niche" security concern affecting only users from select regions, such an initiative may remain unlikely.

TLS does have a built-in feature to pad records that are encrypted:

https://www.rfc-editor.org/rfc/rfc8446.html#section-5.2

struct {
    opaque content[TLSPlaintext.length];
    ContentType type;
    uint8 zeros[length_of_padding];
} TLSInnerPlaintext;

https://www.rfc-editor.org/rfc/rfc8446#section-5.4

> All encrypted TLS records can be padded to inflate the size of the TLSCiphertext. This allows the sender to hide the size of the traffic from an observer.
>
> When generating a TLSCiphertext record, implementations MAY choose to pad. An unpadded record is just a record with a padding length of zero. Padding is a string of zero-valued bytes appended to the ContentType field before encryption. Implementations MUST set the padding octets to all zeros before encrypting.
>
> Application Data records may contain a zero-length TLSInnerPlaintext.content if the sender desires. This permits generation of plausibly sized cover traffic in contexts where the presence or absence of activity may be sensitive. Implementations MUST NOT send Handshake and Alert records that have a zero-length TLSInnerPlaintext.content; if such a message is received, the receiving implementation MUST terminate the connection with an "unexpected_message" alert.
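To make the layout concrete, here is a minimal sketch (mine, in Go; a real TLS stack does this internally before encrypting the record) of the TLSInnerPlaintext encoding shown above:

// Toy encoder for the TLSInnerPlaintext structure quoted above: the
// plaintext content, then the real ContentType byte, then zero-valued
// padding. The whole thing is encrypted into a TLSCiphertext record, so
// an observer sees only the inflated length.
func encodeInnerPlaintext(content []byte, contentType byte, paddingLen int) []byte {
    out := make([]byte, 0, len(content)+1+paddingLen)
    out = append(out, content...)
    out = append(out, contentType)                 // real type is hidden under encryption
    out = append(out, make([]byte, paddingLen)...) // MUST be all zeros (RFC 8446, Section 5.4)
    return out
}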

I don't think you can use this record padding feature in the ClientHello, but there you can use the padding extension (RFC 7685):

> This memo describes a Transport Layer Security (TLS) extension that can be used to pad ClientHello messages to a desired size.
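On the wire the extension is trivial: a two-byte type (21), a two-byte length, and zeros. A rough Go sketch, assuming the caller picks a pad length that brings the ClientHello to the desired size:

// Sketch of the RFC 7685 padding extension encoding (type 21). The
// extension body carries only zero bytes; its length is chosen so the
// whole ClientHello reaches a target size.
func paddingExtension(padLen int) []byte {
    ext := make([]byte, 4+padLen)
    ext[0], ext[1] = 0x00, 0x15 // extension type 21 = padding
    ext[2] = byte(padLen >> 8)  // extension_data length, big-endian
    ext[3] = byte(padLen)
    // the remaining padLen bytes are already zero
    return ext
}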

Here's @ValdikSS's past demonstration of using the padding extension to get around a filter:

https://ntc.party/t/http-headerstls-padding-as-a-censorship-circumvention-method/168
https://ntc.party/t/firefox-for-android-with-tls-padding-for-censorship-circumvention/1725

That was intended for when the TLS is sent without a tunnel around it, but it could also work to break up the traffic signature. Unfortunately, merely adding padding doesn't change the overall directionality of bursts. I'm not sure if it's possible in TLS to, for example, send "no-op" records before the handshake, and in any case changing the directionality would likely require cooperation from the server.

diwenx commented 1 year ago

> But then I realized I have no idea how TLS-in-TLS is detected today. Is there an (English) summary of it?

I was referred to a paper from this year's SIGCOMM, "GGFAST: Automating Generation of Flexible Network Traffic Classifiers": https://dl.acm.org/doi/pdf/10.1145/3603269.3604840

While it doesn't look at TLS-in-TLS specifically, section 7 explores encrypted flow classification, with a subsection on how to classify SMTP flows when tunneled within TLS.

> We trained an SMTP classifier, using 25,000 flows of plaintext SMTP traffic <...> evaluated it on the TLS flows of that same dataset, using the TLS sequence-of-lengths variant.

The methodology proposed in the paper might provide insights into detecting TLS within TLS. The basic premise is to train a classifier on a plaintext protocol (plain TLS would play that role here), using features that remain stable and visible post-encryption, such as packet sizes, direction, and timing. This classifier can then be applied to the payload part of encrypted flows.

It seems that their classifiers achieved pretty decent precision for detecting SMTP-in-TLS.

> Only a small fraction (0.4%) of other non-SMTP TLS flows are mislabeled as SMTP <...> out of the 14,474 false positives, 9,200 correspond to IMAP-over-TLS and POP3-over-TLS traffic. Although these are still false positives, these protocols are adjacent to SMTP and have very similar syntax.
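For intuition, the sequence-of-lengths features the paper relies on can be pictured with a toy extractor like this (my illustration in Go, not the paper's code); the point is that sizes and directions survive encryption:

// Toy "sequence of lengths" feature vector: sizes of the first n records
// in a flow, signed by direction (positive = client to server, negative =
// server to client). Because these features survive encryption, a
// classifier trained on plaintext flows can be applied to tunneled flows.
type record struct {
    fromClient bool
    length     int
}

func sequenceOfLengths(flow []record, n int) []int {
    features := make([]int, 0, n)
    for _, r := range flow {
        if len(features) == n {
            break
        }
        if r.fromClient {
            features = append(features, r.length)
        } else {
            features = append(features, -r.length)
        }
    }
    return features
}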

klzgrad commented 1 year ago

> TLS does have a built-in feature to pad records that are encrypted

And the HTTP/2 protocol also has built-in padding fields, but in the implementations of both protocols padding is mostly an afterthought, and it's annoying to try to create padding through the existing APIs of these implementations without patching their cores. In terms of sustaining long-term maintenance, I'd prefer not to rely on the forgotten built-in padding.

One question I am still struggling with is what the tunnel's traffic sending schedule should actually be; i.e., client-first or server-first, and what mix of burst sizes and directionality to use.

I have this intuition so far: the tunnel's traffic schedule should parrot what the tunnel "should" look like as if it were not a tunnel. As an example, if I have an HTTP/2 tunnel that sends ALPN h2 and is based on an actually existing HTTP/2 implementation, then the tunnel payload should be reorganized into what a regular HTTP/2 connection would look like: a series of 50-200-byte requests and a bunch of large downloads. The issue of directionality can be explained away by HTTP/2 pipelining and multiplexing: e.g., even though the inner TLS handshakes look like several ping-pong round trips, to an observer it is just the client sending CSS requests first and image requests later. I don't know whether this passes the dead parrot test or not; just a thought.

The struggle is probably in coming up with a general traffic schedule; if there is a more specific scope for parroting, it is easier to narrow down the target distribution. But not too specific: this is not the classic definition of parroting, as we are not parroting a particular application or protocol in terms of its structure, but in terms of its traffic distribution. The straightforward, brute-force way would be to train a generative model, given sufficient data covering an entire class of target traffic, and use that to generate the traffic schedule you want. But I hope there are cheaper heuristics that just raise the floor of detection high enough to achieve circumvention.

Edit: One more thing. I believe the traffic schedule should be more general than site-based. Re:

> If I (as Xray operator) operated the target website myself, I would have precise traffic measurements to the real website that I can extract patterns from.

There are several issues with this level of specificity. It's not economical to require every operator to generate their own traffic schedules, as it requires highly automated tooling for generation (which OS? which browser? generate a schedule per OS/browser combination? what about updates or concept drift?), and more tools for verifying that the generated schedules are actually OK (a: the operator can inadvertently generate a traffic profile of, e.g., google.com, which is known to every website fingerprinter, if they choose to mirror it; b: is it even possible to build this kind of adversarial tooling?). And schedules generated from the data of one website may, due to their limited scope, be too specific and risk becoming the classic parrot.

wkrp commented 1 year ago

> But I hope there are cheaper heuristics that just raise the floor of detection high enough to achieve circumvention.

My thoughts are in this direction as well. In website fingerprinting research they always try to quantify the overhead: how much the defense costs, in terms of bandwidth and latency. But for our purposes, it's likely that the very beginning of a connection is, by far, the most important. (Probably just the first few packets, even.) If we traffic-shape just, say, the first 10 KB of a connection in both directions, and revert to "natural" shaping after that, that's likely to put us ahead of the game for a long time, and the overhead will be asymptotically negligible.
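As a sketch of what "shape only the first 10 KB" could mean in code (my illustration, assuming a net.Conn-based tunnel that already has framing so the peer can strip padding):

package shaping

import "net"

// shapedConn pads each outgoing write up to a fixed burst size until a
// byte budget (e.g. 10 KB) is spent, then reverts to writing through
// unmodified. The read direction would need the same treatment.
type shapedConn struct {
    net.Conn
    remaining int // shaping budget in bytes, e.g. 10 * 1024
    burst     int // fixed on-wire write size while shaping
}

func (c *shapedConn) Write(p []byte) (int, error) {
    written := 0
    for c.remaining > 0 && written < len(p) {
        buf := make([]byte, c.burst)
        n := copy(buf, p[written:]) // unused tail of buf stays zero (padding)
        if _, err := c.Conn.Write(buf); err != nil {
            return written, err
        }
        written += n
        c.remaining -= c.burst
    }
    if written < len(p) { // budget spent: revert to natural shaping
        n, err := c.Conn.Write(p[written:])
        return written + n, err
    }
    return written, nil
}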

Rather than first trying to figure out the question of what a traffic schedule should look like, I'm thinking about the possibility of defining a "challenge" with a few simple schedules for circumvention developers to try implementing. These would be "strawman" schedules, not designed for effective circumvention, but just meant to give developers a common target to work towards (as I expect implementing even these will require some internal code restructuring). When a few projects have developed the necessary support for shaping traffic according to a schedule, then it will be easier to experiment with alternatives.

This is the kind of thing I am thinking of (a sketch of implementing schedule I follows, after schedule III):

Traffic schedule I (constant rate, no randomness, client and server schedules independent)
Client
  1. Connect to server.
  2. Send a burst of 5 KB.
  3. Sleep 500 ms.
  4. Go to 2.
Server
  1. Wait for incoming connection.
  2. Send a burst of 5 KB.
  3. Sleep 500 ms.
  4. Go to 2.
Traffic schedule II (server starts, random sizes)
Client
  1. Connect to server.
  2. Wait to receive at least 100 bytes from server.
  3. Send an amount of data randomly selected from {120, 170, 250} bytes.
  4. Send 1400x + y bytes, where x is random in {0, …, 5} and y is random in {0, …, 1400}.
  5. Sleep (100 + 100×Beta(1.0, 5.0)) ms.
  6. Go to 4.
Server
  1. Wait for incoming connection.
  2. Send an amount of data randomly selected from {250, 255, 270} bytes.
  3. Wait to receive at least 1000 bytes from client.
  4. Send 1400x + y bytes, where x is random in {0, …, 5} and y is random in {0, …, 1400}.
  5. Sleep (100 + 100×Beta(1.0, 5.0)) ms.
  6. Go to 4.
Traffic schedule III (random sizes, multiple simulated processes, dependence on different kinds of inner messages)
Client
  1. Connect to server.
  2. Send a number of bytes randomly selected from {1000, 1200, 1250}.
  3. Independently:
    1. Send random(100, 4000) bytes, sleep random(10, 50) ms, repeat.
    2. Every 20 s, send a "ping" message of 20 bytes.
Server
  1. Wait for incoming connection.
  2. Wait for at least 1 byte from client.
  3. Independently:
    1. Send random(100, 4000) bytes, sleep random(10, 50) ms, repeat.
    2. Wait for a "ping" message, send a "pong" message of 40 bytes, repeat.
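To show what implementing one of these might involve, here is a minimal Go sketch of schedule I's client side (my illustration, assuming application data arrives on a channel and the tunnel has framing so the server can discard padding):

package schedule

import (
    "net"
    "time"
)

const (
    burstSize = 5 * 1024               // step 2: send a burst of 5 KB
    interval  = 500 * time.Millisecond // step 3: sleep 500 ms
)

// runClient drains application data into fixed 5 KB bursts, topping each
// burst up with zero padding so the wire pattern never depends on how
// much real data is queued.
func runClient(conn net.Conn, appData <-chan []byte) error {
    var pending []byte
    for {
        for len(pending) < burstSize {
            select {
            case d := <-appData:
                pending = append(pending, d...)
            default:
                // No real data ready: pad the rest of the burst. Never block.
                pending = append(pending, make([]byte, burstSize-len(pending))...)
            }
        }
        if _, err := conn.Write(pending[:burstSize]); err != nil {
            return err
        }
        pending = pending[burstSize:]
        time.Sleep(interval)
    }
}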
yuhan6665 commented 1 year ago

@wkrp thanks for the writeup. Seems like a good coding task for us. So far we have tried to implement a simple and efficient structure in Xray for padding and shaping the first few packets. It only accounts for the number of packets; in the future, we should add more traffic state, like bytes received. We are currently looking to release the customization capability of these schedules to users. I wonder what the suitable way/level of config is. (@RPRX thought of putting everything into a "seed", like Minecraft.)

I also find your choice of specific numbers interesting. Some numbers I roughly get: 1400, and ping 20 / pong 40, are common traffic patterns. But what about {120, 170, 250} and {250, 255, 270}? Also, what is the probabilistic meaning of choosing a Beta distribution?

wkrp commented 1 year ago

The numbers are just some numbers I made up. There's no meaning to them—I was just trying to think of some schedules that might pose some design challenges.

Please do not take the ideas I sketched as recommendations for good traffic schedules. They are bad traffic schedules, in fact. I am just brainstorming ways to make progress towards general and effective traffic shaping. My thinking is that there are two obstacles: (1) current systems need to be rearchitected to be more flexible in the kind of traffic shaping they support, and (2) we need to find out what traffic schedule distributions are practical and effective. I find myself thinking about (2) perhaps too much (as in https://github.com/net4people/bbs/issues/281#issuecomment-1703497132), and I reflected that a more productive path forward may be to get more developers thinking about (1). We can tackle problem (1) first, targeting artificial "strawman" traffic schedules; then we'll have the infrastructure necessary to comfortably experiment with problem (2). My idea was that by posting a list of concrete "challenges" we can get everyone working on a common problem and thinking about the issues involved.

I didn't intend https://github.com/net4people/bbs/issues/281#issuecomment-1724755111 to be a final list of recommendations. I think it should get some more thought. But a list of traffic shaping challenges could look something like that.

The beta distribution is just from my intuition that uniform distributions are maybe not the best for natural-looking traffic features. But it's not important: the goal is not to prescribe a specific algorithm for implementation, it's to demonstrate that the software can handle different kinds of distributions. You can replace it with a uniform distribution or whatever. These are not recommendations for anything to be shipped to users, at this point.

What I mean when I talk about design questions involved in traffic shaping, is that implementing even a simple traffic schedule requires at least two things:

  1. A send buffer of outgoing data that is ready to be sent, but is waiting for the traffic scheduler to schedule a send event.
  2. A padding generator to create data to send when the traffic scheduler calls for it, even if there is no "real" data waiting in the send buffer.

These two things are what is required to move beyond simplistic, one-packet-at-a-time padding, and really decouple the observable traffic features of the tunnel from the traffic features of the application protocol inside the tunnel.

Implementing this properly may require you to turn the main loop of your program "inside-out". I wrote about this in the past and made a sample patch for obfs4proxy:

https://lists.torproject.org/pipermail/tor-dev/2017-June/012310.html

The current implementation, in pseudocode, works like this (transports/obfs4/obfs4.go obfs4Conn.Write):

on recv(data) from tor:
  send(frame(data))

If it instead worked like this, then obfs4 could choose its own packet scheduling, independent of tor's:

on recv(data) from tor:
  enqueue data on send_buffer

func give_me_a_frame(): # never blocks
  if send_buffer is not empty:
      dequeue data from send_buffer
      return frame(data)
  else:
      return frame(padding)

in a separate thread:
  buf = []
  while true:
      while length(buf) < 500:
          buf = buf + give_me_a_frame()
      chunk = buf[:500]
      buf = buf[500:]
      send(chunk)
      sleep(100 ms)

The key idea is that give_me_a_frame never blocks: if it doesn't have any application data immediately available, it returns a padding frame instead. The independent sending thread calls give_me_a_frame as often as necessary and obeys its own schedule. Note also that the boundaries of chunks sent by the sending thread are independent of frame boundaries.
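For concreteness, here is a rough Go rendering of the same pattern (my sketch, not the actual obfs4proxy patch; makePaddingFrame is a hypothetical stand-in for frame(padding)):

package framing

import (
    "net"
    "time"
)

// makePaddingFrame is a hypothetical stand-in for frame(padding).
func makePaddingFrame() []byte {
    return make([]byte, 32)
}

// nextFrame never blocks: it returns queued application data if any is
// ready, and a padding frame otherwise.
func nextFrame(frames <-chan []byte) []byte {
    select {
    case f := <-frames:
        return f
    default:
        return makePaddingFrame()
    }
}

// sendLoop ignores frame boundaries: it accumulates frames into buf and
// ships fixed 500-byte chunks every 100 ms, on its own schedule.
func sendLoop(conn net.Conn, frames <-chan []byte) error {
    var buf []byte
    for {
        for len(buf) < 500 {
            buf = append(buf, nextFrame(frames)...)
        }
        if _, err := conn.Write(buf[:500]); err != nil {
            return err
        }
        buf = buf[500:]
        time.Sleep(100 * time.Millisecond)
    }
}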

I attach a proof-of-concept patch for obfs4proxy that makes it operate in a constant bitrate mode.

Also compare to the discussion in the recent "Security Notions for Fully Encrypted Protocols":

> To avoid traffic analysis based on message length, we give a novel security notion for FEPs called length shaping, in part inspired by real-world concerns. It requires that the protocol be capable of producing any given number p of bytes of valid ciphertext data on command. While protocols like Obfs4 will add specified padding to the input, we require length shaping to apply to the output to provide greater control over the lengths of network messages. Length shaping precludes the existence of a minimum message length, and, more generally, the output lengths can be shaped arbitrarily, such as into a data-independent pattern or that of a different FEP.

In their sample protocol of Figure 1, obuf is the send buffer I talked about, and p‖0p is the padding generator.
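In Go terms, the notion might be captured by an interface like this (my naming, not the paper's):

// Hypothetical rendering of the length-shaping requirement: on command,
// emit exactly p bytes of valid ciphertext, drawing on the send buffer
// (obuf) and the padding generator as needed.
type LengthShaper interface {
    NextCiphertext(p int) ([]byte, error)
}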

stevejohnson7 commented 10 months ago

The paper you are seeking about TLS-in-TLS detection has been published at USENIX Security 2024: https://www.usenix.org/conference/usenixsecurity24/presentation/xue

Source: Xray Telegram Group

yuhan6665 commented 10 months ago

Thanks for sharing. Diwen Xue also recommended another paper of interest https://www.robgjansen.com/publications/precisedetect-ndss2024.pdf

klzgrad commented 10 months ago

> Thanks for sharing. Diwen Xue also recommended another paper of interest https://www.robgjansen.com/publications/precisedetect-ndss2024.pdf

The recommendation suggests that even a detector with less-than-practical precision or false-positive rate cannot be underestimated, because it becomes more powerful when aggregated into coarse-grained, host-based analysis. So obfuscation strength matters quantitatively, and it's always useful to increase it.

A simple countermeasure to host-based analysis is to insert dummy flows at the host level. But it may be logistically difficult to generate diverse traffic from diverse sources to the circumvention bridge at low cost.
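As a toy illustration (hypothetical names; it sidesteps exactly the hard part of making the decoys diverse and cheap), the host could keep a few decoy connections alive alongside the real tunnel:

package decoy

import (
    "net"
    "time"
)

// startDecoyFlows dials the bridge a few extra times and trickles junk
// bytes over each connection, so per-host flow counts and timings no
// longer single out the real tunnel.
func startDecoyFlows(bridgeAddr string, n int) {
    for i := 0; i < n; i++ {
        go func() {
            conn, err := net.Dial("tcp", bridgeAddr)
            if err != nil {
                return
            }
            defer conn.Close()
            junk := make([]byte, 1200)
            for {
                if _, err := conn.Write(junk); err != nil {
                    return
                }
                time.Sleep(2 * time.Second)
            }
        }()
    }
}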

wkrp commented 10 months ago

> Thanks for sharing. Diwen Xue also recommended another paper of interest https://www.robgjansen.com/publications/precisedetect-ndss2024.pdf

There is a thread now for this paper, with a summary: #312.