net4people / bbs

Forum for discussing Internet censorship circumvention

Balboa: Bobbing and Weaving around Network Censorship (USENIX Security 2021) #112


wkrp commented 2 years ago

Balboa: Bobbing and Weaving around Network Censorship
Marc B. Rosen, James Parker, Alex J. Malozemoff
https://censorbib.nymity.ch/#Rosen2021a
Presentation video and slides

Balboa is a framework for link obfuscation that is in the same vein as Slitheen and Protozoa. The goal of all these systems is to embed a hidden communications channel inside some other network flow, without changing any externally observable features of the flow, particularly the traffic analysis features of packet sizes and packet timings. Balboa, like the others, works by traffic replacement: it removes some encrypted portions of the carrier flow and replaces them with identically sized encryptions of covert data. Balboa assumes TLS for the carrier flow (Slitheen also used TLS; Protozoa used WebRTC video). Balboa has some unique advantages: by hooking into TLS libraries and intercepting networking system calls, it can use unmodified application binaries at both ends; and it undoes its traffic replacement before passing decrypted TLS payloads to upper network layers, which means the application programs behave identically to how they would behave in the absence of traffic replacement. The authors provide two instantiations of the system, one that uses an Icecast audio stream and one that works over HTTPS web browsing.

The most significant difference in Balboa is its use of a preshared traffic model. Client and server are assumed not only to share a symmetric key, but also to know in advance some portion of what the application program on the other side will send through the tunnel; this is the portion that is eligible for traffic replacement. For example, in an audio streaming setting, the Balboa client may already have a copy of some of the audio files that the server will later stream. The Balboa server also knows in advance what audio files the client has. When the Balboa server would stream one of the files the client already has, it instead replaces (under TLS encryption) the file's contents with a pointer into the traffic model ("for the next file, substitute 'Metallica - Fuel.ogg'"), followed by covert data for the remainder of the file size. The Balboa client sees and interprets the traffic model pointer, and re-substitutes its local copy of the file (i.e., the very same bytes the client would have received from the server if there had been no traffic replacement) into the data stream that it passes up the network stack, meanwhile saving the covert data somewhere else. Covert data is sent only when an application program would be sending data anyway, and only when what is being sent is part of the shared traffic model. The traffic model is what enables both sides to "fill in the gaps" that traffic replacement creates in the data stream. In the authors' two instantiations of Balboa, the traffic model is a full copy of files to be sent later (in order to send N bytes of covert data, the peers need to have preshared N bytes' worth of files), but one can imagine a procedural or otherwise more concise representation of a traffic model (Section 2.1).
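
To make the mechanism concrete, here is a rough sketch in C of the replace/restore logic over a whole in-memory file. Everything here is illustrative, not Balboa's actual code: the real system signals the substitution under TLS encryption using the shared key (not with a recognizable plaintext tag as below), and it operates on TLS records inside hooked libc calls rather than on whole files.

```c
#include <string.h>

#define ID_LEN 8

/* Preshared traffic model: a file both peers already have, plus a short
   identifier the server can send in its place ("for the next file,
   substitute 'Metallica - Fuel.ogg'"). Contents are placeholders. */
static const unsigned char model_id[ID_LEN] = "FUEL_OGG";
static const unsigned char model_data[] = "...the bytes of Metallica - Fuel.ogg...";

/* Server side: if the outgoing buffer is a file in the model, overwrite it
   in place (length unchanged!) with the model pointer followed by covert
   data. Returns the number of covert bytes smuggled out. */
static size_t server_replace(unsigned char *buf, size_t len,
                             const unsigned char *covert, size_t covert_len)
{
    if (len != sizeof(model_data) || memcmp(buf, model_data, len) != 0)
        return 0;                      /* not in the model: pass through */
    size_t room = len - ID_LEN;
    size_t n = covert_len < room ? covert_len : room;
    memcpy(buf, model_id, ID_LEN);     /* pointer into the traffic model */
    memcpy(buf + ID_LEN, covert, n);   /* covert payload */
    memset(buf + ID_LEN + n, 0, room - n);
    return n;
}

/* Client side: recognize the pointer, squirrel away the covert payload,
   and re-substitute the local copy of the file so the application above
   sees exactly the bytes it would have received anyway. */
static size_t client_restore(unsigned char *buf, size_t len,
                             unsigned char *covert_out)
{
    if (len != sizeof(model_data) || memcmp(buf, model_id, ID_LEN) != 0)
        return 0;                      /* ordinary traffic: leave it alone */
    size_t room = len - ID_LEN;
    memcpy(covert_out, buf + ID_LEN, room);
    memcpy(buf, model_data, len);      /* restore the original file bytes */
    return room;
}
```

The crucial invariant is that the on-the-wire length never changes, so packet sizes and timings are untouched; covert data flows only when the application would have sent model data anyway.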

A considerable part of the paper is devoted to showing how various engineering challenges were overcome. Balboa sets the SSLKEYLOGFILE environment variable, or uses other means, to recover the keys necessary for decrypting and re-encrypting TLS application data records on the fly. But in order to work with unmodified applications while not disturbing packet boundaries, Balboa needs to hook into the network stack even below the TLS library, at the level of C library functions like read and write. This occasions considerable complexity, as Balboa needs to cope with TLS records that are not aligned with the buffers used by the low-level calls (Section 2.5). Balboa also needs to run its decryption and re-encryption quickly, because any small processing delays are potential distinguishers. But because of the way Balboa works, such delays are really the only leverage a censor has to distinguish flows.
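
To illustrate the interposition technique, here is a minimal C shim of the kind one might build and preload to intercept write. This is only a sketch of the general LD_PRELOAD pattern, not Balboa's actual code, which must also hook read, cope with TLS records that straddle buffer boundaries, and do its rewriting fast enough to avoid timing distinguishers.

```c
/* Build: gcc -shared -fPIC -o shim.so shim.c -ldl
   Run:   LD_PRELOAD=./shim.so ./some_tls_application */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>

static ssize_t (*real_write)(int, const void *, size_t);

ssize_t write(int fd, const void *buf, size_t count)
{
    /* Look up the real libc write the first time we are called. */
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
            dlsym(RTLD_NEXT, "write");

    /* A Balboa-like shim would inspect `buf` here for outgoing TLS
       application-data records, decrypt them with keys recovered via
       SSLKEYLOGFILE, perform traffic replacement, and re-encrypt,
       all without changing `count` or the record framing. */
    return real_write(fd, buf, count);
}
```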

Thanks to the authors for reviewing a draft of this summary.

wkrp commented 2 years ago

It is interesting to consider the design tradeoffs that permit Slitheen and Protozoa to have the indistinguishability property despite not having an elaborated traffic model like Balboa's. Both of them limit themselves to replacing content types that are assumed to be free of side effects, in terms of application behavior. Slitheen calls these "leaf" content types, and Protozoa re-substitutes a valid video ("Prevent Video Decoding Malfunction") into the decrypted stream it gives to the media player; the assumption is that the video player behaves the same for any valid video. HTML is an example of a content type that is not free of side effects, since HTML may contain references to other resources that cause a browser to make network requests; if the HTML were overwritten by traffic replacement, the browser would not make those requests, which would be an observable difference in behavior. The advantage of the Balboa traffic model is that even such non-leaf content types are candidates for traffic replacement; the cost is the complexity of specifying the model. Balboa has a stronger guarantee of not affecting program behavior, since it does not require predicting how a program will react to seeing a certain file. An image file in HTTP, for example, may or may not actually be a leaf resource: a JavaScript program can render an image into a canvas, inspect the pixels, and potentially change its behavior based on the image's contents.

klzgrad commented 1 year ago

Is it necessary to have byte- and millisecond-perfect traffic profiles to render traffic analysis unusable in practice? Part of the design requires running real applications to generate interactive behaviors, which seems too heavyweight for real-world deployment, e.g. on an embedded device. What are the issues if I record some traffic models and replay them with some kind of Markov chain and some randomization? I recall some earlier papers doing this, but I don't remember offhand whether they, too, were shown to be dead parrots.

The second question is how such traffic models can be created and distributed at minimum cost. If they are centrally created and distributed, they will be used by the censor for detection. If they are locally generated by the user, it is hard to create models with adequate coverage within a limited time budget.

wkrp commented 1 year ago

> Is it necessary to have byte- and millisecond-perfect traffic profiles to render traffic analysis unusable in practice?

In my opinion, no, such a high level of fidelity is not necessary in practice. To me, the value of research like this is in showing that an upper bound is technically achievable, even if it's too cumbersome to easily use.

The way I think of these traffic-shaping models is by analogy to statistical vs. computational indistinguishability in cryptography. If we claim that a ciphertext should resemble a random byte string, the question is, how strong does the word "resemble" need to be? If I encrypt a message M with a one-time pad and give you the ciphertext, there is no test you can perform that will distinguish the ciphertext from a random string, even with unlimited computational power. That's because the output of one-time pad encryption is statistically indistinguishable from random: the distribution of its outputs is the uniform distribution. Whereas if I encrypt M using AES, there is a computational test you can do to distinguish it from random: try every possible key, and see if the ciphertext ever decrypts to M. The output of AES encryption is not statistically indistinguishable from random, but we still say it is computationally indistinguishable from random, since no real-world computational process can try every AES key. Statistical indistinguishability is comparing probability distributions; computational indistinguishability is comparing the behavior of distinguisher algorithms given polynomially many samples from the distributions.
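
For reference, here is a condensed form of the two definitions (standard textbook material, not specific to Balboa). The statistical distance between distributions $X$ and $Y$ is

$$\Delta(X, Y) = \frac{1}{2} \sum_{z} \left| \Pr[X = z] - \Pr[Y = z] \right|$$

and statistical indistinguishability means $\Delta$ is negligible (for the one-time pad versus uniform randomness it is exactly $0$). Computational indistinguishability instead asks only that every probabilistic polynomial-time distinguisher $D$ have negligible advantage:

$$\mathrm{Adv}(D) = \left| \Pr[D(X) = 1] - \Pr[D(Y) = 1] \right|$$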

The systems that work by traffic replacement produce traffic patterns that are actually statistically indistinguishable from those of the applications they use (in theory; in practice there may be differences due to factors like small added computational latencies).

For circumvention, I think we don't even need a notion as strong as computational indistinguishability. What would suffice is a tier below those two, weighted somehow by the distinguisher paying some "cost" per act of classification and for any misclassifications. (It would be a nice thing to formalize.)
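
A rough sketch of what such a notion might look like (purely illustrative; the symbols below are my own, not from any paper): suppose the censor runs a distinguisher $D$ at a cost $c$ per flow classified, gains $g$ for each circumvention flow correctly blocked, and pays a penalty $p$ for each legitimate flow wrongly blocked. If circumvention flows occur at base rate $\beta$, deploying $D$ is only rational when

$$\beta \, g \cdot \mathrm{TPR}(D) - (1 - \beta) \, p \cdot \mathrm{FPR}(D) - c > 0$$

Under a notion like this, a circumvention system does not need negligible advantage against every possible $D$; it only needs to keep every distinguisher the censor can afford below this threshold, and a small base rate $\beta$ works strongly in its favor.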