In the Line of Fire: Risks of DPI-triggered Data Collection (CSET 2023)

mmmray commented 9 months ago

https://www.sysnet.ucsd.edu/~voelker/pubs/fireeye-cset23.pdf

just wanted to share this. basically, what happens is:

user sends unencrypted traffic on university network to his own webserver
suddenly gets requests to the same paths from all over the world
confirms that a middlebox is monitoring the traffic, and that a security product re-fetches the resources from other IPs to evaluate the content

i think the methods described to locate the censor are inspiring:
send (purposefully) unencrypted requests with very unique paths, to later wait for probing requests
use TTL on packet to identify where in the network traffic is monitored
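
a rough sketch of how the two ideas could fit together (this is not the paper's code; the path layout, the function names, and the assumption that each access log line starts with the client IP are all mine): give every request its own random path, remember the TTL it was sent with, then look for those paths in the web server's log.

```python
import secrets

def make_canary(ttl, registry):
    """Create a unique, unguessable path and record the TTL it will be sent with."""
    token = secrets.token_hex(16)
    registry[token] = ttl
    return f"/canary/{token}.html"

def triggered_ttls(access_log_lines, registry, own_ips):
    """Return the TTLs whose canary path was requested by someone other than us.

    Assumes each log line starts with the client IP (as in common log format).
    """
    seen = set()
    for line in access_log_lines:
        client_ip = line.split(" ", 1)[0]
        if client_ip in own_ips:
            continue  # our own request reached the server; not a probe
        for token, ttl in registry.items():
            if token in line:
                seen.add(ttl)
    return seen

def localize(access_log_lines, registry, own_ips):
    """Smallest TTL that still drew probes, or None if no canary was re-fetched."""
    hits = triggered_ttls(access_log_lines, registry, own_ips)
    return min(hits) if hits else None
```

sending the requests is not shown. one option (untested here) is to finish the TCP handshake normally and then lower the TTL with `sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)` before writing the HTTP request, so that only the request itself expires partway along the path.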

gaukas commented 9 months ago

Thanks for sharing. That's an interesting paper to read for people in this community and even for overall network measurement audiences.

> send (purposefully) unencrypted requests with very unique paths, to later wait for probing requests

If you were talking about triggering active probing against censorship circumvention setups/servers, it might actually get more complicated than what's being described in the paper. If I understand correctly, the authors of this paper were triggering probes against one specific URL served on a plaintext HTTP server that does not offer directory listing. That made it easy to filter out the Internet Background Radiation.

The circumvention community, on the other hand, deals with more straightforward probes on TCP ports against common application protocols such as TLS, which receive continuous probes with or without a trigger. It takes non-trivial effort to tell how many of the probes are actually triggered, and an even more challenging task is finding out which probes are sent by a censor instead of a random network measurement institute, since many of these institutes are allegedly equipping some of the middleboxes with reactive probing* mechanisms.
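
To make the "how many are actually triggered" part concrete, here is one crude estimate, assuming you log your own trigger times and the probe arrival times yourself: compare the number of probes seen shortly after each trigger with what the background rate alone would predict. The function name, the window length, and the assumption that windows do not overlap are all mine, not from the paper.

```python
def triggered_excess(trigger_times, probe_times, window, total_duration):
    """Estimate how many probes exceed what background traffic would explain.

    All times are seconds since the start of the observation period.
    Assumes the post-trigger windows do not overlap each other.
    """
    covered = window * len(trigger_times)   # seconds that fall inside a window
    uncovered = total_duration - covered    # seconds with no recent trigger
    if uncovered <= 0:
        raise ValueError("no trigger-free time left to estimate a background rate")

    def in_window(t):
        return any(0 <= t - trig <= window for trig in trigger_times)

    inside = sum(1 for t in probe_times if in_window(t))
    outside = len(probe_times) - inside

    background_rate = outside / uncovered   # probes per second without a trigger
    expected_inside = background_rate * covered
    return inside - expected_inside
```

A clearly positive result suggests that some probes are reactive. It says nothing about who sent them, so the censor-versus-measurement-institute question is still open.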

But again, I agree this is a very insightful paper, and it might be worth thinking about good ways for the circumvention community to identify possible active probing attacks, for a better understanding of them and a better defense against them.

*from Examining How the Great Firewall Discovers Hidden Circumvention Servers by Ensafi et al., in contrast to proactive probing

wkrp commented 9 months ago

This behavior is associated with a feature called FireEye Advanced URL Detection Engine (FAUDE). FAUDE is a component in a suite of features designed to identify and block malicious URLs… Key to their approach is a classifier that identifies suspicious URLs in real time for further evaluation by a cloud-based service (in the NX context these are unencrypted URLs observed over passively monitored links). This design has two ramifications: first, that there is outbound telemetry about such URLs from the product to the FireEye cloud and second, that the FireEye cloud service then visits those URLs to enable further analysis.

The source of these fetches was distributed because FireEye, like most companies collecting threat intelligence, must be careful that their data-collection infrastructure is not "fingerprinted" and explicitly blocked by adversaries. Thus, FireEye employs a collection of proxies used to obfuscate their origin. … We observed 568 unique source IPs (hereafter "proxy") which collectively issued 235,393 requests to our sink server during the period of our study.

It says they first discovered the feature, by accident, in July 2020. They reported it to FireEye then, saying that certain aspects were security flaws (such as blindly trusting Host headers, turning the FireEye proxies into instruments of CSRF). They do not report the list of FireEye proxy IP addresses, though that would likely be easy to figure out, given access to a network path with a FAUDE-equipped device installed.

Because FAUDE treats the captured Host header as canonical, it is possible to drive the FireEye proxy hosts to issue arbitrary GET requests to any host…

The OSS paper from 2013 (see also talk slides) showed how systems like this, that make HTTP requests on demand, can conceivably be used as a circumvention proxy, either for one-shot rendezvous or as a main data channel. The main idea is to embed the data you want to send in the URL. If the scanning service follows HTTP redirects (it's not clear if FAUDE does), then communication can even be bidirectional.
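
As a minimal sketch of the "embed the data in the URL" idea (this is not the OSS implementation; the `collector.example` host, the `seq/token` path layout, and the chunk size are made up for illustration), a payload can be split into chunks, each chunk base64url-encoded into the path of one URL, and reassembled from the requests that arrive at the other end:

```python
import base64

BASE = "http://collector.example/"  # hypothetical receiving web server

def encode_urls(payload, base=BASE, chunk_size=30):
    """Split payload (bytes) into chunks and embed each in a URL as seq/token."""
    urls = []
    for seq, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        # urlsafe base64 never produces "/" so the path stays unambiguous
        token = base64.urlsafe_b64encode(chunk).decode("ascii").rstrip("=")
        urls.append(f"{base}{seq}/{token}")
    return urls

def decode_urls(urls, base=BASE):
    """Rebuild the payload from URLs seen by the receiver, in any order."""
    parts = {}
    for url in urls:
        seq, token = url[len(base):].split("/", 1)
        padded = token + "=" * (-len(token) % 4)  # restore stripped padding
        parts[int(seq)] = base64.urlsafe_b64decode(padded)
    return b"".join(parts[i] for i in sorted(parts))
```

The sequence number is there because the fetches may arrive out of order or from different source addresses. This covers only the one-way direction; the bidirectional case depends on redirect handling, which, as noted, is unclear for FAUDE.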

The FireEye paper is somewhat reminiscent of some strange network behavior some colleagues and I observed while scanning the Internet for SNI proxies in 2016:

https://www.bamsoftware.com/computers/sniproxy/#parallel-scans

Since there were 2,500 SNI proxies, we should have gotten 2,500 connections from them alone, ignoring Internet background traffic. In fact, we received 6,627 connections (6,474 with the correct SNI), and 3,696 came from a single IP address, 192.107.156.196.

208.80.194.26 (static-208-80-194-26.as13448.com) is interesting: AS13448 belongs to Websense, a web filtering company. They must do centralized, automatic scans of SNIs seen by their firewall installations.

The clear outlier is 192.107.156.196. During the 8 hours of the scan, this IP address made TLS connections with the correct SNI to our dedicated HTTPS server at a roughly constant rate. It stopped making connections after our scan stopped. Our scan had not touched anything in 192.107.156.0/24 when the first connections from 192.107.156.196 began to arrive.

To be clear: we measured these connections at the dedicated web server running at sni-scan-for-research-study.bamsoftware.com, the host specified in our SNI, not at the host running ZMap. In other words, it was not a matter of a host back-scanning the source of a detected scan. Instead, while we were scanning random HTTPS servers, an unrelated host—192.107.156.196—was scanning the same host that we specified in our SNI.

One might suppose that 192.107.156.196 is merely a shared exit point for many other SNI proxies. But this can't be the case. For one thing, 192.107.156.196 made more connections to the HTTPS server than we made in total. And for another, the TLS client handshake differed from the ones that our scanner program produced—for example, it used session resumption. If it were merely tunneling our own connections, it would not have been able to modify the TLS handshake and still have validation succeed. Instead, this host was independently sending its own TLS probes while we were doing our scan.

whois says the IP address belongs to Harris Corporation.
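
For what it's worth, the bookkeeping behind an observation like this is simple. A sketch, assuming you have one source IP per connection from the web server's log (the function name and input format are mine, not taken from the original scan tooling):

```python
from collections import Counter

def source_shares(source_ips):
    """Return (ip, connection count, share of all connections), busiest first."""
    counts = Counter(source_ips)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ip, n, n / total) for ip, n in counts.most_common()]
```

With the figures quoted above, 3,696 of 6,627 connections is roughly 56% from a single address, which is hard to miss in a listing like this.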
