ANRW talk on inferring contents of HTTPS connections

wkrp commented 5 years ago

The Applied Networking Research Workshop next week (July 22) will have an invited talk called "Limitless HTTP in an HTTPS World: Inferring the Semantics of the HTTPS Protocol without Decryption."

Two of the speakers (Blake Anderson and David McGrew) have previously published research papers on inferring the contents of TLS without decryption:

Enhanced telemetry for encrypted threat analytics, McGrew and Anderson, 2016
Deciphering Malware's use of TLS (without Decryption), Anderson, Paul, and McGrew, 2016
Identifying Encrypted Malware Traffic with Contextual Flow Data, Anderson and McGrew, 2016

https://lists.w3.org/Archives/Public/ietf-http-wg/2019JulSep/0072.html

I believe this talk will be of interest to most of our community, so I wanted to highlight it as you plan your week if you'll be coming to Montreal.

on Monday at 14:30, conveniently before the first HTTPbis meeting of the week:

Limitless HTTP in an HTTPS World: Inferring the Semantics of the HTTPS Protocol without Decryption
Invited talk
Blake Anderson (Cisco), Andrew Chi (University of North Carolina), Scott Dunlop (Cisco), and David McGrew (Cisco)

The ANRW program page says the talk will be recorded:

The ANRW will be streamed live and recorded. Remote participation will be provided using the IETF remote participation system. You can find more details and information on how to register for remote participation on the IETF 105 meeting page. There is no charge for remote participation, but pre-registration is required.

Recordings will be made available after the workshop.

wkrp commented 5 years ago

The talk turns out to have been based on a paper from CODASPY 2019:

The paper shows how to passively infer certain HTTP features from an HTTPS connection, given only the stream of TLS records. These HTTP features include the status-code, the method, the contents of certain header fields, and the presence/absence of certain other header fields. The generally high accuracy with which these features are inferred shows a partial failure of TLS to provide confidentiality. They apply the technique to a "defensive" use case (malware classification) and an "offensive" use case (website fingerprinting), and propose the technique as an alternative to full HTTPS MITM interception and the associated difficulties.
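To make "given only the stream of TLS records" concrete, here is a minimal sketch of what a passive observer can read without any keys: the 5-byte record header (content type, version, length) of each TLS record. This is not the paper's feature extractor; the names `TLSRecord`, `iter_tls_records`, and `application_data_lengths` are made up for illustration, and the input is assumed to be one direction of an already reassembled TCP byte stream.

```python
import struct
from typing import Iterator, List, NamedTuple, Tuple


class TLSRecord(NamedTuple):
    content_type: int         # 20=change_cipher_spec, 21=alert, 22=handshake, 23=application_data
    version: Tuple[int, int]  # (major, minor), e.g. (3, 3) for TLS 1.2
    length: int               # length in bytes of the (possibly encrypted) record body


def iter_tls_records(stream: bytes) -> Iterator[TLSRecord]:
    """Walk a reassembled TCP byte stream and yield TLS record headers.

    Only the cleartext 5-byte header is parsed; the record body is skipped.
    """
    offset = 0
    while offset + 5 <= len(stream):
        ctype, major, minor, length = struct.unpack_from("!BBBH", stream, offset)
        if offset + 5 + length > len(stream):
            break  # truncated final record: stop rather than guess
        yield TLSRecord(ctype, (major, minor), length)
        offset += 5 + length


def application_data_lengths(stream: bytes) -> List[int]:
    """Return the sequence of application_data record lengths (hypothetical feature)."""
    return [r.length for r in iter_tls_records(stream) if r.content_type == 23]
```

A sequence of record types and lengths like this is the kind of raw observable that an inference model could be trained on, once each sample is paired with ground-truth HTTP labels.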

The technique requires a large, diverse, and up-to-date corpus of labeled samples; i.e., correlations between HTTPS features and the underlying ground-truth HTTP features. (In the talk, Anderson says that 99% of what they did was building the training datasets, and the remaining 1% was some light machine learning.) They focused on HTTP/1.1 and HTTP/2 in TLS 1.2, with Firefox, Chrome, Tor Browser, and a collection of Windows malware. To get ground-truth decryptions of recorded traffic, they needed to recover keys. In Firefox and Chrome, they used the SSLKEYLOGFILE environment variable. In Tor Browser and the malware, they extracted keys from memory snapshots.
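For readers unfamiliar with SSLKEYLOGFILE: when that environment variable is set, Firefox and Chrome append per-connection secrets to the named file in the NSS key log format, one `<label> <client_random hex> <secret hex>` entry per line. The sketch below parses such a file into a lookup table. It is only an illustration of the file format, not the authors' tooling; `parse_keylog` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def parse_keylog(text: str) -> Dict[Tuple[str, bytes], bytes]:
    """Parse NSS key log text into {(label, client_random): secret}.

    For TLS 1.2 the label is CLIENT_RANDOM and the secret is the
    48-byte master secret. Comment lines (starting with '#'), blank
    lines, and malformed lines are skipped.
    """
    secrets: Dict[Tuple[str, bytes], bytes] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        label, client_random_hex, secret_hex = parts
        try:
            key = (label, bytes.fromhex(client_random_hex))
            secrets[key] = bytes.fromhex(secret_hex)
        except ValueError:
            continue  # not valid hex: ignore the line
    return secrets
```

The client random also appears in cleartext in the ClientHello, so a recorded connection can be matched to its logged secret and then decrypted offline (for example by a tool such as Wireshark) to obtain the ground-truth HTTP labels.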

The effects on malware classification were good. They enriched standard malware classification features, such as the list of ciphersuites, with their HTTP inference features and notably improved over the state of the art. In contrast, the additional HTTP features did not help with website fingerprinting against Tor Browser, which the authors attribute mainly to Tor's multiplexing of multiple HTTPS sessions within one TLS connection.

wkrp commented 5 years ago

The video of the talk by Blake Anderson is now available: 19 minutes of talk, followed by 6 minutes of questions.

YouTube video (archive)

slides (archive)
