net4people / bbs

Forum for discussing Internet censorship circumvention

CBS: A Deep Learning Approach for Encrypted Traffic Classification with Mixed Spatio-Temporal and Statistical Features #385

irgfw commented 4 weeks ago

This paper is connected to the new minister of communications (Sattar Hashemi) in Iran: https://x.com/ircfspace/status/1823434976844448095

The date of this paper is also close to some upgrades of the IRGFW and its internet censorship capabilities.

Abstract

With the rapid development of the internet and online applications, traffic classification not only is an attractive topic in the field of computer networks but also plays a critical and vital role in managing network resources, enhancing the quality of network service and cyber-security. Internet traffic encryption has recently received significant attention because of the growing number of applications and the necessity for privacy. Traffic encryption techniques have caused conventional traffic classification approaches to become inefficient and inaccurate. Due to the limitations of conventional traffic classification methods, such as port-based, payload-based, and machine learning-based techniques, the scientific community currently regards deep learning as a high-performance approach to classifying encrypted traffic. In this paper, an encrypted traffic classification approach based on a deep learning technique, CBS, is proposed. CBS can classify encrypted traffic at two levels using CNN, attention-based Bi-LSTM, and SAE deep network models. The proposed model classifies the types of traffic and applications based on a comprehensive set of session and packet-level features. After traffic preprocessing, the session and packet features are fed into the proposed framework. In addition, a traffic data augmentation technique based on a GAN network is applied to alleviate the impact of imbalanced data on particular traffic classes. The performance of the proposed framework was evaluated on the public ISCX VPN-Non VPN 2016 dataset. The results demonstrate that the framework accurately and efficiently identifies the application and classifies encrypted traffic. Compared to the state-of-the-art methods, the proposed traffic classification model improved precision by 21.56%, recall by 18.33%, and F1 by 19.98%.

Keywords: Encrypted Traffic, Deep Learning, Traffic Classification, Imbalanced data, Packet Features

Link 1: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4189457
Link 2: https://www.researchgate.net/publication/376543417_CBS_A_Deep_Learning_Approach_for_Encrypted_Traffic_Classification_with_Mixed_Spatio-Temporal_and_Statistical_Features
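For anyone less familiar with the metrics quoted in the abstract, here is a minimal sketch of per-class precision, recall, and F1, and of why imbalanced classes matter when reading them. The class names and labels are invented for illustration; nothing here comes from the paper or from the ISCX dataset.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} for a multi-class prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics

# Hypothetical imbalanced test set: 8 "web" flows, 2 "voip" flows.
# The classifier lazily predicts "web" for everything.
y_true = ["web"] * 8 + ["voip"] * 2
y_pred = ["web"] * 10

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
print(accuracy, metrics, macro_f1)
```

In this toy case accuracy is 0.8 even though the minority class is never detected, which is the kind of effect that data augmentation for underrepresented classes is meant to counter.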

wallpunch commented 2 weeks ago

Has there been any research into using deep learning to discover/generate obfuscation protocols that are particularly hard for other deep learning models to classify correctly? Basically swapping the "generator" and "discriminator" elements of the GAN method described in this paper. Intuitively it seems like obfuscation should be easier than classification if both sides have equivalent resources.

wkrp commented 2 weeks ago

> Has there been any research into using deep learning to discover/generate obfuscation protocols that are particularly hard for other deep learning models to classify correctly? Basically swapping the "generator" and "discriminator" elements of the GAN method described in this paper. Intuitively it seems like obfuscation should be easier than classification if both sides have equivalent resources.

I don't know that sub-area of research super well, but there are at least a couple. Maybe others can suggest more.

Learning to Behave: Improving Covert Channel Security with Behavior-Based Designs (PETS 2022)
Ryan Wails, Andrew Stange, Eliana Troper, Aylin Caliskan, Roger Dingledine, Rob Jansen, Micah Sherr

Censorship-resistant communication systems generally use real-world cover protocols to establish a covert channel through which uncensored communication can occur. Unfortunately, many previously proposed systems use cover protocols inconsistently with the way humans normally use those protocols, leading to anomalous network traffic patterns that have been shown to be discoverable by real-world censors. In this paper, we argue that censorship-resistant communication systems should follow two behavior-based design properties: (i) behavioral independence: systems should isolate the operation of their covert channels from the operation of their cover protocols, and (ii) behavioral realism: systems should either opportunistically use existing genuine cover protocol instances or run new protocol instances that are modeled after genuine ones. These properties ensure that the behavior of a system’s users will not degrade its security. We demonstrate how to achieve these properties through the design and evaluation of Raven, a censorship-resistant messaging system that uses email cover protocols identically to the way humans use email. Raven uses a generative adversarial network that is trained on genuine email data to control the timing and sizes of the email messages it sends and receives, and these messages are transferred independently of user actions. Our evaluation shows that, compared to the state-of-the-art email-based Mailet system, Raven raises the false-positive rate from 3% to 50% when detecting covert channel usage with 100% recall.
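To make the "false-positive rate at 100% recall" metric concrete, here is a minimal sketch of how it could be computed from detector scores. The function name and the toy scores are my own assumptions for illustration, not anything from the Raven paper.

```python
def fpr_at_full_recall(scores, labels):
    """False-positive rate at the strictest threshold that still flags
    every covert sample.

    scores: detector output, higher means "more likely covert".
    labels: 1 = covert channel, 0 = genuine traffic.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one covert and one genuine sample")
    # For 100% recall the threshold cannot exceed the lowest covert score.
    threshold = min(positives)
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)

# Toy scores, invented for illustration only.
scores = [0.9, 0.8, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(fpr_at_full_recall(scores, labels))  # 0.4
```

A higher value is better for the circumvention system: it means a censor that insists on catching every covert flow must also misflag that fraction of genuine traffic.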

Voiceover: Censorship-Circumventing Protocol Tunnels with Generative Modeling (FOCI 2023)
Watson Jia, Joseph Eichenhofer, Liang Wang, Prateek Mittal

Censorship regimes are continuously adopting and deploying state-of-the-art techniques to detect and prosecute open communication on the internet. Multimedia protocol tunneling seeks to disguise covert data communication by processing it directly through a legitimate audio/video communication system. Systems like VoIP and video streaming services use variable bitrate encoding schemes, which leak characteristics of the content they carry through packet sizes and timing. In what we call a content mismatch attack, censors can distinguish between a channel carrying legitimate media content and one carrying covert data content. We address content mismatch attacks by introducing a novel traffic-shaping technique that models the normal media content and applies its properties to the covert content. We constructed a generative machine learning model to restrict covert data transmission such that its timing properties match properties learned from real two-person conversations. Our evaluation finds that modeling the timing properties in the application layer content reduces distinguishing features in the encrypted network traffic. This mitigates content mismatch attacks on coarse-grained timing properties.
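As a rough illustration of the general idea of shaping covert transmission to match conversation timing, here is a sketch that schedules covert bytes only during "talk" intervals. It swaps in plain resampling from observed durations for the paper's generative model, and every name and number in it is a hypothetical of mine, not Voiceover's design.

```python
import random

def shaped_schedule(payload_len, talk_durations, silence_durations,
                    bytes_per_second, seed=0):
    """Plan covert sends as alternating talk/silence intervals.

    Durations are resampled from example conversation data (invented here).
    Returns a list of (start_time, talk_duration, n_bytes) tuples.
    """
    if min(talk_durations) * bytes_per_second < 1:
        raise ValueError("every talk interval must carry at least one byte")
    rng = random.Random(seed)
    schedule = []
    clock = 0.0
    remaining = payload_len
    while remaining > 0:
        talk = rng.choice(talk_durations)
        # Send only as much as fits in this talk interval.
        n_bytes = min(remaining, int(talk * bytes_per_second))
        schedule.append((clock, talk, n_bytes))
        remaining -= n_bytes
        # Stay quiet for a sampled pause before the next interval.
        clock += talk + rng.choice(silence_durations)
    return schedule

# Hypothetical durations in seconds, standing in for real measurements.
talk = [0.8, 1.5, 2.2, 3.0]
silence = [0.3, 0.6, 1.1]
plan = shaped_schedule(10_000, talk, silence, bytes_per_second=2_000)
for start, duration, n_bytes in plan:
    print(f"t={start:6.2f}s  talk={duration:.1f}s  send={n_bytes} bytes")
```

The trade-off this makes visible is throughput: the covert payload has to wait for the cover behavior, so better-matched timing costs transfer time.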

It's a promising approach to the "what should traffic shaping look like" question, as considered in #281 and elsewhere.