shadowsocks / shadowsocks-org

www.shadowsocks.org
MIT License

Traffic-identification paper #81

Closed wongsyrone closed 6 years ago

wongsyrone commented 7 years ago

https://arxiv.org/pdf/1709.02656.pdf

https://arxiv.org/abs/1709.02656

Mygod commented 7 years ago

https://arxiv.org/pdf/1312.6199v1.pdf

madeye commented 7 years ago

If we consider encryption as a function f(x) (linear or non-linear doesn't matter), it's not surprising that a DNN also works here to detect the underlying characteristics in the ciphertext.

Mygod commented 7 years ago

@madeye Incorrect. Since neural network evaluation is a polynomial time algorithm (assuming the corresponding assumption in modern cryptography holds), it can only classify packet content as random noise. Look at section 5 where they said they failed to classify Tor traffic.

wongsyrone commented 7 years ago

I'm no expert on this; I'll just cite a paragraph I think matters.

Ideal encryption technique

AEAD and other schemes developed by crypto experts

causes the data to bear the most possible entropy [47]. In other words, it produces pattern-less data that theoretically cannot be discerned from one another.

Tools like sssniff

However due to the fact that all encryption schemes use pseudo-random generators, this hypothesis is not always true in practice. Each application employs its own ciphering scheme for packets encryption. These schemes utilize pseudo-random algorithms which bear distinct patterns. These patterns can be used to distinguish applications from one another.

The initial design goal of Shadowsocks, being indistinguishable from random bytes, becomes the very pattern used to detect its traffic.
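To make the "indistinguishable from random bytes" point concrete, here is a minimal stdlib-only sketch of the kind of entropy test a classifier could apply to packet payloads; the payload values are synthetic stand-ins, not real captures.

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Estimate bits of entropy per byte from the byte histogram."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# A well-encrypted payload looks like uniform random bytes, close to 8 bits/byte...
random_payload = os.urandom(4096)  # stand-in for Shadowsocks ciphertext
# ...while plaintext protocols sit well below that.
http_payload = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 80

print(shannon_entropy(random_payload))  # near 8.0
print(shannon_entropy(http_payload))    # noticeably lower
```

Ironically, "suspiciously close to 8 bits/byte on the very first packet" is itself a usable feature, which is exactly the pattern the quote describes.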


For Tor, if I understand it correctly, they know you are using Tor but cannot determine the Tor payload, e.g., whether you are visiting Facebook, YouTube, etc. The payload doesn't matter at all; the censor can simply block the connections.

madeye commented 7 years ago

@Mygod Hmm, what I read is

It is noticeable from Figs. 6a and 6b and Table 2 that Tor traffic was successfully classified.

madeye commented 7 years ago

@Mygod And yes, it cannot be classified accurately.

Detailed result can be found at Table 5, but it is obvious that our network was unable to classify Tor's traffic accurately

madeye commented 7 years ago

It's a little bit surprising that Tor can be detected with 100% recall and 100% precision.


Two possible reasons:

  1. Tor traffic is very different from other traffic, e.g. high entropy of the first packets, like Shadowsocks.
  2. They are over-fitting. 😃
Mygod commented 7 years ago

I'm just pointing out confidentiality makes classifying what's being transmitted via Tor hard. From 3.2, its input vector consists of:

Everything besides the payload data (which is random noise) can be used to leak some information about what's being transmitted via Tor.

What @wongsyrone is concerned about is actually steganography. Yes, confidentiality only makes it hard to distinguish what's being transmitted.

wongsyrone commented 7 years ago

Another thought on this.

The training requires data, which has to be pre-processed. Is there any way to make the data collection and processing difficult?

The packet-length table shows that many packets are zero-length. I'm not sure whether they are control packets like ACK, SYN, etc.
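Zero-length packets are most likely pure control segments: ACK, SYN, and FIN segments carry no payload. Assuming you have the raw IPv4/TCP header fields, a preprocessing step could separate them like this (a sketch, not taken from the paper):

```python
def tcp_payload_len(ip_total_len: int, ip_ihl: int, tcp_data_offset: int) -> int:
    """TCP payload size in bytes, computed from IPv4/TCP header fields.

    ip_total_len:    IPv4 Total Length field (bytes)
    ip_ihl:          IPv4 IHL field (32-bit words)
    tcp_data_offset: TCP Data Offset field (32-bit words)
    """
    return ip_total_len - ip_ihl * 4 - tcp_data_offset * 4

# A pure ACK is a 40-byte IPv4 packet: 20-byte IP header + 20-byte TCP header.
print(tcp_payload_len(40, 5, 5))    # 0 -> control segment, no payload
# A full-size data segment on a 1500-byte MTU link:
print(tcp_payload_len(1500, 5, 5))  # 1460
```

If the paper's zero-length entries are just ACKs, they carry no application information and any classifier keeping them is effectively learning TCP mechanics, not the protocol on top.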

wongsyrone commented 7 years ago

From my observation, the GFW prefers to cut off the root of the tree to make censorship effective.

DNS: send a fake response just before the real one arrives. HTTPS: reset the connection after a keyword is triggered. Tor: block directory servers on the list, which can be collected via fake bridges or other kinds of abuse.

Mygod commented 7 years ago

However due to the fact that all encryption schemes use pseudo-random generators, this hypothesis is not always true in practice. Each application employs its own ciphering scheme for packets encryption. These schemes utilize pseudo-random algorithms which bear distinct patterns. These patterns can be used to distinguish applications from one another.

This author clearly does not know what pseudorandomness is. Whatever can be distinguished from true randomness (within polynomial time, of course) is not pseudorandom. The real reason should be that his input takes a lot of extra data from the packet, not just the content. (Another possibility is that the encryption scheme is faulty.)

@wongsyrone If GFW is actively interrupting connections, we can easily take countermeasures, even if they employ DNNs (see the paper I posted in the first comment).

wongsyrone commented 7 years ago

his input

The Deep Packet paper says the dataset is already labeled. Does that mean they already know something and just need to prove it, instead of distinguishing one thing from another?

Mygod commented 7 years ago

By the definition of classifying, no.

Mygod commented 7 years ago

@wongsyrone This should be an interesting read for you: https://arxiv.org/pdf/cs/0702161.pdf

zasdfgbnm commented 7 years ago

@Mygod Hmm... I don't think the computational indistinguishability of pseudorandom generators implies infeasibility of classifying them in real-world usage.

The definition of a pseudorandom generator G is: for the uniform distribution U, {G(U)} and {U} are computationally indistinguishable. Clearly, if you have two pseudorandom generators G and H, since both {G(U)} and {H(U)} are computationally indistinguishable from {U}, {G(U)} and {H(U)} are computationally indistinguishable from each other.

But in real-world usage, the input of G and H is not the uniform distribution U but something else, say the distribution of YouTube videos. In this case, the outputs of G and H might be very different, making it possible to distinguish them from each other.
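A toy illustration of this gap (hypothetical protocols, stdlib only): two schemes whose ciphertext bytes are both uniform can still be told apart by framing metadata such as record lengths, even though no polynomial-time test separates the bytes themselves.

```python
import os

def frames_protocol_a(n_bytes):
    """Toy protocol A: 1024-byte records, each with a 16-byte auth tag."""
    sizes = []
    while n_bytes > 0:
        chunk = min(1024, n_bytes)
        sizes.append(len(os.urandom(chunk + 16)))  # record bytes are uniform
        n_bytes -= chunk
    return sizes

def frames_protocol_b(n_bytes):
    """Toy protocol B: 400-byte records with a 2-byte header and 16-byte tag."""
    sizes = []
    while n_bytes > 0:
        chunk = min(400, n_bytes)
        sizes.append(len(os.urandom(chunk + 18)))
        n_bytes -= chunk
    return sizes

# Same 3000-byte message, indistinguishable bytes, distinguishable framing:
print(frames_protocol_a(3000))  # [1040, 1040, 968]
print(frames_protocol_b(3000))  # [418, 418, 418, 418, 418, 418, 418, 218]
```

The "input distribution" here is the message-length distribution: once it is not uniform, the record boundaries leak which scheme produced the stream.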

Mygod commented 7 years ago

@zasdfgbnm I agree. But I didn't mention pseudorandom generators at all, did I? Yes, the article mentioned PRGs, but most secure ciphers are designed to be (computationally) indistinguishable from random noise regardless of PRGs. We aren't just invoking PRGs to build encryption schemes; that would be ridiculous.

zasdfgbnm commented 7 years ago

@Mygod

However due to the fact that all encryption schemes use pseudo-random generators, this hypothesis is not always true in practice. Each application employs its own ciphering scheme for packets encryption. These schemes utilize pseudo-random algorithms which bear distinct patterns. These patterns can be used to distinguish applications from one another.

Yes, I totally agree with you here that this explanation is ridiculous and the author clearly does not understand computational indistinguishability well.

I'm just saying that, although ciphers are transformations whose outputs are computationally indistinguishable from one another given uniformly distributed input, it is still possible to sniff the protocol, because that is a different problem from distinguishing two pseudorandom functions.

wongsyrone commented 6 years ago

Who can download this: The Random Forest Based Detection of Shadowsock's Traffic

http://ieeexplore.ieee.org/document/8048116/

wongsyrone commented 6 years ago

@madeye @Mygod

download and save as pdf file

EDIT: link removed

zeptoTantalum commented 6 years ago

I'm a bit involved in the field of deep learning. This paper really confuses me. If I'm not mistaken, they are using a CNN and an SAE on the content of a single packet, instead of sequences of packets. In that case, how could they distinguish between Vimeo and YouTube, both of which use HTTPS? Since they are working on a single packet, most of those packets are completely encrypted with no extra information. I wonder if anyone can spare some time to investigate this dataset, reproduce the result, and find out what exactly their model has learned.

wongsyrone commented 6 years ago

Confused me as well.

Another one that tries to identify Tor traffic may contain what we need: http://ieeexplore.ieee.org/document/8048117/

zeptoTantalum commented 6 years ago

@wongsyrone @madeye It seems this paper follows a pretty normal pattern for machine-learning applications. It's not an amazing one; it just raises the question again, one that has led to a lot of argument among those working against the GFW: is it possible to identify Shadowsocks traffic only by extracting features from network traffic? That's the real problem; whether they use a CNN/auto-encoder/MLP/random forest/SVM doesn't really matter.

wongsyrone commented 6 years ago

These days, many people claim their Shadowsocks servers have been down for hours and we don't know why. Sad.

zeptoTantalum commented 6 years ago

@wongsyrone Personally, I am a bit suspicious of what @breakwa11 has claimed. The problem is that no one really knows how the GFW works. @breakwa11 has done a lot of work trying to discover the network-traffic features of Shadowsocks/v2ray/obfs4; those features basically come from the length, direction, and timestamp of each packet. And the way she proposes to hide those features is to randomize the lengths of those packets. However, it's hard to believe that real-world connections really have packets with an almost uniform distribution of lengths. That just introduces another kind of feature.

zeptoTantalum commented 6 years ago

If it is really possible to identify Shadowsocks via packet-length sequences, a better way might be to use machine learning to learn a generative model of the packet-length pattern of a "normal" website (WordPress, maybe?), and use this model to generate the lengths of TCP transport units. But this really sounds complex, and whether it is necessary is unknown.
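The simplest version of such a generative model is just the empirical length distribution of the target site, sampled from directly. A sketch (the observed lengths are hypothetical, not a real capture):

```python
import random
from collections import Counter

def fit_length_model(observed):
    """Fit an empirical distribution over packet lengths seen on a 'normal' site."""
    counts = Counter(observed)
    total = sum(counts.values())
    lengths = sorted(counts)
    weights = [counts[l] / total for l in lengths]
    return lengths, weights

def sample_lengths(model, n, rng=None):
    """Draw n packet lengths mimicking the fitted site's distribution."""
    lengths, weights = model
    return (rng or random).choices(lengths, weights=weights, k=n)

# Hypothetical capture of a WordPress-like site: ACKs, small responses, full segments.
observed = [52] * 300 + [590] * 50 + [1460] * 650
model = fit_length_model(observed)
print(sample_lengths(model, 5, random.Random(0)))
```

This ignores ordering and timing, which a real model would also have to mimic; length marginals alone are only the first feature to hide.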

zeptoTantalum commented 6 years ago

The best way to configure Shadowsocks, in my opinion, is to purchase a domain, set up a personal website using TLS, reverse-proxy to Shadowsocks via WebSocket, and use the Cloudflare CDN. Even ignoring traffic patterns related to packet length, this is still a little suspicious: you are connecting to a very unpopular domain name (revealed by the SNI) for a long time, and with surprisingly high traffic. This can only be resolved by the method used in Tor's meek. However, CDNs that allow the HTTP Host header to differ from the SNI are very expensive. So there might be no "perfect" way to break the GFW.
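A minimal sketch of the reverse-proxy half of that setup, assuming nginx; the domain, certificate paths, WebSocket path, and backend port are all placeholders, not values from this thread.

```nginx
server {
    listen 443 ssl http2;
    server_name example.com;  # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    # Ordinary website content served at / as cover traffic.
    root /var/www/html;

    # WebSocket upgrade to a local proxy listener at an innocuous path.
    location /ws {
        proxy_pass http://127.0.0.1:10000;  # placeholder backend port
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

To an outside observer this looks like a normal HTTPS site; only the long-lived, high-volume connection pattern remains as a feature, which is exactly the residual problem described above.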

wongsyrone commented 6 years ago

@zeptoTantalum

is to purchase a domain, set up a personal website using TLS

This is similar to v2ray; do you have any ideas on it?

zasdfgbnm commented 6 years ago

Do we really need to worry about deep learning based packet sniffing? GFW is something that should work at a rate close to link speed. Deep learning is slow and I don't think it is possible to use deep learning methods to look at every international packet.

zeptoTantalum commented 6 years ago

@wongsyrone https://github.com/shadowsocks/simple-obfs/pull/74 It seems simple-obfs already supports it? @zasdfgbnm Using deep learning to inspect the content of every packet sounds impossible, but it's still possible to just record network activity and use a machine-learning model to analyze it periodically. Deep learning is not always that slow: huge models like VGG/ResNet, with tens or hundreds of millions of parameters, are slow, but the GFW would use much simpler ANN architectures. And there are still "traditional" ML methods available. That being said, I still don't know whether they really use ML to build the GFW.

DarienRaymond commented 6 years ago

@zeptoTantalum Sadly it is not a true WebSocket. Easily detectable.

ghost commented 6 years ago

@wongsyrone https://sci-hub.bz/10.1109/IHMSC.2017.132 for the random forest based detection of shadowsocks's traffic

wongsyrone commented 6 years ago

Thanks. But http://ieeexplore.ieee.org/document/8048117/ is more interesting, while most people ignored it.

Mygod commented 6 years ago

@wongsyrone Papers these authors spammed are far less interesting than the one you originally posted.

It's not very realistic for a country-scale firewall to record every international packet and do statistical analysis.

Mygod commented 6 years ago

Just to clarify: the paper from the original thread classifies Tor traffic without pluggable transport. This is something GFW already can do back in 2011 without deep learning. (source)

wongsyrone commented 6 years ago

Might be inspired by http://ieeexplore.ieee.org/document/7249493/

wongsyrone commented 6 years ago

another one: http://ieeexplore.ieee.org/document/7120915/

Mygod commented 6 years ago

Not sure why you're bringing these up. For 7249493, it's basically learning network-traffic behavior in a wireless setting where every packet is broadcast. This is much harder to do on the Internet.

For 7120915:

The proposed covert channel has the following limitations. First, it can only be used to transmit small amount of auxiliary data, but cannot be used to transfer data of large size. Transferring big amount of data in the covert channel is very inefficient and easy to be detected...

Mygod commented 6 years ago

Statistical flow analysis is generally expensive to implement on country-scale firewalls. The second one is less relevant. By the way, if they really want to attack anonymity networks, they should target Tor instead; Shadowsocks only offers a moderate amount of anonymity protection.