onekey-sec / unblob

Extract files from any kind of container formats
https://unblob.org

Add Chi square measure to EntropyReport #993

Open qkaiser opened 6 days ago

qkaiser commented 6 days ago

Currently, unblob relies on unblob-native's shannon_entropy to calculate the entropy levels of chunks. While valid, this approach is limited: Shannon entropy alone cannot differentiate between compressed and encrypted data streams, because both tend to score near the maximum.

An improved approach involves Chi-square tests. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values because some residual structure survives compression (albeit less detectably), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity: each byte value from 0 to 255 is expected to appear with almost equal frequency, leaving no discernible pattern.

According to ent:

The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated.
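To make the measure concrete, here is a minimal Python sketch of both numbers described above: the absolute chi-square statistic over the byte histogram and an estimate of the percentage. The function names are hypothetical and this is not the unblob-native implementation. The probability uses the Wilson-Hilferty normal approximation to stay within the standard library, which is an assumption of this sketch rather than what ent does.

```python
import math
from collections import Counter


def chi_square(data: bytes) -> float:
    """Pearson chi-square statistic of the byte histogram vs. a uniform distribution."""
    if not data:
        raise ValueError("data must not be empty")
    expected = len(data) / 256  # expected count of each byte value if uniform
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return sum((counts[value] - expected) ** 2 / expected for value in range(256))


def chi_square_probability(statistic: float, degrees_of_freedom: int = 255) -> float:
    """Approximate probability that a truly random stream exceeds `statistic`.

    Wilson-Hilferty approximation (assumption of this sketch, not exact).
    """
    k = degrees_of_freedom
    variance = 2 / (9 * k)
    z = ((statistic / k) ** (1 / 3) - (1 - variance)) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
```

With 256 possible byte values there are 255 degrees of freedom, so a statistic close to 255 is what a random stream typically produces. Probabilities very close to 0 or 1 both indicate a distribution that is unlikely to be random.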

It would be nice to expose chi-square levels the same way we expose Shannon entropy levels through EntropyReport.

We could create two EntropyReport subclasses: ShannonEntropyReport and ChiSquareEntropyReport. They would not need to extend the parent class further, since both use values that can be interpreted as percentages.

The calculate_entropy function would need to be adapted to return two reports rather than one.
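A rough sketch of how this could look, under several assumptions: the `EntropyReport` below is a stand-in dataclass (its field names are guesses, not unblob's actual definition), `calculate_entropy` takes raw bytes and a block size for simplicity, and the chi-square measure is passed in as a hypothetical `chi_square_percent` callable, since the real one would presumably come from unblob-native.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EntropyReport:
    """Stand-in for unblob's EntropyReport; field names are assumptions."""

    percentages: list[float]
    block_size: int
    mean: float


class ShannonEntropyReport(EntropyReport):
    """Per-block Shannon entropy, scaled to 0-100."""


class ChiSquareEntropyReport(EntropyReport):
    """Per-block chi-square probability, scaled to 0-100."""


def shannon_percent(block: bytes) -> float:
    """Shannon entropy of a non-empty block as a percentage of the 8-bit maximum."""
    total = len(block)
    entropy = sum((n / total) * math.log2(total / n) for n in Counter(block).values())
    return entropy / 8 * 100


def _build_report(report_cls, data: bytes, block_size: int, measure):
    # Split the data into fixed-size blocks and apply the measure to each one.
    blocks = [data[i : i + block_size] for i in range(0, len(data), block_size)]
    percentages = [round(measure(block), 2) for block in blocks]
    mean = sum(percentages) / len(percentages) if percentages else 0.0
    return report_cls(percentages=percentages, block_size=block_size, mean=mean)


def calculate_entropy(
    data: bytes,
    block_size: int,
    chi_square_percent: Callable[[bytes], float],  # hypothetical, e.g. from unblob-native
) -> tuple[ShannonEntropyReport, ChiSquareEntropyReport]:
    """Return one report per measure instead of a single EntropyReport."""
    return (
        _build_report(ShannonEntropyReport, data, block_size, shannon_percent),
        _build_report(ChiSquareEntropyReport, data, block_size, chi_square_percent),
    )
```

Keeping the subclasses empty means existing code that consumes an EntropyReport can handle either one, while callers that care can still tell the two apart with `isinstance`.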

qkaiser commented 6 days ago

Some initial work has been done in https://github.com/onekey-sec/unblob-native/pull/69