Currently, unblob relies on unblob-native's `shannon_entropy` to calculate entropy levels of chunks. While valid, this approach is limited: we cannot rely on it to differentiate between compressed and encrypted data streams.
An improved approach involves Chi-square tests. Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.
In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.
According to ent:

> The chi-square test is the most commonly used test for the randomness of data, and is extremely sensitive to errors in pseudorandom sequence generators. The chi-square distribution is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated.
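A minimal, stdlib-only sketch of the statistic ent describes (independent of unblob-native; the function name and the sample inputs here are illustrative). It compares the observed byte histogram against a uniform distribution over the 256 possible byte values, then contrasts a random stream with deflate output of highly repetitive input:

```python
import os
import zlib
from collections import Counter


def chi_square_statistic(data: bytes) -> float:
    """Chi-square statistic of the observed byte histogram against a
    uniform distribution over the 256 possible byte values."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(256)
    )


# A truly random stream yields a statistic hovering around 255 (the degrees
# of freedom), while deflate output of repetitive input retains structure in
# its byte distribution and scores far higher.
random_stat = chi_square_statistic(os.urandom(1_000_000))
compressed_stat = chi_square_statistic(zlib.compress(b"All work and no play " * 100_000))
```

In practice the absolute statistic would be converted to the percentage ent reports via the chi-square survival function (e.g. `scipy.stats.chi2.sf` with 255 degrees of freedom).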
It would be nice to expose Chi-square levels the same way we expose Shannon entropy levels, through `EntropyReport`.
We could create two `EntropyReport` subclasses: `ShannonEntropyReport` and `ChiSquareEntropyReport`. They would not need to extend the parent class further, since both use values that can be interpreted as percentages.
The `calculate_entropy` function would need to be adapted to return two reports rather than one.
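A rough, stdlib-only sketch of how this could fit together. The field names, helper functions, and the bytes-based `calculate_entropy` signature are hypothetical (not unblob's actual report schema), and the chi-square-to-percentage mapping uses a normal approximation where a real implementation would use an exact survival function:

```python
import math
import os
from collections import Counter
from dataclasses import dataclass


@dataclass
class EntropyReport:
    """Base report; per-block values are interpretable as percentages."""

    percentages: list  # one 0-100 value per block
    block_size: int

    @property
    def mean(self) -> float:
        if not self.percentages:
            return 0.0
        return sum(self.percentages) / len(self.percentages)


@dataclass
class ShannonEntropyReport(EntropyReport):
    """Shannon entropy per block, scaled so 8 bits/byte == 100%."""


@dataclass
class ChiSquareEntropyReport(EntropyReport):
    """Chi-square randomness level per block, as a percentage."""


def _shannon_percent(block: bytes) -> float:
    counts = Counter(block)
    entropy = -sum(
        (n / len(block)) * math.log2(n / len(block)) for n in counts.values()
    )
    return entropy / 8 * 100


def _chi_square_percent(block: bytes) -> float:
    expected = len(block) / 256
    counts = Counter(block)
    stat = sum(
        (counts.get(value, 0) - expected) ** 2 / expected for value in range(256)
    )
    # Normal approximation of the chi-square(255) survival function; a real
    # implementation would use an exact routine such as scipy.stats.chi2.sf.
    df = 255
    return 100 * 0.5 * math.erfc((stat - df) / math.sqrt(4 * df))


def calculate_entropy(data: bytes, block_size: int = 1024):
    """Return (ShannonEntropyReport, ChiSquareEntropyReport) for `data`."""
    shannon, chi = [], []
    for offset in range(0, len(data), block_size):
        block = data[offset : offset + block_size]
        shannon.append(_shannon_percent(block))
        chi.append(_chi_square_percent(block))
    return ShannonEntropyReport(shannon, block_size), ChiSquareEntropyReport(chi, block_size)


shannon_report, chi_report = calculate_entropy(os.urandom(64 * 1024))
```

Random data should score high on both metrics, while constant or compressed data separates them: near-zero Shannon entropy, or moderate Shannon entropy with a chi-square level far from what a random stream would produce.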