Closed koheiw closed 2 years ago
The entropy of any vector of equal values is log(K), where K is the length of the vector. So the base-2 entropy of c(2, 2, 2) is:
> log(3, base = 2)
[1] 1.584963
Getting an entropy of 1 for a vector whose length equals the log base (as in the text3 example for feature entropy above) is a special case:
> entropy::entropy(c(2, 2), unit = "log2")
[1] 1
This is pretty standard. I took the default base 2 from Shannon's original article, since it measures information in (binary) bits.
We could add a rescaling argument that divides the result by log(K), which restricts the range to [0, 1], but I think it could be confusing to make this the default.
> vec <- rep(2, 5)
> entropy::entropy(vec, unit = "log2")
[1] 2.321928
> entropy::entropy(vec, unit = "log2") / log(length(vec), base = 2)
[1] 1
> entropy::entropy(vec, unit = "log10") / log(length(vec), base = 10)
[1] 1
The base is 2 in the original article because a coin can take only two values, isn't it? In texts, the number of values the vector can take is the number of unique document/feature identifiers.
Not because of a coin; rather, it was about the most basic state of a signal, represented by bits (binary digits).
Shannon, Claude Elwood. "A mathematical theory of communication." ACM SIGMOBILE Mobile Computing and Communications Review 5.1 (2001): 3-55. (reprinted from the 1948 original)
What if we implement the rescaling as a new argument, normalize = FALSE?
I read a bit about entropy and learnt that these two are the same, since dividing base-2 entropy by log2(K) is the same as computing the entropy in base K:
entropy::entropy(vec, unit = "log2") / log(length(vec), base = 2)
entropy::entropy(vec) / log(length(vec))
(entropy() does not accept a numeric base for unit, so the second line gets the base-K entropy by dividing the natural-log entropy by log(K).)
It is from p. 15 of a nice textbook, Elements of Information Theory (Cover & Thomas).
If we use base = 2, the entropy is the number of binary digits required to convey the information. It is not particularly useful here, but I agree that it is the standard setup.
Agreed it's not particularly useful in the context of feature counts. Happy to implement the normalisation argument, or we could suggest in the help that this normalisation could offer a useful method for comparing entropies across dfms.
I think entropy should be 1 for a uniform distribution and should not exceed that value. If so, base = 2 is not always correct. We could change the default to base = NULL and set it automatically depending on the margin and the size of the input DFM.
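To make the proposal concrete, here is a minimal sketch of the normalization being discussed (the function name is hypothetical, not quanteda's actual API; it just shows that dividing base-2 entropy by log2(K) gives 1 for a uniform vector and less than 1 otherwise):

```r
# Hypothetical sketch: normalized Shannon entropy of a count vector.
# Dividing base-2 entropy by log2(K) is equivalent to computing the
# entropy in base K, so a uniform distribution always scores 1.
normalized_entropy <- function(x) {
  p <- x / sum(x)      # convert counts to probabilities
  p <- p[p > 0]        # treat 0 * log(0) as 0
  -sum(p * log2(p)) / log2(length(x))
}

normalized_entropy(rep(2, 5))   # uniform counts -> 1
normalized_entropy(c(9, 1))     # skewed counts -> less than 1
```

With base = NULL as the default, the base would effectively be set to K per document or feature, which is what this normalization does.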