quanteda / quanteda.textstats

Textual statistics for quanteda

Make the base of entropy to be ndoc() or nfeat() #49

Closed koheiw closed 2 years ago

koheiw commented 2 years ago

I think entropy should be 1 for a uniform distribution and should not exceed that value. If so, base = 2 is not always correct.

We could change the default to base = NULL and set the base automatically depending on the margin and the size of the input DFM (a sketch of this idea follows the example below).

require(quanteda.textstats)
#> Loading required package: quanteda.textstats
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.0
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmt <- as.dfm(matrix(c(2, 2, 2, 1, 0, 2), nrow = 3))
dfmt
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 0 docvars.
#>        features
#> docs    feat1 feat2
#>   text1     2     1
#>   text2     2     0
#>   text3     2     2
textstat_entropy(dfmt, "documents", base = nfeat(dfmt))
#>   document   entropy
#> 1    text1 0.9182958
#> 2    text2 0.0000000
#> 3    text3 1.0000000
textstat_entropy(dfmt, "features", base = ndoc(dfmt))
#>   feature   entropy
#> 1   feat1 1.0000000
#> 2   feat2 0.5793802
textstat_entropy(dfmt, "features")
#>   feature   entropy
#> 1   feat1 1.5849625
#> 2   feat2 0.9182958
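
A minimal sketch of what the proposed base = NULL default could look like, written as a hypothetical wrapper (entropy_auto() is not part of the package):

entropy_auto <- function(x, margin = c("documents", "features"), base = NULL) {
  margin <- match.arg(margin)
  if (is.null(base)) {
    # hypothetical default: take the base from the opposite margin so that a
    # uniform distribution scores exactly 1
    base <- if (margin == "documents") nfeat(x) else ndoc(x)
  }
  textstat_entropy(x, margin, base = base)
}
entropy_auto(dfmt, "documents")  # same result as base = nfeat(dfmt) above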
kbenoit commented 2 years ago

Entropy for any vector of equal values will be log(K) where K is the length of the vector. So entropy with log_2 for c(2, 2, 2) is

> log(3, base = 2)
[1] 1.584963
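
The entropy package gives the same value for the uniform vector:

> entropy::entropy(c(2, 2, 2), unit = "log2")
[1] 1.584963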

Getting entropy of 1 for a vector whose length is equal to the log base (as in the text3 example for feature entropy above) is a special case.

> entropy::entropy(c(2, 2), unit = "log2")
[1] 1

This is pretty standard. I took the default of base 2 from Shannon's original article, since it measures information in (binary) bits.

We could add a rescaling argument to divide the result by log(K), which restricts the range to [0, 1], but I think it could be confusing to make this the default.

> vec <- rep(2, 5)
> entropy::entropy(vec, unit = "log2")
[1] 2.321928
> entropy::entropy(vec, unit = "log2") / log(length(vec), base = 2)
[1] 1
> entropy::entropy(vec, unit = "log10") / log(length(vec), base = 10)
[1] 1
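
Applied to the example dfm above, the same rescaling reproduces the base = ndoc() values (a sketch; it assumes the entropy column of the textstat_entropy() return value can be divided directly):

textstat_entropy(dfmt, "features", base = 2)$entropy / log(ndoc(dfmt), base = 2)
# matches the base = ndoc(dfmt) results above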
koheiw commented 2 years ago

The base is 2 in the original article because a coin can take only two values, isn't it? In texts, the number of values the vector can take is the number of unique document or feature identifiers.

kbenoit commented 2 years ago

Not because of a coin; rather, it was about the most basic unit of a signal, represented by bits or binary digits.

Shannon, Claude Elwood. "A Mathematical Theory of Communication." ACM SIGMOBILE Mobile Computing and Communications Review 5.1 (2001): 3-55. (Reprinted from the 1948 original.)

What if we implement the rescaling as a new argument, normalize = FALSE?
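
A sketch of how such an argument might behave, written here as a hypothetical wrapper rather than the package implementation:

textstat_entropy_norm <- function(x, margin = c("documents", "features"),
                                  base = 2, normalize = FALSE) {
  margin <- match.arg(margin)
  result <- textstat_entropy(x, margin, base = base)
  if (normalize) {
    # divide by log(K) so a uniform distribution scores exactly 1
    k <- if (margin == "documents") nfeat(x) else ndoc(x)
    result$entropy <- result$entropy / log(k, base = base)
  }
  result
}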

koheiw commented 2 years ago

I read a bit about entropy and learnt that these two are the same, i.e. dividing the base-2 entropy by log2(K) gives the entropy computed with base K = length(vec):

p <- vec / sum(vec)
entropy::entropy(vec, unit = "log2") / log(length(vec), base = 2)
-sum(p * log(p, base = length(vec)))  # entropy computed directly with base length(vec)

[screenshot of the textbook passage showing the change-of-base property of entropy]

It is from p. 15 of a nice textbook, Elements of Information Theory.

If we use base = 2, the entropy is the number of binary digits required to convey the information. That is not particularly useful here, but I agree that it is the standard setup.

kbenoit commented 2 years ago

Agreed it's not particularly useful in the context of feature counts. Happy to implement the normalisation argument, or we could suggest in the help that this normalisation could offer a useful method for comparing entropies across dfms.