Thanks @nluetts. I'll do my best to answer these questions, but some will need more investigation. First, I have not implemented "CDM". It doesn't seem much used in the literature and has largely (as far as I can tell) been superseded by PYM. I'll probably add it to the library at a later date if there is any demand for it.
I can't comment on the quality of the work in Rodriguez et al. and I have not tried to match against their results. I really only used it as a (partial) basis for what to implement. As far as possible, I went back to the authors' original implementations (where they existed) and compared output against them.
NSB is an interesting case, as the author's implementation gives different results from the Python implementation used by Rodriguez et al., which in turn gives different results from DiscreteEntropy.jl. I cannot claim with certainty that my version is correct, but it does most faithfully copy the maths as presented in the paper. NSB (and some of the other Bayesian estimators) uses integration: this could be a major source of divergence. To avoid the complex rewriting needed to prevent overflow (as found in other NSB implementations), I used Julia's arbitrary-precision floats. Integrating over these could easily introduce some of the deviation that you see. The question then becomes whether the lack of precision was considered by the authors when presenting their maths. My guess is that it was not, and therefore the deviations are due to floating-point imperfections. I noticed while writing the code that these add up very quickly.
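To illustrate the kind of overflow I mean (a toy example only, not the actual NSB code): gamma-like terms in the integrand blow past the Float64 range very quickly, which is why arbitrary precision is attractive here.
julia> using SpecialFunctions
julia> gamma(200.0)   # already overflows 64-bit floats
Inf
julia> gamma(big"200.0")   # representable as a BigFloat, roughly 3.94e372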
Also, at small sample sizes, entropy estimators vary tremendously from run to run; 1000 iterations may not be enough for stability at very small sizes. I note that as sample size increases, the tables converge. BN is interesting though. It is specifically for small datasets (I will add this information to the docs): divergence increases as more samples are added, which is the behaviour I think I would expect to see, and at small sizes we again have the stability problem. Looking at the code in frequentist.jl, I can't immediately see an error compared against the formula in the docs.
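For reference, here is a standalone transcription of the formula as I read it (assuming BN refers to the Bonachela et al. 2008 small-sample estimator; this is a sketch to check against, not the library code):
# Bonachela et al. (2008): H = 1/(N+2) * Σ_i (n_i + 1) * Σ_{j=n_i+2}^{N+2} 1/j
# `counts` holds the observed frequencies n_i; the result is in nats.
function bonachela_sketch(counts::Vector{Int})
    N = sum(counts)
    h = 0.0
    for n in counts
        h += (n + 1) * sum(1 / j for j in (n + 2):(N + 2))
    end
    return h / (N + 2)
end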
I will add a check for sample size < 64 on NSB, though that sounds more like a bug. I'm not sure what's happening with Unseen, please give me a few days to figure it out.
I've found the error in NSB: whenever the input has no coincidences (i.e. everything is seen only once), it fails. I need to figure out why that is happening and how to fix it. Thanks for the spot!
The original author's implementation of NSB returns NaN when there are no coincidences in the data; the Python implementation does not. I have opted to go with the author's decision and return NaN (a minimal illustration of that case is below). Unseen had a bug which has now been fixed. I've been unable to find any errors in the remaining estimators by comparing against other implementations, so I'm closing here, though the question remains open as to why Rodriguez reports some differing results.
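The no-coincidence case, illustrated (every symbol observed exactly once; the expected result is now NaN rather than an error):
julia> estimate_h(from_samples(svector([1, 2, 3, 4])), NSB)   # four samples, no repeats
NaN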
Hi @kellino
I ran your updated code against my script. For Unseen, the results are more realistic now, but still much worse than the other estimators and not what Contreras Rodríguez et al. report in their paper. It also sometimes fails:
[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project --load comparison.jl -e "main()"
###############################################################################
...
Running comparison for Unseen ...
Sample size: 8
Reference mean entropy: 5.202, DiscreteEntropy.jl: 0.06876407943033852
Sample size: 16
Reference mean entropy: 6.751, DiscreteEntropy.jl: 0.16074670482604692
Sample size: 32
Reference mean entropy: 7.739, DiscreteEntropy.jl: 0.3273587002393337
Sample size: 64
Reference mean entropy: 7.925, DiscreteEntropy.jl: 0.5499381518155609
Sample size: 128
Reference mean entropy: 7.945, DiscreteEntropy.jl: 0.9208798178505762
Sample size: 256
Reference mean entropy: 7.957, DiscreteEntropy.jl: 1.4577827444473084
Sample size: 512
Reference mean entropy: 7.95, DiscreteEntropy.jl: 2.162698415707029
Sample size: 1024
Reference mean entropy: 7.948, DiscreteEntropy.jl: 2.9027093645486657
Sample size: 2048
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 4096
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 8192
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 16384
Reference mean entropy: 7.971, DiscreteEntropy.jl: 4.91103173101616
###############################################################################
┌───────────┬───────┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │ 8 │ 16 │ 32 │ 64 │ 128 │ 256 │ 512 │ 1024 │ 2048 │ 4096 │ 8192 │ 16384 │
├───────────┼───────┼───────┼───────┼───────┼───────┼────────┼───────┼───────┼───────┼───────┼───────┼───────┤
...
│ Unseen │ 0.069 │ 0.161 │ 0.327 │ 0.55 │ 0.921 │ 1.458 │ 2.163 │ 2.903 │ NaN │ NaN │ NaN │ 4.911 │
...
└───────────┴───────┴───────┴───────┴───────┴───────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘
@nluetts, now this is getting interesting. I looked at the Rodriguez paper again and they state for Unseen (on page 4) "the authors propose to compute its value algorithmically". This is a strange sentence and makes me wonder if they computed the values manually (and perhaps got them wrong?). It's true that the Unseen source code doesn't seem to be available as a repo anywhere that I could find, but it is included as an appendix in the paper. I worked from this appendix.
As for the error when sample size gets too big... I'm not sure what the issue is there. Many of these estimators do come with quite strong assumptions, so perhaps Unseen has some upper limit. I'll investigate this further.
Also, with regard to the discrepancies for SG, SHR, JEF, LAP, MIN, BUB: maybe it would be a good idea to contact the authors and ask what could have happened here? Maybe it is just my testing implementation which is not exactly what they did, but then I wonder why the comparison succeeds for the remaining estimators.
I'll reach out, but I hope this submission won't be blocked waiting for their reply... there was an error in BUB which was recently fixed in a comment; this might reduce the discrepancy there. The others all seem to be estimators that are also found in the R library, so perhaps there was a mistake in their R code?
Before I forget it (again), you have to get rid of these two lines:
When I run my script, the output is quite enormous ...
Ouch, I must have left those in for debugging. I'll remove them now.
I'll reach out, but I hope this submission won't be blocked waiting for their reply... there was an error in BUB which was recently fixed in a comment; this might reduce the discrepancy there. The others all seem to be estimators that are also found in the R library, so perhaps there was a mistake in their R code?
No, this does not need to block the submission. Since the R library gives the same results, it should be fine.
Calls to print() removed in 6963a95cf9ba19ad8fc5e8405bef402e07dde25b
entropy does not seem to include automatic tests, though. At least I, as a non-R-programmer, am not able to spot any (https://github.com/cran/entropy/tree/master). But I assume they tested well enough manually.
@kellino OK, so my last open question is Unseen.
You said there was a bug which was fixed. After the fix, the negative entropy values disappeared and the results are more reasonable. The problem is that your test case did not catch this bug: it already succeeded before the bugfix. So this test alone does not seem to be sufficient.
I think the paper from Valiant & Valiant can be used for another test case, though perhaps not a quantitative one, since they only seem to report graphs, not numbers.
The first plot in their Figure 1 shows how the error of Unseen evolves with sample size for a discrete uniform distribution from 1 to 1000. In Julia, I can sample from this distribution like so:
julia> rand(1:1000, 100) # 100 samples
So to estimate the entropy:
julia> estimate_h(from_samples(svector(rand(1:1000, 100))), Estimator)
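For reference, the ground truth is easy to write down here:
julia> log(1000)   # entropy in nats of a uniform distribution over 1000 outcomes
6.907755278982137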
Since the distribution is uniform, its entropy is the natural log of the number of distinct items, so if I calculate $e$ to the power of the estimated entropy, I should get back the number of distinct items in the discrete distribution. If I oversample, the estimator should therefore have no problem finding that there are 1000 distinct numbers. And indeed:
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), PYM) |> exp
1007.6118832749003
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), NSB) |> exp
981.4373913617699
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), ChaoShen) |> exp
954.4725165955898
For fewer samples, the estimates deviate more, but are still close enough:
julia> using Statistics
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), PYM) |> exp for _ in 1:500)
1381.6990603254876
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), PYM) |> exp for _ in 1:500)
1379.908995113134
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), NSB) |> exp for _ in 1:500)
629.4125878097738
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), NSB) |> exp for _ in 1:500)
629.793778092218
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), ChaoShen) |> exp for _ in 1:500)
1094.844219101877
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), ChaoShen) |> exp for _ in 1:500)
1097.9925117939667
Now Unseen:
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.752808081795651
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.527995830000302
julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.812724640864127
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
2.7727911829598364
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
2.7721745172018806
This result is very far from the ground truth and far from the claim in the paper that the error of the estimate should approach zero for this distribution and 1000 samples (upper left plot in Figure 1).
Can you check your code again and test it against more examples with the original MATLAB code?
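Something along these lines could perhaps become an additional regression test (just a sketch; the tolerance, seed and iteration count are guesses on my part):
using Test, Random, Statistics, DiscreteEntropy
# Uniform distribution over 1:1000, sampled 1000 times: exp(H) should land near 1000.
@testset "Unseen, uniform support of size 1000" begin
    Random.seed!(1234)
    est = mean(exp(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen)) for _ in 1:100)
    @test isapprox(est, 1000; rtol = 0.1)
end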
[...] It's true that the Unseen source code doesn't seem to be available as a repo anywhere that I could find, but it is included as an appendix in the paper. I worked from this appendix. [...]
In case you have not seen it:
https://theory.stanford.edu/~valiant/code.html
(Could also be added as a reference to the paper.)
@nluetts I've added more tests to catch the previously uncaught bug, but I'm not sure what's happening with the most recent examples you've posted. It'll probably take me a few days to solve this, as time is a bit limited at the moment.
@nluetts a genuinely stupid bug that took me a while to find: it only revealed itself when the sample size / support size ratio became relatively large. This produces a much saner answer now.
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp for _ in 1:500)
990.5554343074119
@kellino Great! And it now also works quite well for fewer samples, as advertised in the paper:
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
987.5433915762957
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
987.8677858039398
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 100))), Unseen) |> exp for _ in 1:500)
1074.078845471124
julia> mean(estimate_h(from_samples(svector(rand(1:1000, 100))), Unseen) |> exp for _ in 1:500)
1027.9313289893134
Hi @kellino :wave:
this is Nils; I am opening this issue in response to the review over at JOSS (https://github.com/openjournals/joss-reviews/issues/7334).
From your JOSS paper draft I got the impression that the paper from Contreras Rodríguez et al. (https://doi.org/10.3390/e23050561) is quite central to your package, so I thought testing your package against the estimated entropies published there would be a good thing to do (this does not yet seem to be covered by your test cases, although they are quite comprehensive otherwise).
In the paper, they estimate entropies for byte sequences they generated on a Linux machine with /dev/urandom, so I did the same and tried to reproduce their Table 1 with DiscreteEntropy.jl; in outline, the comparison works as in the sketch below.
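(This is not my actual script: mean_entropy_bits is just an illustrative helper name, and the real script loops over all estimators and sample sizes.)
using DiscreteEntropy, Statistics

# Illustrative helper: mean estimated entropy, in bits, over `iters` independent
# draws of `n` bytes from /dev/urandom.
function mean_entropy_bits(estimator, n; iters = 1000)
    mean(1:iters) do _
        bytes = open(io -> read(io, n), "/dev/urandom")   # n raw bytes, values 0..255
        estimate_h(from_samples(svector(Int.(bytes))), estimator) / log(2)   # nats -> bits
    end
end

mean_entropy_bits(ChaoShen, 256)   # compare against the 256-sample column of Table 1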
I used this Julia environment:
And the following script to do the comparison:
The script generates the following output (scroll to the end to find the two relevant tables):
The first table contains the mean estimated entropies from DiscreteEntropy.jl, the second table the deviations from Table 1 in the paper (the mean is usually calculated from 1000 estimates; just for BUB and NSB I reduced this number because they were much slower than the other estimators). The rows are the different estimators and the columns the different sample sizes.
The entries which say NaN did not succeed to run. For CDM, I simply could not figure out what it corresponds to in your package, thus these values are missing; for the rest you can find the error messages in the output: NSB threw a Roots.ConvergenceFailed error for small sample sizes < 64 and Unseen threw a DimensionMismatch error for sample sizes > 32.
Otherwise you can see that the comparison succeeds for ChaoShen (CS), MaximumLikelihood (ML), MillerMadow (MM), Zhang, ChaoWangJost (CJ) and Schürmann (SHU). For the rest, however, there are noticeable deviations.
I saw in your test cases that you compare against the R package entropy, which works just fine, so I also tried using R to compare against the paper (in a reduced fashion, only for a sample size of 8 and the estimators which are available; I am not an R programmer ...):
Which yields:
These are (almost) the same values that DiscreteEntropy.jl produces! But they still disagree with the paper, of course.
It might well be that I made a mistake in the comparison. Can you check this and comment on the source of the deviations?