kellino / DiscreteEntropy.jl

Estimation of Shannon Entropy for Discrete Random Variables
MIT License

Testing against Contreras Rodríguez et al. 2021 #11

Closed: nluetts closed this issue 2 weeks ago

nluetts commented 3 weeks ago

Hi @kellino :wave:

this is Nils, I am opening this in response to the review over at JOSS (https://github.com/openjournals/joss-reviews/issues/7334).

From your JOSS paper draft I got the impression that the paper from Contreras Rodríguez et al. (https://doi.org/10.3390/e23050561) is quite central to your package, so I thought it would be a good idea to test your package against the estimated entropies published there (this does not yet seem to be covered by your test cases, although they are quite comprehensive otherwise).

In the paper, they estimate entropies for byte sequences they generated on a Linux machine with /dev/urandom, so I did the same and tried to reproduce their Table 1 with DiscreteEntropy.jl.

I used this Julia environment:

[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.5 (2024-08-27)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(DiscreteEntropy-JOSS-Review) pkg> status
Status `/var/home/nluetts/Repos/DiscreteEntropy-JOSS-Review/Project.toml`
  [940648b6] DiscreteEntropy v0.2.0
  [08abe8d2] PrettyTables v2.4.0
  [10745b16] Statistics v1.10.0

And the following script to do the comparison:

using DiscreteEntropy
using Statistics
using PrettyTables

function main()
    results = Dict{String,Vector{Float64}}()
    for est_label in keys(REF_VALUES)
        println("###############################################################################")
        println()
        est = ESTIMATORS[est_label]
        if isnothing(est)
            @warn "No estimator for \"$(est_label)\", skipping ..."
            # so that results table is not missing a row from the paper
            results[est_label] = repeat([NaN64], 12)
            continue
        end

        # fewer draws for slower estimators
        draws = est == BUB ? 100 : 1000
        draws = est == NSB ? 25 : draws

        println("Running comparison for $est ...")

        mean_entropies = Float64[]
        for exp in 3:14
            println("Sample size: $(2^exp)")
            try
                entropy = compare_contreras_rodriguez(est, exp; draws=draws)
                push!(mean_entropies, entropy)
                # index in list of entropy values = exp - 2
                # because Contreras Rodríguez et al. starts with
                # sample size 2^3 = 8
                ref_entropy = REF_VALUES[est_label][exp-2]
                println("Reference mean entropy: $ref_entropy, DiscreteEntropy.jl: $entropy")
            catch e
                println("ERROR running estimation: $e")
                push!(mean_entropies, NaN64)
            end
        end
        results[est_label] = mean_entropies
    end

    labels = SORTED_ESTIMATOR_LABELS
    n = length(labels)
    # since PrettyTables cannot transpose, we have
    # to do some work ourselves here ... 
    estimates = [round(results[k][exp-2], digits=3) for exp in 3:14 for k in labels]
    deviations = [
        round(results[k][exp-2] - REF_VALUES[k][exp-2], digits=2)
        for exp in 3:14 for k in labels
    ]
    raw_table_results = hcat(labels, reshape(estimates, n, :))
    raw_table_deviations = hcat(labels, reshape(deviations, n, :))
    header = vcat(["Estimator"], [2^n for n in 3:14])

    # print out tables
    pretty_table(raw_table_results; header=header)
    pretty_table(raw_table_deviations; header=header)
end

# From Contreras Rodríguez et al.,
# Entropy 2021, 23, 561. https://doi.org/10.3390/e23050561
# Table 2: "Estimated mean of estimates for byte samples."
REF_VALUES = Dict(
    "CS" => [6.142, 7.635, 8.251, 8.126, 8.064, 8.099, 8.116, 8.003, 7.914, 7.940, 7.963, 7.974],
    "ML" => [2.969, 3.943, 4.880, 5.768, 6.550, 7.178, 7.591, 7.809, 7.908, 7.955, 7.977, 7.989],
    "MM" => [3.590, 4.598, 5.536, 6.397, 7.113, 7.633, 7.902, 7.985, 7.998, 7.999, 8.000, 8.000],
    "SG" => [3.999, 4.470, 5.136, 5.884, 6.598, 7.194, 7.595, 7.809, 7.908, 7.955, 7.977, 7.989],
    "SHR" => [7.940, 7.971, 7.982, 7.991, 7.995, 7.998, 7.999, 7.999, 8.000, 8.000, 8.000, 8.000],
    "JEF" => [7.947, 7.902, 7.834, 7.751, 7.690, 7.696, 7.767, 7.854, 7.919, 7.957, 7.978, 7.989],
    "LAP" => [7.983, 7.969, 7.943, 7.904, 7.859, 7.832, 7.842, 7.883, 7.928, 7.960, 7.979, 7.989],
    "MIN" => [5.048, 5.385, 5.835, 6.357, 6.875, 7.326, 7.643, 7.823, 7.912, 7.956, 7.978, 7.989],
    "BUB" => [7.008, 7.021, 7.476, 7.732, 7.868, 8.061, 7.896, 7.971, 7.983, 7.985, 7.985, 7.985],
    "NSB" => [6.427, 7.065, 7.474, 7.726, 7.855, 7.922, 7.953, 7.969, 7.977, 7.981, 7.983, 7.984],
    "Unseen" => [5.202, 6.751, 7.739, 7.925, 7.945, 7.957, 7.950, 7.948, 7.953, 7.960, 7.966, 7.971],
    "Zhang" => [3.690, 4.696, 5.627, 6.478, 7.178, 7.674, 7.916, 7.979, 7.985, 7.985, 7.985, 7.985],
    "CJ" => [5.349, 7.028, 8.089, 8.093, 8.004, 7.996, 7.987, 7.985, 7.985, 7.985, 7.985, 7.985],
    "UnveilJ" => [3.036, 4.124, 5.316, 6.645, 7.749, 7.95, 7.855, 7.832, 7.894, 7.940, 7.963, 7.974],
    "BN" => [2.963, 4.433, 5.815, 6.981, 7.726, 7.889, 7.41, 6.662, 6.115, 5.829, 5.686, 5.616],
    "GSB" => [4.646, 5.613, 6.464, 7.182, 7.670, 7.921, 7.981, 7.985, 7.985, 7.985, 7.985, 7.985],
    "SHU" => [4.347, 5.329, 6.210, 6.978, 7.541, 7.868, 7.972, 7.985, 7.985, 7.985, 7.985, 7.985],
    "CDM" => [6.165, 6.908, 7.391, 7.689, 7.844, 7.922, 7.961, 7.981, 7.990, 7.995, 7.998, 7.999],
)

# same sorting as in paper (not preserved in Dict `REF_VALUES`)
SORTED_ESTIMATOR_LABELS = ["CS", "ML", "MM", "SG", "SHR", "JEF", "LAP", "MIN", "BUB", "NSB", "Unseen", "Zhang", "CJ", "UnveilJ", "BN", "GSB", "SHU", "CDM"]

ESTIMATORS = Dict(
    "CS" => ChaoShen,
    "ML" => MaximumLikelihood,
    "MM" => MillerMadow,
    "SG" => SchurmannGrassberger,
    "SHR" => Shrink,
    "JEF" => Jeffrey,
    "LAP" => LaPlace,
    "MIN" => Minimax,
    "BUB" => BUB,
    "NSB" => NSB,
    "Unseen" => Unseen,
    "Zhang" => Zhang,
    # Contreras Rodríguez et al. seem to have a typo here, CJ is not mentioned
    # in Table 1, only CWJ, so I am not sure this mapping is correct.
    "CJ" => ChaoWangJost,
    "UnveilJ" => JackknifeMLE,
    "BN" => Bonachela,
    "GSB" => Grassberger,
    "SHU" => Schurmann,
    "CDM" => nothing # Cannot figure out which one this is
)

function read_dev_urandom_bytes(n::UInt)
    bytes = open("/dev/urandom", "r") do io
        read(io, n)
    end
    bytes
end

read_dev_urandom_bytes(n) = read_dev_urandom_bytes(convert(UInt, n))

function compare_contreras_rodriguez(::Type{T}, bytes_exponent::UInt8; draws=1000) where {T<:AbstractEstimator}
    @assert 3 <= bytes_exponent <= 14 "No reference values available for sample size $(2^bytes_exponent)"
    sample_size = 2^bytes_exponent
    mean(
        estimate_h(read_dev_urandom_bytes(sample_size) |> svector |> from_samples, T)
        for _ in 1:draws
    ) |> to_bits
end

compare_contreras_rodriguez(::Type{T}, bytes_exponent; draws=1000) where {T<:AbstractEstimator} =
    compare_contreras_rodriguez(T, convert(UInt8, bytes_exponent), draws=draws)

The script generates the following output (scroll to the end to find the two relevant tables):

[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project --load comparison.jl -e "main()"
###############################################################################

Running comparison for BUB ...
Sample size: 8
Reference mean entropy: 7.008, DiscreteEntropy.jl: 3.614995850373481
Sample size: 16
Reference mean entropy: 7.021, DiscreteEntropy.jl: 4.7899950608286606
Sample size: 32
Reference mean entropy: 7.476, DiscreteEntropy.jl: 6.796319151130138
Sample size: 64
Reference mean entropy: 7.732, DiscreteEntropy.jl: 7.572799602723234
Sample size: 128
Reference mean entropy: 7.868, DiscreteEntropy.jl: 7.9792128580842085
Sample size: 256
Reference mean entropy: 8.061, DiscreteEntropy.jl: 8.058072331938295
Sample size: 512
Reference mean entropy: 7.896, DiscreteEntropy.jl: 8.001327375927534
Sample size: 1024
Reference mean entropy: 7.971, DiscreteEntropy.jl: 7.99473597067706
Sample size: 2048
Reference mean entropy: 7.983, DiscreteEntropy.jl: 8.000663690919412
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9938103137202425
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999977554911958
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000005950701526
###############################################################################

Running comparison for ChaoWangJost ...
Sample size: 8
Reference mean entropy: 5.349, DiscreteEntropy.jl: 5.360278389970844
Sample size: 16
Reference mean entropy: 7.028, DiscreteEntropy.jl: 7.041184521071105
Sample size: 32
Reference mean entropy: 8.089, DiscreteEntropy.jl: 8.118867698781537
Sample size: 64
Reference mean entropy: 8.093, DiscreteEntropy.jl: 8.08072337036545
Sample size: 128
Reference mean entropy: 8.004, DiscreteEntropy.jl: 8.01966232891067
Sample size: 256
Reference mean entropy: 7.996, DiscreteEntropy.jl: 8.008443797783935
Sample size: 512
Reference mean entropy: 7.987, DiscreteEntropy.jl: 8.000142338203254
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000489112501155
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.00003292715623
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999791811121944
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999968198405732
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9999736567289865
###############################################################################

Running comparison for Minimax ...
Sample size: 8
Reference mean entropy: 5.048, DiscreteEntropy.jl: 2.9761034489686033
Sample size: 16
Reference mean entropy: 5.385, DiscreteEntropy.jl: 3.9442625156600024
Sample size: 32
Reference mean entropy: 5.835, DiscreteEntropy.jl: 4.89343609826837
Sample size: 64
Reference mean entropy: 6.357, DiscreteEntropy.jl: 5.783872751201355
Sample size: 128
Reference mean entropy: 6.875, DiscreteEntropy.jl: 6.565523196303405
Sample size: 256
Reference mean entropy: 7.326, DiscreteEntropy.jl: 7.196160029923576
Sample size: 512
Reference mean entropy: 7.643, DiscreteEntropy.jl: 7.606140946314727
Sample size: 1024
Reference mean entropy: 7.823, DiscreteEntropy.jl: 7.819207596467573
Sample size: 2048
Reference mean entropy: 7.912, DiscreteEntropy.jl: 7.912123086208508
Sample size: 4096
Reference mean entropy: 7.956, DiscreteEntropy.jl: 7.956119491898474
Sample size: 8192
Reference mean entropy: 7.978, DiscreteEntropy.jl: 7.978118017008263
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988922826292928
###############################################################################

Running comparison for Grassberger ...
Sample size: 8
Reference mean entropy: 4.646, DiscreteEntropy.jl: 4.475872575973066
Sample size: 16
Reference mean entropy: 5.613, DiscreteEntropy.jl: 5.395986343698666
Sample size: 32
Reference mean entropy: 6.464, DiscreteEntropy.jl: 6.260004822058096
Sample size: 64
Reference mean entropy: 7.182, DiscreteEntropy.jl: 6.980796744618755
Sample size: 128
Reference mean entropy: 7.67, DiscreteEntropy.jl: 7.5182629313850935
Sample size: 256
Reference mean entropy: 7.921, DiscreteEntropy.jl: 7.827305281488433
Sample size: 512
Reference mean entropy: 7.981, DiscreteEntropy.jl: 7.961604461074284
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.995547205459536
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000232539142349
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000359491911881
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999967625418502
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9998609625723525
###############################################################################

Running comparison for Jeffrey ...
Sample size: 8
Reference mean entropy: 7.947, DiscreteEntropy.jl: 2.972254402351613
Sample size: 16
Reference mean entropy: 7.902, DiscreteEntropy.jl: 3.9547463784273966
Sample size: 32
Reference mean entropy: 7.834, DiscreteEntropy.jl: 4.8988101907857375
Sample size: 64
Reference mean entropy: 7.751, DiscreteEntropy.jl: 5.796691217187817
Sample size: 128
Reference mean entropy: 7.69, DiscreteEntropy.jl: 6.595211026669902
Sample size: 256
Reference mean entropy: 7.696, DiscreteEntropy.jl: 7.2428089132809506
Sample size: 512
Reference mean entropy: 7.767, DiscreteEntropy.jl: 7.6545957368144855
Sample size: 1024
Reference mean entropy: 7.854, DiscreteEntropy.jl: 7.844389558269971
Sample size: 2048
Reference mean entropy: 7.919, DiscreteEntropy.jl: 7.918733289952589
Sample size: 4096
Reference mean entropy: 7.957, DiscreteEntropy.jl: 7.957507646842208
Sample size: 8192
Reference mean entropy: 7.978, DiscreteEntropy.jl: 7.978231591590087
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988930604764357
###############################################################################

Running comparison for LaPlace ...
Sample size: 8
Reference mean entropy: 7.983, DiscreteEntropy.jl: 2.9736909829661333
Sample size: 16
Reference mean entropy: 7.969, DiscreteEntropy.jl: 3.9533018294946505
Sample size: 32
Reference mean entropy: 7.943, DiscreteEntropy.jl: 4.903365419183482
Sample size: 64
Reference mean entropy: 7.904, DiscreteEntropy.jl: 5.808311331087142
Sample size: 128
Reference mean entropy: 7.859, DiscreteEntropy.jl: 6.621930848735704
Sample size: 256
Reference mean entropy: 7.832, DiscreteEntropy.jl: 7.27588208048709
Sample size: 512
Reference mean entropy: 7.842, DiscreteEntropy.jl: 7.692674035956105
Sample size: 1024
Reference mean entropy: 7.883, DiscreteEntropy.jl: 7.869540223819323
Sample size: 2048
Reference mean entropy: 7.928, DiscreteEntropy.jl: 7.927973374245435
Sample size: 4096
Reference mean entropy: 7.96, DiscreteEntropy.jl: 7.959961271968038
Sample size: 8192
Reference mean entropy: 7.979, DiscreteEntropy.jl: 7.9786682763551084
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.989102788616696
###############################################################################

Running comparison for Shrink ...
Sample size: 8
Reference mean entropy: 7.94, DiscreteEntropy.jl: 2.9808091479101586
Sample size: 16
Reference mean entropy: 7.971, DiscreteEntropy.jl: 3.953894179273727
Sample size: 32
Reference mean entropy: 7.982, DiscreteEntropy.jl: 4.911835316261442
Sample size: 64
Reference mean entropy: 7.991, DiscreteEntropy.jl: 5.821070473660774
Sample size: 128
Reference mean entropy: 7.995, DiscreteEntropy.jl: 6.655003767983381
Sample size: 256
Reference mean entropy: 7.998, DiscreteEntropy.jl: 7.329850657777553
Sample size: 512
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.758420052803192
Sample size: 1024
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.929509738707598
Sample size: 2048
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.966671297879019
Sample size: 4096
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.983602410097988
Sample size: 8192
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.992357342562944
Sample size: 16384
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.9966348604391735
###############################################################################

Running comparison for SchurmannGrassberger ...
Sample size: 8
Reference mean entropy: 3.999, DiscreteEntropy.jl: 2.971736967454633
Sample size: 16
Reference mean entropy: 4.47, DiscreteEntropy.jl: 3.944955929980586
Sample size: 32
Reference mean entropy: 5.136, DiscreteEntropy.jl: 4.8825236480803875
Sample size: 64
Reference mean entropy: 5.884, DiscreteEntropy.jl: 5.7679317080531565
Sample size: 128
Reference mean entropy: 6.598, DiscreteEntropy.jl: 6.549177687335087
Sample size: 256
Reference mean entropy: 7.194, DiscreteEntropy.jl: 7.176950412445397
Sample size: 512
Reference mean entropy: 7.595, DiscreteEntropy.jl: 7.591266466353353
Sample size: 1024
Reference mean entropy: 7.809, DiscreteEntropy.jl: 7.808764456449525
Sample size: 2048
Reference mean entropy: 7.908, DiscreteEntropy.jl: 7.908099032251887
Sample size: 4096
Reference mean entropy: 7.955, DiscreteEntropy.jl: 7.954910030058388
Sample size: 8192
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.977386074854952
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988781719453369
###############################################################################

Running comparison for Zhang ...
Sample size: 8
Reference mean entropy: 3.69, DiscreteEntropy.jl: 3.70057718590882
Sample size: 16
Reference mean entropy: 4.696, DiscreteEntropy.jl: 4.708114791136625
Sample size: 32
Reference mean entropy: 5.627, DiscreteEntropy.jl: 5.640825481842562
Sample size: 64
Reference mean entropy: 6.478, DiscreteEntropy.jl: 6.489832547884759
Sample size: 128
Reference mean entropy: 7.178, DiscreteEntropy.jl: 7.185693307022013
Sample size: 256
Reference mean entropy: 7.674, DiscreteEntropy.jl: 7.685618818498393
Sample size: 512
Reference mean entropy: 7.916, DiscreteEntropy.jl: 7.931552076659486
Sample size: 1024
Reference mean entropy: 7.979, DiscreteEntropy.jl: 7.995478619546022
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.00011685326433
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.0002027748647
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999933474194657
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999973102305582
###############################################################################

Running comparison for Bonachela ...
Sample size: 8
Reference mean entropy: 2.963, DiscreteEntropy.jl: 3.258906518604906
Sample size: 16
Reference mean entropy: 4.433, DiscreteEntropy.jl: 5.008529414146986
Sample size: 32
Reference mean entropy: 5.815, DiscreteEntropy.jl: 6.828142608055729
Sample size: 64
Reference mean entropy: 6.981, DiscreteEntropy.jl: 8.479980699266957
Sample size: 128
Reference mean entropy: 7.726, DiscreteEntropy.jl: 9.7162166211824
Sample size: 256
Reference mean entropy: 7.889, DiscreteEntropy.jl: 10.242244993398158
Sample size: 512
Reference mean entropy: 7.41, DiscreteEntropy.jl: 9.956818207508285
Sample size: 1024
Reference mean entropy: 6.662, DiscreteEntropy.jl: 9.221113291455794
Sample size: 2048
Reference mean entropy: 6.115, DiscreteEntropy.jl: 8.632933243927424
Sample size: 4096
Reference mean entropy: 5.829, DiscreteEntropy.jl: 8.316754014736388
Sample size: 8192
Reference mean entropy: 5.686, DiscreteEntropy.jl: 8.158346741992037
Sample size: 16384
Reference mean entropy: 5.616, DiscreteEntropy.jl: 8.079233029603056
###############################################################################

Running comparison for MillerMadow ...
Sample size: 8
Reference mean entropy: 3.59, DiscreteEntropy.jl: 3.592985854164966
Sample size: 16
Reference mean entropy: 4.598, DiscreteEntropy.jl: 4.5973992117016325
Sample size: 32
Reference mean entropy: 5.536, DiscreteEntropy.jl: 5.5378266814833585
Sample size: 64
Reference mean entropy: 6.397, DiscreteEntropy.jl: 6.3944223125125115
Sample size: 128
Reference mean entropy: 7.113, DiscreteEntropy.jl: 7.116156496268463
Sample size: 256
Reference mean entropy: 7.633, DiscreteEntropy.jl: 7.6320174448968015
Sample size: 512
Reference mean entropy: 7.902, DiscreteEntropy.jl: 7.903183965765673
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.984768837092095
Sample size: 2048
Reference mean entropy: 7.998, DiscreteEntropy.jl: 7.99727572744489
Sample size: 4096
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.999653223504991
Sample size: 8192
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.9998848638654705
Sample size: 16384
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.999995299779101
###############################################################################

Running comparison for MaximumLikelihood ...
Sample size: 8
Reference mean entropy: 2.969, DiscreteEntropy.jl: 2.97425000000003
Sample size: 16
Reference mean entropy: 3.943, DiscreteEntropy.jl: 3.9426390976555274
Sample size: 32
Reference mean entropy: 4.88, DiscreteEntropy.jl: 4.877901691010165
Sample size: 64
Reference mean entropy: 5.768, DiscreteEntropy.jl: 5.767319630991382
Sample size: 128
Reference mean entropy: 6.55, DiscreteEntropy.jl: 6.5509600442318
Sample size: 256
Reference mean entropy: 7.178, DiscreteEntropy.jl: 7.1746188465355125
Sample size: 512
Reference mean entropy: 7.591, DiscreteEntropy.jl: 7.591458230569615
Sample size: 1024
Reference mean entropy: 7.809, DiscreteEntropy.jl: 7.808054042355388
Sample size: 2048
Reference mean entropy: 7.908, DiscreteEntropy.jl: 7.907547838590217
Sample size: 4096
Reference mean entropy: 7.955, DiscreteEntropy.jl: 7.954778160636422
Sample size: 8192
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.977414005402187
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988727784134597
###############################################################################

Running comparison for Schurmann ...
Sample size: 8
Reference mean entropy: 4.347, DiscreteEntropy.jl: 4.406152515153422
Sample size: 16
Reference mean entropy: 5.329, DiscreteEntropy.jl: 5.371441079483403
Sample size: 32
Reference mean entropy: 6.21, DiscreteEntropy.jl: 6.258362134412951
Sample size: 64
Reference mean entropy: 6.978, DiscreteEntropy.jl: 7.009485345023251
Sample size: 128
Reference mean entropy: 7.541, DiscreteEntropy.jl: 7.571587134463173
Sample size: 256
Reference mean entropy: 7.868, DiscreteEntropy.jl: 7.890860309539617
Sample size: 512
Reference mean entropy: 7.972, DiscreteEntropy.jl: 7.988336933785518
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.99878737682248
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000104914446608
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.99987694067543
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000120087586403
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000031759411705
###############################################################################

Running comparison for ChaoShen ...
Sample size: 8
Reference mean entropy: 6.142, DiscreteEntropy.jl: 6.175011056668875
Sample size: 16
Reference mean entropy: 7.635, DiscreteEntropy.jl: 7.625722852594498
Sample size: 32
Reference mean entropy: 8.251, DiscreteEntropy.jl: 8.324936762316629
Sample size: 64
Reference mean entropy: 8.126, DiscreteEntropy.jl: 8.130934934222017
Sample size: 128
Reference mean entropy: 8.064, DiscreteEntropy.jl: 8.0807378701635
Sample size: 256
Reference mean entropy: 8.099, DiscreteEntropy.jl: 8.114032732196659
Sample size: 512
Reference mean entropy: 8.116, DiscreteEntropy.jl: 8.130757598494167
Sample size: 1024
Reference mean entropy: 8.003, DiscreteEntropy.jl: 8.017750504080228
Sample size: 2048
Reference mean entropy: 7.914, DiscreteEntropy.jl: 7.929019945213538
Sample size: 4096
Reference mean entropy: 7.94, DiscreteEntropy.jl: 7.9546912060882935
Sample size: 8192
Reference mean entropy: 7.963, DiscreteEntropy.jl: 7.977376272281654
Sample size: 16384
Reference mean entropy: 7.974, DiscreteEntropy.jl: 7.988768967507508
###############################################################################

Running comparison for NSB ...
Sample size: 8
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 16
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 32
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 64
Reference mean entropy: 7.726, DiscreteEntropy.jl: 5.819025242186299
Sample size: 128
Reference mean entropy: 7.855, DiscreteEntropy.jl: 6.650152910029301
Sample size: 256
Reference mean entropy: 7.922, DiscreteEntropy.jl: 7.327239500418171
Sample size: 512
Reference mean entropy: 7.953, DiscreteEntropy.jl: 7.75411110054627
Sample size: 1024
Reference mean entropy: 7.969, DiscreteEntropy.jl: 7.940917656786869
Sample size: 2048
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.967618652904784
Sample size: 4096
Reference mean entropy: 7.981, DiscreteEntropy.jl: 7.975301195389316
Sample size: 8192
Reference mean entropy: 7.983, DiscreteEntropy.jl: 7.983648977913771
Sample size: 16384
Reference mean entropy: 7.984, DiscreteEntropy.jl: 7.990635829134803
###############################################################################

Running comparison for JackknifeMLE ...
Sample size: 8
Reference mean entropy: 3.036, DiscreteEntropy.jl: 2.7853946642449614
Sample size: 16
Reference mean entropy: 4.124, DiscreteEntropy.jl: 3.8558508962325395
Sample size: 32
Reference mean entropy: 5.316, DiscreteEntropy.jl: 4.835834971119449
Sample size: 64
Reference mean entropy: 6.645, DiscreteEntropy.jl: 5.745386104854185
Sample size: 128
Reference mean entropy: 7.749, DiscreteEntropy.jl: 6.540863398829251
Sample size: 256
Reference mean entropy: 7.95, DiscreteEntropy.jl: 7.1722380222088455
Sample size: 512
Reference mean entropy: 7.855, DiscreteEntropy.jl: 7.589458477649767
Sample size: 1024
Reference mean entropy: 7.832, DiscreteEntropy.jl: 7.809161934211856
Sample size: 2048
Reference mean entropy: 7.894, DiscreteEntropy.jl: 7.907587677473503
Sample size: 4096
Reference mean entropy: 7.94, DiscreteEntropy.jl: 7.95460261651072
Sample size: 8192
Reference mean entropy: 7.963, DiscreteEntropy.jl: 7.977504117231058
Sample size: 16384
Reference mean entropy: 7.974, DiscreteEntropy.jl: 7.9887301921698
###############################################################################

Running comparison for Unseen ...
Sample size: 8
Reference mean entropy: 5.202, DiscreteEntropy.jl: -1.2256381298995431
Sample size: 16
Reference mean entropy: 6.751, DiscreteEntropy.jl: -0.8932659978524178
Sample size: 32
Reference mean entropy: 7.739, DiscreteEntropy.jl: -0.8418141110014484
Sample size: 64
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
Sample size: 128
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
Sample size: 256
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 512
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 6")
Sample size: 1024
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 9")
Sample size: 2048
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 8")
Sample size: 4096
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 8192
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 16384
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
###############################################################################

┌ Warning: No estimator for "CDM", skipping ...
└ @ Main /var/home/nluetts/Repos/DiscreteEntropy-JOSS-Review/comparison.jl:12
┌───────────┬────────┬────────┬────────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │      8 │     16 │     32 │    64 │   128 │    256 │   512 │  1024 │  2048 │  4096 │  8192 │ 16384 │
├───────────┼────────┼────────┼────────┼───────┼───────┼────────┼───────┼───────┼───────┼───────┼───────┼───────┤
│        CS │  6.175 │  7.626 │  8.325 │ 8.131 │ 8.081 │  8.114 │ 8.131 │ 8.018 │ 7.929 │ 7.955 │ 7.977 │ 7.989 │
│        ML │  2.974 │  3.943 │  4.878 │ 5.767 │ 6.551 │  7.175 │ 7.591 │ 7.808 │ 7.908 │ 7.955 │ 7.977 │ 7.989 │
│        MM │  3.593 │  4.597 │  5.538 │ 6.394 │ 7.116 │  7.632 │ 7.903 │ 7.985 │ 7.997 │   8.0 │   8.0 │   8.0 │
│        SG │  2.972 │  3.945 │  4.883 │ 5.768 │ 6.549 │  7.177 │ 7.591 │ 7.809 │ 7.908 │ 7.955 │ 7.977 │ 7.989 │
│       SHR │  2.981 │  3.954 │  4.912 │ 5.821 │ 6.655 │   7.33 │ 7.758 │  7.93 │ 7.967 │ 7.984 │ 7.992 │ 7.997 │
│       JEF │  2.972 │  3.955 │  4.899 │ 5.797 │ 6.595 │  7.243 │ 7.655 │ 7.844 │ 7.919 │ 7.958 │ 7.978 │ 7.989 │
│       LAP │  2.974 │  3.953 │  4.903 │ 5.808 │ 6.622 │  7.276 │ 7.693 │  7.87 │ 7.928 │  7.96 │ 7.979 │ 7.989 │
│       MIN │  2.976 │  3.944 │  4.893 │ 5.784 │ 6.566 │  7.196 │ 7.606 │ 7.819 │ 7.912 │ 7.956 │ 7.978 │ 7.989 │
│       BUB │  3.615 │   4.79 │  6.796 │ 7.573 │ 7.979 │  8.058 │ 8.001 │ 7.995 │ 8.001 │ 7.994 │   8.0 │   8.0 │
│       NSB │    NaN │    NaN │    NaN │ 5.819 │  6.65 │  7.327 │ 7.754 │ 7.941 │ 7.968 │ 7.975 │ 7.984 │ 7.991 │
│    Unseen │ -1.226 │ -0.893 │ -0.842 │   NaN │   NaN │    NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
│     Zhang │  3.701 │  4.708 │  5.641 │  6.49 │ 7.186 │  7.686 │ 7.932 │ 7.995 │   8.0 │   8.0 │   8.0 │   8.0 │
│        CJ │   5.36 │  7.041 │  8.119 │ 8.081 │  8.02 │  8.008 │   8.0 │   8.0 │   8.0 │   8.0 │   8.0 │   8.0 │
│   UnveilJ │  2.785 │  3.856 │  4.836 │ 5.745 │ 6.541 │  7.172 │ 7.589 │ 7.809 │ 7.908 │ 7.955 │ 7.978 │ 7.989 │
│        BN │  3.259 │  5.009 │  6.828 │  8.48 │ 9.716 │ 10.242 │ 9.957 │ 9.221 │ 8.633 │ 8.317 │ 8.158 │ 8.079 │
│       GSB │  4.476 │  5.396 │   6.26 │ 6.981 │ 7.518 │  7.827 │ 7.962 │ 7.996 │   8.0 │   8.0 │   8.0 │   8.0 │
│       SHU │  4.406 │  5.371 │  6.258 │ 7.009 │ 7.572 │  7.891 │ 7.988 │ 7.999 │   8.0 │   8.0 │   8.0 │   8.0 │
│       CDM │    NaN │    NaN │    NaN │   NaN │   NaN │    NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
└───────────┴────────┴────────┴────────┴───────┴───────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘
┌───────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │     8 │    16 │    32 │    64 │   128 │   256 │   512 │  1024 │  2048 │  4096 │  8192 │ 16384 │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│        CS │  0.03 │ -0.01 │  0.07 │   0.0 │  0.02 │  0.02 │  0.01 │  0.01 │  0.02 │  0.01 │  0.01 │  0.01 │
│        ML │  0.01 │  -0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │
│        MM │   0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │
│        SG │ -1.03 │ -0.53 │ -0.25 │ -0.12 │ -0.05 │ -0.02 │  -0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │
│       SHR │ -4.96 │ -4.02 │ -3.07 │ -2.17 │ -1.34 │ -0.67 │ -0.24 │ -0.07 │ -0.03 │ -0.02 │ -0.01 │  -0.0 │
│       JEF │ -4.97 │ -3.95 │ -2.94 │ -1.95 │ -1.09 │ -0.45 │ -0.11 │ -0.01 │  -0.0 │   0.0 │   0.0 │  -0.0 │
│       LAP │ -5.01 │ -4.02 │ -3.04 │  -2.1 │ -1.24 │ -0.56 │ -0.15 │ -0.01 │  -0.0 │  -0.0 │  -0.0 │   0.0 │
│       MIN │ -2.07 │ -1.44 │ -0.94 │ -0.57 │ -0.31 │ -0.13 │ -0.04 │  -0.0 │   0.0 │   0.0 │   0.0 │  -0.0 │
│       BUB │ -3.39 │ -2.23 │ -0.68 │ -0.16 │  0.11 │  -0.0 │  0.11 │  0.02 │  0.02 │  0.01 │  0.01 │  0.02 │
│       NSB │   NaN │   NaN │   NaN │ -1.91 │  -1.2 │ -0.59 │  -0.2 │ -0.03 │ -0.01 │ -0.01 │   0.0 │  0.01 │
│    Unseen │ -6.43 │ -7.64 │ -8.58 │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
│     Zhang │  0.01 │  0.01 │  0.01 │  0.01 │  0.01 │  0.01 │  0.02 │  0.02 │  0.02 │  0.02 │  0.01 │  0.01 │
│        CJ │  0.01 │  0.01 │  0.03 │ -0.01 │  0.02 │  0.01 │  0.01 │  0.02 │  0.02 │  0.01 │  0.01 │  0.01 │
│   UnveilJ │ -0.25 │ -0.27 │ -0.48 │  -0.9 │ -1.21 │ -0.78 │ -0.27 │ -0.02 │  0.01 │  0.01 │  0.01 │  0.01 │
│        BN │   0.3 │  0.58 │  1.01 │   1.5 │  1.99 │  2.35 │  2.55 │  2.56 │  2.52 │  2.49 │  2.47 │  2.46 │
│       GSB │ -0.17 │ -0.22 │  -0.2 │  -0.2 │ -0.15 │ -0.09 │ -0.02 │  0.01 │  0.02 │  0.02 │  0.01 │  0.01 │
│       SHU │  0.06 │  0.04 │  0.05 │  0.03 │  0.03 │  0.02 │  0.02 │  0.01 │  0.02 │  0.01 │  0.02 │  0.02 │
│       CDM │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
└───────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

The first table shows the mean estimated entropies from DiscreteEntropy.jl; the second shows the deviations from Table 1 of the paper. Each mean is usually calculated from 1000 estimates; only for BUB and NSB did I reduce this number, because they were much slower than the other estimators. The rows are the different estimators and the columns the different sample sizes.

The NaN entries correspond to runs that did not succeed. For CDM, I simply could not figure out which estimator it corresponds to in your package, so these values are missing; for the rest, you can find the error messages in the output:

NSB threw a Roots.ConvergenceFailed error for sample sizes < 64 and Unseen threw a DimensionMismatch error for sample sizes > 32.

Otherwise you can see that the comparison succeeds for ChaoShen (CS), MaximumLikelihood (ML), MillerMadow (MM), Zhang, ChaoWangJost (CJ) and Schürmann (SHU). For the rest, however, there are noticeable deviations.

I saw in your test cases that you compare against the R package entropy, which works just fine, so I also tried using R to compare against the paper (in a reduced fashion: only for a sample size of 8 and only for the estimators that are available there; I am not an R programmer ...):

library(entropy)

read_random_bytes <- function(n) {
  conn <- file("/dev/urandom", open = "rb")
  random_bytes <- readBin(conn, what = "raw", n = n)
  close(conn)
  return(random_bytes)
}

run <- function(draws) {
    for (m in c("ML", "MM", "Jeffreys", "Laplace", "SG", "minimax", "CS", "shrink")) {
        # `table()` does a frequency count
        vals <- sapply(1:draws, function(x) entropy(table(as.integer(read_random_bytes(8))), method = m, unit="log2"))
        print(paste(m, mean(vals)))
    }
}

suppressWarnings(run(1000))

Which yields:

[1] "ML 2.9735"
[1] "MM 3.59808838076576"
[1] "Jeffreys 2.97549971378874"
[1] "Laplace 2.97594394836141"
[1] "SG 2.97039231013123"
[1] "minimax 2.97438154499471"
[1] "CS 6.16299121698367"
[1] "shrink 2.97863055736597"

These are (almost) the same values that DiscreteEntropy.jl produces! But they still disagree with the paper, of course.

It might well be that I made a mistake in the comparison. Can you check this and comment on the source of the deviations?

kellino commented 3 weeks ago

Thanks @nluetts. I'll do my best to answer these questions, but some will need more investigation. First, I have not implemented "CDM". It doesn't seem much used in the literature and has largely (as far as I can tell) been superseded by PYM. I'll probably add it to the library at a later date if there is any demand for it.

I can't comment on the quality of the work in Rodriguez et al., and I have not tried to match their results. I really only used the paper as a (partial) basis for what to implement. As far as possible, I went back to the authors' original implementations (where they existed) and compared output against them.

NSB is an interesting case, as the author's implementation gives different results from the Python implementation used by Rodriguez et al., which in turn gives different results from DiscreteEntropy.jl. I cannot claim with certainty that my version is correct, but it does most faithfully copy the maths as presented in the paper. NSB (and some of the other Bayesian estimators) uses integration: this could be a major source of divergence. To avoid complex rewriting to prevent overflow (as found in other NSB implementations), I used Julia's arbitrary-precision floats. Integrating over these could easily introduce some of the deviation that you see. The question then becomes whether the lack of precision was considered by the authors when presenting their maths. My guess is that it was not, and therefore the deviations are due to floating-point imperfections. I noticed while writing the code that these add up very quickly.
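
To illustrate the kind of problem I mean, here is a minimal sketch (using SpecialFunctions.jl, not the actual NSB code in DiscreteEntropy.jl): integrands like NSB's involve ratios of Gamma functions, and these overflow Float64 long before the ratios themselves become large.

using SpecialFunctions

# Γ(n + 1/2) / Γ(n) grows only like sqrt(n), but both Gamma values
# individually overflow Float64 once n exceeds roughly 171.
gamma_ratio(n, T) = gamma(T(n) + T(1) / 2) / gamma(T(n))

gamma_ratio(200, Float64)   # NaN: this evaluates Inf / Inf
gamma_ratio(200, BigFloat)  # ≈ 14.133, close to sqrt(200), as expected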

Also, at small sample sizes, entropy estimators vary tremendously from run to run; 1000 iterations may not be enough for stability at very small sizes (see the sketch below). I note that as sample size increases, the tables converge. BN is interesting, though. It is designed specifically for small datasets (I will add this information to the docs): divergence increases as more samples are added, which is, I think, the behaviour I would expect to see, and at small sizes we again have the stability problem. Looking at the code in frequentist.jl, I can't immediately see an error compared against the formula in the docs.
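
As a rough illustration of the run-to-run problem, here is a sketch of my own (it uses Julia's built-in RNG in place of /dev/urandom, and MaximumLikelihood simply because it is the cheapest estimator):

using DiscreteEntropy, Statistics

# Repeat the whole 1000-draw experiment at sample size 8 and watch the
# "mean entropy" itself move between repetitions.
mean_h() = mean(
    estimate_h(from_samples(svector(rand(UInt8, 8))), MaximumLikelihood)
    for _ in 1:1000
) |> to_bits

mean_h()  # successive calls already differ in the third decimal place,
mean_h()  # which is the precision at which the paper reports its values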

I will add a check for sample sizes < 64 on NSB, though that sounds more like a bug. I'm not sure what's happening with Unseen; please give me a few days to figure it out.

kellino commented 3 weeks ago

I've found the error in NSB: whenever the input has no coincidences (i.e. every symbol is seen only once), it fails. I need to figure out why that is happening and how to fix it. Thanks for the spot!

kellino commented 2 weeks ago

The original author's implementation of NSB returns NaN when there are no coincidences in the data; the Python implementation does not. I have opted to go with the author's decision and return NaN (see the check below). Unseen had a bug which has now been fixed. I've been unable to find any errors in the remaining estimators by comparing against other implementations, so I'm closing here, though the question remains open as to why Rodriguez reports some differing results.
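
A quick check of the new no-coincidence behaviour (a minimal example; collect(1:8) is just a stand-in sample in which every symbol occurs exactly once):

julia> using DiscreteEntropy

julia> estimate_h(from_samples(svector(collect(1:8))), NSB)
NaN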

nluetts commented 1 week ago

Hi @kellino

I ran your updated code against my script. For Unseen, the results are more realistic now, but still much worse than those of the other estimators and not what Contreras Rodríguez et al. report in their paper. It also fails sometimes:

[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project --load comparison.jl -e "main()"
###############################################################################

...

Running comparison for Unseen ...
Sample size: 8
Reference mean entropy: 5.202, DiscreteEntropy.jl: 0.06876407943033852
Sample size: 16
Reference mean entropy: 6.751, DiscreteEntropy.jl: 0.16074670482604692
Sample size: 32
Reference mean entropy: 7.739, DiscreteEntropy.jl: 0.3273587002393337
Sample size: 64
Reference mean entropy: 7.925, DiscreteEntropy.jl: 0.5499381518155609
Sample size: 128
Reference mean entropy: 7.945, DiscreteEntropy.jl: 0.9208798178505762
Sample size: 256
Reference mean entropy: 7.957, DiscreteEntropy.jl: 1.4577827444473084
Sample size: 512
Reference mean entropy: 7.95, DiscreteEntropy.jl: 2.162698415707029
Sample size: 1024
Reference mean entropy: 7.948, DiscreteEntropy.jl: 2.9027093645486657
Sample size: 2048
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 4096
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 8192
Error: unable to factorize the basis matrix (1)
ERROR running estimation: MathOptInterface.ResultIndexBoundsError{MathOptInterface.VariablePrimal}(MathOptInterface.VariablePrimal(1), 0)
Sample size: 16384
Reference mean entropy: 7.971, DiscreteEntropy.jl: 4.91103173101616
###############################################################################

┌───────────┬───────┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │     8 │    16 │    32 │    64 │   128 │    256 │   512 │  1024 │  2048 │  4096 │  8192 │ 16384 │
├───────────┼───────┼───────┼───────┼───────┼───────┼────────┼───────┼───────┼───────┼───────┼───────┼───────┤
...
│    Unseen │ 0.069 │ 0.161 │ 0.327 │  0.55 │ 0.921 │  1.458 │ 2.163 │ 2.903 │   NaN │   NaN │   NaN │ 4.911 │
...
└───────────┴───────┴───────┴───────┴───────┴───────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘

kellino commented 1 week ago

@nluetts, now this is getting interesting. I looked at the Rodriguez paper again and they state for Unseen (on page 4) that "the authors propose to compute its value algorithmically". This is a strange sentence and makes me wonder if they computed values manually (and perhaps got them wrong?). It's true that the Unseen source code doesn't seem to be available as a repo anywhere that I could find, but it is included as an appendix in the paper. I worked from this appendix.

As for the error when the sample size gets too big ... I'm not sure what the issue is there. Many of these estimators do come with quite strong assumptions, so perhaps Unseen has some upper limit. I'll investigate this further.

nluetts commented 1 week ago

Also, with regard to the discrepancies for SG, SHR, JEF, LAP, MIN, and BUB: maybe it would be a good idea to contact the authors and ask them what could have happened here? Maybe it is just my testing implementation that does not exactly match what they did, but then I wonder why the comparison succeeds for the remaining estimators.

kellino commented 1 week ago

I'll reach out, but I hope this submission won't be blocked waiting for their reply... There was an error in BUB which was recently fixed in a commit; this might reduce the discrepancy there. The other affected estimators all seem to be ones also found in the R library, so perhaps there was a mistake in their R code?

nluetts commented 1 week ago

Before I forget it (again), you have to get rid of these two lines:

https://github.com/kellino/DiscreteEntropy.jl/blob/44590e710a6827b692f3d8bb8cf8c5b326277558/src/Core/countdata.jl#L161

https://github.com/kellino/DiscreteEntropy.jl/blob/44590e710a6827b692f3d8bb8cf8c5b326277558/src/Core/countdata.jl#L163

When I run my script, the output is quite enormous ...

kellino commented 1 week ago

Ouch, I must have left those in for debugging. I'll remove them now.

nluetts commented 1 week ago

I'll reach out, but I hope this submission won't be blocked waiting for their reply... There was an error in BUB which was recently fixed in a commit; this might reduce the discrepancy there. The other affected estimators all seem to be ones also found in the R library, so perhaps there was a mistake in their R code?

No, this does not need to block the submission. Since the R library gives the same results, it should be fine.

kellino commented 1 week ago

Calls to print() removed in 6963a95cf9ba19ad8fc5e8405bef402e07dde25b

nluetts commented 1 week ago

The entropy package does not seem to include automated tests, though. At least I, as a non-R-programmer, am not able to spot any (https://github.com/cran/entropy/tree/master). But I assume they tested well enough manually.

nluetts commented 1 week ago

@kellino OK, so my last open question is Unseen.

You said there was a bug which was fixed. After the fix, the negative entropy values disappeared and the results are more reasonable. The problem is that your test case

https://github.com/kellino/DiscreteEntropy.jl/blob/6963a95cf9ba19ad8fc5e8405bef402e07dde25b/test/estimator_test.jl#L24

did not catch this bug: it already succeeded before the bugfix. So this test alone does not seem to be sufficient.

I think the paper from Valiant & Valiant can be used for another test case, though perhaps not a quantitative one, since they only seem to report graphs, not numbers.

The first plot in their Figure 1 shows how the error of Unseen evolves with sample size for a discrete uniform distribution from 1 to 1000. In Julia, I can sample from this distribution like so:

julia> rand(1:1000, 100) # 100 samples

So to estimate the entropy:

julia> estimate_h(from_samples(svector(rand(1:1000, 100))), Estimator)

If I calculate $e$ to the power of the entropy, I should get back the number of distinct items in the distribution: for a uniform distribution over $N$ outcomes, $H = \ln N$ (in nats), so $e^H = N$. So if I oversample, the estimator should have no problem finding that there are 1000 distinct numbers in this discrete distribution. And indeed:

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), PYM) |> exp
1007.6118832749003

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), NSB) |> exp
981.4373913617699

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), ChaoShen) |> exp
954.4725165955898

For fewer samples, the estimates deviate more, but are still reasonably close:

julia> using Statistics

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), PYM) |> exp for _ in 1:500)
1381.6990603254876

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), PYM) |> exp for _ in 1:500)
1379.908995113134

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), NSB) |> exp for _ in 1:500)
629.4125878097738

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), NSB) |> exp for _ in 1:500)
629.793778092218

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), ChaoShen) |> exp for _ in 1:500)
1094.844219101877

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), ChaoShen) |> exp for _ in 1:500)
1097.9925117939667

Now Unseen:

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.752808081795651

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.527995830000302

julia> estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp
12.812724640864127

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
2.7727911829598364

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
2.7721745172018806

This result is very far from the ground truth and far from the claim in the paper that the error of the estimate should approach zero for this distribution and 1000 samples (upper left plot in Figure 1).

Can you check your code again and test against more examples with the original MATLAB code?
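
Something like the following could serve as a rough regression test once this is fixed (the test name and the 10 % tolerance are just my suggestions):

using Test, Statistics, DiscreteEntropy

@testset "Unseen on uniform distribution 1:1000 (Valiant & Valiant, Fig. 1)" begin
    # 1000 samples from a uniform distribution with 1000 distinct outcomes;
    # exp(H) should come out near the true support size of 1000.
    h = mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) for _ in 1:100)
    @test isapprox(exp(h), 1000; rtol=0.1)
end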

nluetts commented 6 days ago

[...] It's true that unseen source code doesn't seem to be available as a repo anywhere that I could find, but it is included as an appendix in the paper. I worked from this appendix. [...]

In case you have not seen it:

https://theory.stanford.edu/~valiant/code.html

(Could also be added as a reference to the paper.)

kellino commented 4 days ago

@nluetts I've added more tests to catch the previously uncaught bug, but I'm not sure what's happening with the most recent examples you've posted. It'll probably take me a few days to solve this, as time is a bit limited at the moment.

kellino commented 6 hours ago

@nluetts a genuinely stupid bug that took me a while to actually find; it only revealed itself when the sample size / support size ratio became relatively large. This produces a much saner answer now:

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 10000))), Unseen) |> exp for _ in 1:500)
990.5554343074119

nluetts commented 5 hours ago

@kellino Great! And it now also works quite well for fewer samples, as advertised in the paper:

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
987.5433915762957

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 1000))), Unseen) |> exp for _ in 1:500)
987.8677858039398

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 100))), Unseen) |> exp for _ in 1:500)
1074.078845471124

julia> mean(estimate_h(from_samples(svector(rand(1:1000, 100))), Unseen) |> exp for _ in 1:500)
1027.9313289893134