quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

Yule's K computation #46

Closed ybestgen closed 2 years ago

ybestgen commented 2 years ago

In the formula for calculating Yule's K, N (ntokens) is subtracted from the sum in the numerator. See:

This is what koRpus does, but, according to my calculations, this is not the case for quanteda.
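For reference, the two variants compared in the worked examples below can be written as follows, where N is the number of tokens and V_m is the number of types that occur exactly m times. The labels K_sub and K_nosub are introduced here only for illustration; this is a sketch of the arithmetic in this report, not a quotation of either package's source code.

```latex
K_{\text{sub}} = 10^{4} \cdot \frac{\sum_{m} m^{2} V_{m} - N}{N^{2}}
\qquad
K_{\text{nosub}} = 10^{4} \cdot \frac{\sum_{m} m^{2} V_{m}}{N^{2}}
```

The two expressions differ by exactly 10^4 / N, which is 1000 for a 10-token text and about 714.29 for a 14-token text.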

Text k1

a b c d d e e f f f

m  Vm
1  3 : 3 * 1 * 1 +
2  2 : 2 * 2 * 2 +
3  1 : 1 * 3 * 3 =

          (20 - 10) / 100 * 10000 = 1000
          (20)      / 100 * 10000 = 2000

Quanteda

  doc    TTR C         R        CTTR     U        S    K    I        D         Vm        Maas      lgV0     lgeV0
1 k1.txt 0.6 0.7781513 1.897367 1.341641 4.507576 -Inf 2000 2.571429 0.1111111 0.1825742 0.4710082 1.238943 2.852771

koRpus

K@K.ld
[1] 1000

==================

Text k2

a b c d d e e f f f g g g g

m  Vm
1  3 : 3 * 1 * 1 +
2  2 : 2 * 2 * 2 +
3  1 : 1 * 3 * 3 +
4  1 : 1 * 4 * 4 =

          (36 - 14) / 14 / 14 * 10000 = 1122.448979591836735
          (36)      / 14 / 14 * 10000 = 1836.73469387755102

Quanteda

  doc    TTR C         R        CTTR     U        S         K        I        D         Vm        Maas      lgV0     lgeV0
2 k2.txt 0.5 0.7373505 1.870829 1.322876 4.363716 -1.233987 1836.735 1.689655 0.1208791 0.2020305 0.4787092 1.251051 2.880652

koRpus

K@K.ld
[1] 1122.449
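The hand calculations for k1 and k2 can be re-checked independently of either package. The following is a minimal Python sketch, not the quanteda or koRpus implementation; the function name `yule_k` and its `subtract_n` flag are hypothetical and exist only to switch between the two numerators discussed above.

```python
from collections import Counter


def yule_k(tokens, subtract_n=True):
    """Yule's K for a list of tokens (illustrative helper, not a library API).

    subtract_n=True  -> 10^4 * (sum(m^2 * Vm) - N) / N^2
    subtract_n=False -> 10^4 * sum(m^2 * Vm) / N^2
    """
    n = len(tokens)                            # N: number of tokens
    type_freqs = Counter(tokens)               # type -> frequency m
    spectrum = Counter(type_freqs.values())    # m -> Vm (types occurring m times)
    s2 = sum(m * m * vm for m, vm in spectrum.items())
    numerator = s2 - n if subtract_n else s2
    return 1e4 * numerator / (n * n)


k1 = "a b c d d e e f f f".split()
k2 = "a b c d d e e f f f g g g g".split()

for name, toks in (("k1", k1), ("k2", k2)):
    with_sub = round(yule_k(toks), 3)
    without_sub = round(yule_k(toks, subtract_n=False), 3)
    print(name, with_sub, without_sub)
# k1 1000.0 2000.0
# k2 1122.449 1836.735
```

With N subtracted, the sketch reproduces the koRpus values (1000 and 1122.449); without the subtraction, it reproduces the K column reported by quanteda (2000 and 1836.735).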

All the best,

Yves

kbenoit commented 2 years ago

Thanks for pointing this out! Will look into it asap. Moving to quanteda.textstats since this is where textstat_readability() is now found.