Yule's K computation

In the formula for calculating Yule's K, N (ntokens) is subtracted from the sum in the numerator. See:

Baayen (2001) Word Frequency Distributions, p. 25.
Jarvis (2002) Short texts, best fitting curves and new measures of lexical diversity, p. 59 (second formula).
Tweedie & Baayen (1998) How Variable May a Constant be? Measures of Lexical Richness in Perspective, p. 330, although there the formula is written in a rather unusual form that uses i/N.
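
Written out, the formula these sources describe can be read as below, where N is the number of tokens and V_m is the number of types that occur exactly m times. The notation is an assumption chosen to match the m / Vm columns used in this issue, not the symbols of any one source. The second form is the i/N variant, with m in place of i, and it is algebraically identical to the first:

```latex
K = 10^{4} \cdot \frac{\sum_{m} m^{2} V_{m} - N}{N^{2}}
  = 10^{4} \left[ -\frac{1}{N} + \sum_{m} V_{m} \left( \frac{m}{N} \right)^{2} \right]
```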

This is what koRpus does but, according to my calculations, quanteda does not: the K values it returns match the formula without the subtraction of N.

Text k1

a b c d d e e f f f

    m  Vm
    1  3 : 3 * 1 * 1 +
    2  2 : 2 * 2 * 2 +
    3  1 : 1 * 3 * 3 = 20

          (20 - 10) / 100 * 10000 = 1000
          (20)      / 100 * 10000 = 2000

Quanteda

         doc TTR         C        R     CTTR        U    S    K        I         D        Vm      Maas     lgV0    lgeV0
    1 k1.txt 0.6 0.7781513 1.897367 1.341641 4.507576 -Inf 2000 2.571429 0.1111111 0.1825742 0.4710082 1.238943 2.852771

koRpus

    K@K.ld
    [1] 1000

==================

Text k2

a b c d d e e f f f g g g g

    m  Vm
    1  3 : 3 * 1 * 1 +
    2  2 : 2 * 2 * 2 +
    3  1 : 1 * 3 * 3 +
    4  1 : 1 * 4 * 4 = 36

    (36 - 14) / 14 / 14 * 10000 = 1122.448979591836735
    (36)      / 14 / 14 * 10000 = 1836.73469387755102

Quanteda

         doc TTR         C        R     CTTR        U         S        K        I         D        Vm      Maas     lgV0    lgeV0
    2 k2.txt 0.5 0.7373505 1.870829 1.322876 4.363716 -1.233987 1836.735 1.689655 0.1208791 0.2020305 0.4787092 1.251051 2.880652

koRpus

    K@K.ld
    [1] 1122.449
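
The two hand calculations can be reproduced independently of either package. The following is a minimal sketch in Python rather than R, so that it relies on neither koRpus nor quanteda. The function name `yules_k` and the `subtract_n` flag are made up for this illustration and are not part of any library. It computes the frequency spectrum (m, Vm) from the tokens and evaluates both variants of the numerator.

```python
from collections import Counter


def yules_k(tokens, subtract_n=True):
    """Yule's K from a list of tokens (illustrative helper, not a library API).

    subtract_n=True  -> 10^4 * (sum(m^2 * Vm) - N) / N^2
    subtract_n=False -> 10^4 *  sum(m^2 * Vm)      / N^2
    """
    n = len(tokens)                             # N: number of tokens
    type_freqs = Counter(tokens)                # type -> frequency
    spectrum = Counter(type_freqs.values())     # m -> Vm (types occurring m times)
    s2 = sum(m * m * vm for m, vm in spectrum.items())
    numerator = s2 - n if subtract_n else s2
    return 10_000 * numerator / (n * n)


k1 = "a b c d d e e f f f".split()
k2 = "a b c d d e e f f f g g g g".split()

for name, tokens in (("k1", k1), ("k2", k2)):
    with_n = round(yules_k(tokens), 3)
    without_n = round(yules_k(tokens, subtract_n=False), 3)
    print(name, with_n, without_n)
# prints:
# k1 1000.0 2000.0
# k2 1122.449 1836.735
```

The first value on each line matches the koRpus output above, and the second matches the K column reported by quanteda.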

All the best,

Yves

quanteda / quanteda.textstats

Yule's K computation #46
