seancarmody / ngramr

R package to query the Google Ngram Viewer

Count inconsistencies #47

Open lucid-dreams opened 1 week ago

lucid-dreams commented 1 week ago

Some counts are off by roughly 2 to 3% in version 1.9.3:

> x <- ngram(c("der", "die", "der die", "der+die", "der die + die"), corpus = "de-2019", smoothing = 0, count = TRUE)
> x
# Ngram data table
# Phrases:      (der + die), (der die + die), der, der die, die
# Case-sensitive:   TRUE
# Corpuses:     de-2019
# Smoothing:        0
# Years:        1800-2019

  Year  Corpus          Phrase Frequency   Count
1 1800 de-2019 (der die + die) 2.205e-02 1560805
2 1800 de-2019         der die 7.745e-05    5582
3 1800 de-2019             der 2.403e-02 1747143
4 1800 de-2019             die 2.197e-02 1597683
5 1800 de-2019     (der + die) 4.600e-02 3285704
6 1801 de-2019 (der die + die) 2.254e-02 1618642
# ... with 1094 more rows

> (1597683+1747143)  # Count of "die" (row 4) + Count of "der" (row 3)
[1] 3344826
> 3285704-3344826 # Count of "(der + die)" (row 5) minus the sum above
[1] -59122
> -59122/3285704 # relative error
[1] -0.01799

However, the frequencies do seem to add up perfectly. The count is well defined and should be additive:

count of "(house + cat)" = (count of "house") + (count of "cat")

Yet "(der + die)" is no n-gram at all but either a "n-gram set specifier" (select all n-grams that are either "der" or "die") or it is an expression for using google as a calculator. Because e.g. "(cat + cat)" results in twice the frequency, it has no intuitive set interpretation and is probably meant in the the calculator sense.

As frequency is empirically additive in the Google interface (and according to their help page), the frequency must be:

frequency of "(cat + house cat)" = (frequency of "cat") + (frequency of "house cat") = (count of "cat" / count of 1-grams) + (count of "house cat" / count of 2-grams)

This would make the result a mixture of frequencies with different denominators ("mixed frequencies").
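Both identities check out against the rounded values printed above:

> 2.403e-02 + 2.197e-02  # freq("der") + freq("die")
[1] 0.046                # matches 4.600e-02 for "(der + die)"
> 7.745e-05 + 2.197e-02  # freq("der die") + freq("die")
[1] 0.02204745           # matches 2.205e-02 for "(der die + die)" up to rounding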

Knowing both counts and frequencies, it's possible to derive the total number of n-grams (the denominator):

> library(dplyr)
> mutate(x, Count/Frequency)[c(3,4,5,2,1),]
# Ngram data table
# Phrases:      (der + die), (der die + die), der, der die, die
# Case-sensitive:   
# Corpuses:     de-2019
# Smoothing:        
# Years:        1800-1800

  Year  Corpus          Phrase Frequency   Count Count/Frequency
3 1800 de-2019             der 2.403e-02 1747143        72711921 # 1-grams, probably correct
4 1800 de-2019             die 2.197e-02 1597683        72711922 # 1-grams, probably correct
5 1800 de-2019     (der + die) 4.600e-02 3285704        71426690 # 1-grams, should be: Count=3344826
2 1800 de-2019         der die 7.745e-05    5582        72075263 # 2-grams, probably correct
1 1800 de-2019 (der die + die) 2.205e-02 1560805        70784082 # 1:2-grams, should be: Count=1603265

The bug results in rather large estimation errors. "(der + die)" with the corrected Count gives Count/Frequency = 72711921, which is probably right. "(der die + die)" with the corrected Count gives Count/Frequency = 72709686, which is somewhat confusing, since a mixed expression has no single denominator.
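To make the correction concrete, a minimal sketch (the helper fixed_count is hypothetical, and it assumes the component phrases were queried alongside the composite, as above):

> # Hypothetical helper: rebuild a "+" composite's Count from its
> # components, whose counts are additive, instead of scaling a
> # possibly mixed frequency by a single 1-gram total.
> fixed_count <- function(components) sum(components$Count)
> fixed_count(subset(x, Phrase %in% c("der", "die") & Year == 1800))
[1] 3344826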

I haven't yet managed to extract counts from the Google interface, although counts matter for gauging sampling risks: nouns, for example, seem to become less frequent after the year 2000.

seancarmody commented 4 days ago

Thanks for pointing this out: I will have a look. I think the problem is that, since the web page the data is scraped from doesn't provide counts, I have to use an approximate calculation based on a separate table which provides 1-gram counts by year. I don't think that calculation works well with operators. I will try to investigate.
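Roughly speaking, the approximation looks something like this (the names approx_count and totals are illustrative, not the package's actual internals):

> # Estimate Count by scaling Frequency by the year's total 1-grams.
> approx_count <- function(frequency, year, totals) {
+   frequency * totals[as.character(year)]
+ }
> # Fine for plain 1-grams, but "(der die + die)" adds a 2-gram frequency
> # (denominator: total 2-grams) to a 1-gram frequency (denominator:
> # total 1-grams), so no single total reproduces the true Count.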

seancarmody commented 4 days ago

I've just realised there is another problem: I am also looking at updating the package for the latest corpus release, but it looks as though Google hasn't published n-gram counts for it.