lucid-dreams opened this issue 1 week ago
Thanks for pointing this out: I will have a look. I think the problem is that, since the web page the data is scraped from doesn't provide counts, I have to use an approximate calculation based on a separate table of data which provides 1-gram counts by year. I don't think that calculation will work well with operators. I will try to investigate.
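A minimal sketch (in Python, for illustration only) of the kind of approximation described above: scraped relative frequencies multiplied by a separate table of total 1-gram counts per year. All names and numbers here are invented, not the package's actual API or data:

```python
# Estimate absolute match counts from scraped relative frequencies,
# using a separate table of total 1-gram counts per year.
# All values below are illustrative, not real corpus data.

total_1grams_by_year = {
    2000: 80_000_000_000,  # hypothetical total grams for the year
    2001: 82_000_000_000,
}

# Relative frequencies as scraped from the viewer (fractions of the corpus).
scraped_frequency = {
    ("example", 2000): 1.5e-6,
    ("example", 2001): 1.4e-6,
}

def estimated_count(phrase: str, year: int) -> int:
    """Approximate count = scraped frequency * total grams in that year."""
    return round(scraped_frequency[(phrase, year)] * total_1grams_by_year[year])

print(estimated_count("example", 2000))
```

Because the denominator table holds 1-gram totals, this approximation is only exact for plain 1-grams; composite expressions built with operators break the assumption.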
I've just realised there is another problem: I am also looking at doing an update for the latest corpus that has been released, but it looks as though Google hasn't published n-gram counts for it.
Some counts are off by 2 to 3 % in version 1.9.3:
However, the frequencies do seem to add up perfectly. The count is well defined and should be additive: Count(A + B) = Count(A) + Count(B).
Yet "(der + die)" is no n-gram at all but either a "n-gram set specifier" (select all n-grams that are either "der" or "die") or it is an expression for using google as a calculator. Because e.g. "(cat + cat)" results in twice the frequency, it has no intuitive set interpretation and is probably meant in the the calculator sense.
As frequency is empirically additive in the Google interface (and according to their help page), frequency must be: Frequency(A + B) = Frequency(A) + Frequency(B).
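A toy sketch of that additive, calculator-style behaviour (frequency values invented), which also reproduces the "(cat + cat)" doubling:

```python
# Illustrative per-term frequencies; the viewer's "+" just adds series pointwise.
freq = {"cat": 2.0e-6, "dog": 1.0e-6}

def composite(terms):
    """'(a + b)' interpreted as plain addition of frequency series."""
    return sum(freq[t] for t in terms)

print(composite(["cat", "cat"]))  # twice the frequency of "cat"
print(composite(["cat", "dog"]))
```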
This would make the result one of "mixed frequencies": the frequencies of n-grams with different n are fractions of different totals.
Knowing both, counts and frequencies, it's possible to derive the total count of grams: Total(year) = Count(ngram, year) / Frequency(ngram, year).
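That division can be sketched with made-up numbers (chosen to match the illustrative values above, not real corpus figures):

```python
# If count and frequency refer to the same corpus and year,
# the yearly total can be recovered as count / frequency.
count_example = 120_000        # hypothetical absolute count in one year
frequency_example = 1.5e-6     # hypothetical relative frequency that year

total_grams = count_example / frequency_example
print(round(total_grams))
```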
The bug results in rather large estimation errors. "(der + die)" with the fixed Count results in Count/Frequency = 72711921, which would probably be correct. "(der die + die)" with the fixed Count results in Count/Frequency = 72709686, which is somewhat confusing.
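The discrepancy can be reproduced with toy numbers: when a summed expression mixes a 1-gram and a 2-gram, the frequencies are fractions of different totals, so dividing the summed count by the summed frequency recovers neither total. All values here are invented:

```python
# Toy corpus: 1-gram and 2-gram totals differ, as in real corpora.
total_1grams = 100_000
total_2grams = 90_000

count_die = 900          # 1-gram "die" (illustrative)
count_der_die = 450      # 2-gram "der die" (illustrative)

freq_die = count_die / total_1grams          # fraction of all 1-grams
freq_der_die = count_der_die / total_2grams  # fraction of all 2-grams

# "(der die + die)": the viewer sums counts and frequencies separately.
summed_count = count_der_die + count_die
summed_freq = freq_der_die + freq_die

implied_total = summed_count / summed_freq
print(implied_total)  # lands between the two totals, matching neither
```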
I haven't yet managed to extract counts from the Google interface, although they are important for gauging sampling risks. For example, nouns seem to have become less frequent since the year 2000.