I should really add some more warnings for the count option. It's a bit of an approximation and will probably only work well for single words (1-grams) without operators (e.g. +).
When you said your test case was (html + HTML), did you call ngram("(html + HTML)", corpus = "eng_fiction_2019")?
I think I've got it. I was including the corpus in the query string rather than as a separate parameter. The code below illustrates the problem and shows it apparently going away if I use the corpus parameter.
q1 <- "(html + HTML):eng_2019"
q2 <- "(html + HTML):eng_fiction_2019"
q3 <- "(html + HTML)"

n1 <- ngram(q1, year_start=1945, year_end=2020, count=TRUE, smoothing=0)
n2 <- ngram(q2, year_start=1945, year_end=2020, count=TRUE, smoothing=0)

n1a <- ngram(q3, corpus="eng_2019", year_start=1945, year_end=2020, count=TRUE, smoothing=0)
n2a <- ngram(q3, corpus="eng_fiction_2019", year_start=1945, year_end=2020, count=TRUE, smoothing=0)
The corpus included in the string isn't reflected in the Corpus column of the returned dataframe, but the numbers are different.
Does your caveat imply that trying to derive a non-fiction figure is not going to be possible?
When you specify the corpus in the phrase, that certainly messes up the count (I'll have to look into what I can do about that), so your calculations n1a and n2a will be closer to being right, but the + still leads to a small error. It's a little bit painful, but the most accurate results will be obtained if you query for HTML and html separately:
n1 <- ngram(c("html", "HTML"), corpus="eng_2019", year_start=1945, year_end=2020,
count=TRUE, smoothing=0) %>%
mutate(Phrase = stringr::str_to_lower(Phrase)) %>%
group_by(Year, Corpus, Phrase) %>%
summarise(Frequency = sum(Frequency), Count = sum(Count), .groups = "keep")
You can then do the same thing with the fiction corpus.
I will think about any changes I can make to make this more efficient!
Actually you can do it in one go:
n1 <- ngram(c("html", "HTML"), corpus=c("eng_2019", "eng_fiction_2019"), year_start=1945,
year_end=2020, count=TRUE, smoothing=0) %>%
mutate(Phrase = stringr::str_to_lower(Phrase)) %>%
group_by(Year, Corpus, Phrase) %>%
summarise(Frequency = sum(Frequency), Count = sum(Count), .groups = "keep")
Using this last technique then allows you to calculate non-fiction:
n2 <- n1 %>%
  pivot_wider(id_cols = c(Year, Phrase), names_from = Corpus, values_from = Count) %>%
  mutate(eng_nonfiction_2019 = eng_2019 - eng_fiction_2019)
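If you also want an implied non-fiction frequency rather than just the count, one rough option is to back out each corpus's 1-gram total as Count / Frequency (the count is calculated as frequency times the number of 1-grams in the corpus) and take the difference of the totals as well. A sketch along those lines; the Total and nonfiction_* names are just illustrative, not anything returned by the package:

n3 <- n1 %>%
  ungroup() %>%
  filter(Frequency > 0) %>%                      # avoid dividing by zero
  mutate(Total = Count / Frequency) %>%          # implied corpus 1-gram total
  pivot_wider(id_cols = c(Year, Phrase), names_from = Corpus,
              values_from = c(Count, Total)) %>%
  mutate(nonfiction_count = Count_eng_2019 - Count_eng_fiction_2019,
         nonfiction_total = Total_eng_2019 - Total_eng_fiction_2019,
         nonfiction_freq  = nonfiction_count / nonfiction_total)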
That's really good! I've got plenty to play with there.
You mentioned problems with phrases; how severe are they likely to be? For example, if I'm looking at something like "software engineering", I'm stuck with a 2-gram.
My use case doesn't require particular precision; it's more about accounting for words/phrases that are common in fiction and muddy the technical waters.
I have a data set in the package with the 1-gram counts for each corpus, and I calculate the count as frequency x (number of 1-grams). To calculate the count for a 2-gram I (roughly) assume that the number of 2-grams is one less than the number of 1-grams (e.g. in the text "a b c d" there are four 1-grams and three 2-grams, namely "a b", "b c", "c d"). More generally, I calculate the n-gram count as frequency x (1-gram count - n + 1).

I then try to adjust further by taking separate books into account, as frequency x (1-gram count - (number of books) x (n - 1)), but at the moment I use the number of pages, which is wrong. Probably the right answer is to adjust by chapter, but that data is not available. Better still, I'd use n-gram counts for every n, but I don't have those right now. Still, it won't be too far off, so it should be fine for most purposes.
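For concreteness, here's a rough sketch of that approximation in R (approx_count is just an illustrative helper, not a function in the package, and the numbers are made up):

# Illustrative sketch of the count approximation described above.
# total_1grams and n_books are hypothetical inputs, not package data.
approx_count <- function(frequency, total_1grams, n_books, n) {
  # Each book of length L (in 1-grams) contains L - (n - 1) n-grams,
  # so across the corpus: total n-grams is roughly total 1-grams - n_books * (n - 1).
  frequency * (total_1grams - n_books * (n - 1))
}

# e.g. a 2-gram with frequency 2e-6 in a toy corpus of 1e6 1-grams spread over 100 books
approx_count(frequency = 2e-6, total_1grams = 1e6, n_books = 100, n = 2)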
The reason "HTML + html" doesn't quite work is that the code for calculating counts currently thinks that is a 3-gram so the 3-gram adjustment is used rather than the 1-gram calculation.
Some room for improvement there!
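To get a feel for the size of that effect, using the illustrative approx_count sketch above (made-up numbers again):

approx_count(frequency = 2e-6, total_1grams = 1e6, n_books = 100, n = 1)  # intended 1-gram adjustment
approx_count(frequency = 2e-6, total_1grams = 1e6, n_books = 100, n = 3)  # roughly what "(html + HTML)" currently gets

The relative difference is tiny for a large corpus, which is why the + only leads to a small error.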
Actually - I've made that heavier going than it needs to be. I'd forgotten about the case insensitive option.
ngram("html", corpus=c("eng_2019", "eng_fiction_2019"), year_start=1945,
year_end=2020, count=TRUE, smoothing=0, aggregate = TRUE,
case_ins = TRUE, drop_all = TRUE) %>%
pivot_wider(id_cols = c(Year, Phrase), names_from = Corpus,
values_from = Count) %>%
mutate(eng_nonfiction_2019 = eng_2019 - eng_fiction_2019)
should get you what you need.
I'm trying to use the 'count' parameter to derive the non-fiction frequencies from the eng_2019 and eng_fiction_2019 data. My assumption was that the eng_fiction_2019 count would always be less than or equal to the eng_2019 count. This does not appear to be the case in all instances.
I'm also assuming that count/frequency gives the corpus total, and that the differences between the counts and totals allow me to calculate the non-fiction frequency.
Have I got this wrong?
My test case is (html + HTML).
Cheers, Andrew