seancarmody / ngramr

R package to query the Google Ngram Viewer

Counts question #33

Closed: lostchord closed this issue 2 years ago

lostchord commented 3 years ago

I'm trying to use the 'counts' parameter to derive the non-fiction frequencies from the eng_2019 and eng_fiction_2019 data. My assumption was that the eng_fiction_2019 count would always be less than or equal to the eng_2019 count. This does not appear to be the case in all instances.

I'm also assuming that count/frequency is the total and that the differences between the counts and totals allows me to calculate the non-fiction frequency.

Have I got this wrong?

My test case is (html + HTML).

Cheers, Andrew

seancarmody commented 3 years ago

I should really add some more warnings in for the count option. It's a bit of an approximation and will probably only work well for single words (1-grams) and without operators (e.g. +).

When you said your test case was (html + HTML), did you call ngram("(html + HTML)", corpus = "eng_fiction_2019")?

lostchord commented 3 years ago

I think I've got it. I was including the corpus in the query string rather than as a separate parameter. The code below illustrates the problem and shows it apparently going away if I use the corpus parameter.

q1 <- "(html + HTML):eng_2019"
q2 <- "(html + HTML):eng_fiction_2019"
q3 <- "(html + HTML)"

n1 <- ngram(q1, year_start=1945, year_end=2020, count=TRUE, smoothing=0)
n2 <- ngram(q2, year_start=1945, year_end=2020, count=TRUE, smoothing=0)

n1a <- ngram(q3, corpus="eng_2019", year_start=1945, year_end=2020, count=TRUE, smoothing=0)
n2a <- ngram(q3, corpus="eng_fiction_2019", year_start=1945, year_end=2020, count=TRUE, smoothing=0)

The corpus included in the query string isn't reflected in the Corpus column of the returned data frame, but the numbers are different.

lostchord commented 3 years ago

Does your caveat imply that trying to derive a non-fiction figure is not going to be possible?

seancarmody commented 3 years ago

When you specify the corpus in the phrase, that certainly messes up the count (I'll have to look into what I can do about that). So your calculations n1a and n2a will be closer to being right, but the + still leads to a small error. It's a little bit painful, but the most accurate results will be obtained if you query for HTML and html separately:

library(ngramr)
library(dplyr)

n1 <- ngram(c("html", "HTML"), corpus="eng_2019", year_start=1945, year_end=2020, 
            count=TRUE, smoothing=0) %>% 
  mutate(Phrase = stringr::str_to_lower(Phrase)) %>% 
  group_by(Year, Corpus, Phrase) %>%
  summarise(Frequency = sum(Frequency), Count = sum(Count), .groups = "keep")

You can then do the same thing with the fiction corpus.
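
For example, a sketch of the same call against the fiction corpus (the object name here is just illustrative):

n1_fiction <- ngram(c("html", "HTML"), corpus="eng_fiction_2019", year_start=1945, year_end=2020,
                    count=TRUE, smoothing=0) %>%
  mutate(Phrase = stringr::str_to_lower(Phrase)) %>%
  group_by(Year, Corpus, Phrase) %>%
  summarise(Frequency = sum(Frequency), Count = sum(Count), .groups = "keep")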

seancarmody commented 3 years ago

I will think about any changes I can make to make this more efficient!

seancarmody commented 3 years ago

Actually you can do it in one go:

n1 <- ngram(c("html", "HTML"), corpus=c("eng_2019", "eng_fiction_2019"),  year_start=1945, 
            year_end=2020, count=TRUE, smoothing=0) %>% 
  mutate(Phrase = stringr::str_to_lower(Phrase)) %>% 
  group_by(Year, Corpus, Phrase) %>%
  summarise(Frequency = sum(Frequency), Count = sum(Count), .groups = "keep")

seancarmody commented 3 years ago

Using this last technique then allows you to calculate non-fiction:

library(tidyr)

n2 <- n1 %>%
  pivot_wider(id_cols = c(Year, Phrase), names_from = Corpus, values_from = Count) %>%
  mutate(eng_nonfiction_2019 = eng_2019 - eng_fiction_2019)

lostchord commented 3 years ago

That's really good! I've got plenty to play with there.

You mentioned problems with phrases; how severe are they likely to be? For example, if I'm looking at something like "software engineering", I'm stuck with a 2-gram.

My use case doesn't require particular precision; it's more about accounting for words/phrases that are common in fiction and muddy the technical waters.

seancarmody commented 3 years ago

I have a data set in the package with the 1-gram counts for each corpus, and I calculate the count as frequency x (number of 1-grams). To calculate for a 2-gram I (roughly) assume that the number of 2-grams is one less than the number of 1-grams (e.g. in the text "a b c d" there are four 1-grams and three 2-grams, namely "a b", "b c", "c d"). More generally, I'd calculate the n-gram count as frequency x (1-gram count - n + 1).

I then try to adjust further by taking separate books into account, as frequency x (1-gram count - (number of books) x (n - 1)), but at the moment I use the number of pages, which is wrong. Probably the right answer is to adjust by chapter, but that data is not available. Better still, I'd use n-gram counts for every n, but I don't have those right now. Still, it won't be too far off, so it should be fine for most purposes.
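
A rough sketch of that arithmetic in R (the numbers and names are made up purely to illustrate the approximation, not the package's internals):

total_1grams <- 1e9    # 1-grams in the corpus for a given year (illustrative value)
n_books      <- 1e5    # number of separate volumes (illustrative value)
n            <- 2      # n-gram length, e.g. "software engineering"
frequency    <- 2.5e-7 # relative frequency returned by the Ngram Viewer

# A single continuous text of T 1-grams contains T - (n - 1) n-grams
ngrams_single_text <- total_1grams - (n - 1)

# Each separate volume loses (n - 1) n-grams at its boundary
ngrams_adjusted <- total_1grams - n_books * (n - 1)

# Approximate count: frequency times the estimated number of n-grams
count_estimate <- frequency * ngrams_adjusted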

The reason "HTML + html" doesn't quite work is that the code for calculating counts currently thinks it is a 3-gram, so the 3-gram adjustment is used rather than the 1-gram calculation.
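
Roughly speaking (just to illustrate the miscount, not the package's actual code):

# A naive word count sees three tokens, hence the 3-gram adjustment
length(strsplit("html + HTML", "\\s+")[[1]])
#> [1] 3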

Some room for improvement there!

seancarmody commented 3 years ago

Actually, I've made that heavier going than it needs to be. I'd forgotten about the case-insensitive option.

ngram("html", corpus=c("eng_2019", "eng_fiction_2019"), year_start=1945,
      year_end=2020, count=TRUE, smoothing=0, aggregate = TRUE,
      case_ins = TRUE, drop_all = TRUE) %>%
    pivot_wider(id_cols = c(Year, Phrase), names_from = Corpus,
                values_from = Count) %>%
    mutate(eng_nonfiction_2019 = eng_2019 - eng_fiction_2019)

should get you what you need.