seancarmody / ngramr

R package to query the Google Ngram Viewer

R and Google plots not lining up #34

Closed bfbraum closed 2 years ago

bfbraum commented 2 years ago

Hi, and thanks for creating this package! I'm having trouble getting it to work for me, and I hope you can offer some advice.

I've got it to the point where it downloads and plots the ngram data for me, but the plot really doesn't resemble the equivalent (?) plot I'm getting from Google. The Google graph is here:

https://books.google.com/ngrams/graph?content=%22international+order%22%2C%22international+institutions%22%2C%22international+regimes%22&year_start=1900&year_end=2019&corpus=26&smoothing=2&case_insensitive=true#

I've tried to reproduce it with the following code, borrowed / modified from Daisung Jang's tutorial at https://daisungjang.com/tutorial/Ngram_tutorial.html:

library(ngramr)
data <- as.data.frame(matrix(ncol=1, nrow=109))
data$V1 <- seq(from=1900, to=2008)
names(data)[names(data)=="V1"] <- "Year"
search_terms <- c("international order", "international institutions", "international regimes")

for(i in 1:length(search_terms)){

  # Get the current search term
  term <- search_terms[i]

  # Query the term in the English 2019 corpus from 1900 onward (smoothing = 2),
  # then store the output in a data frame
  temp <- ngram(term, year_start = 1900, corpus="eng_2019", smoothing = 2)

  # Merge the ngram data with the data frame created above, matching by year;
  # all.x=TRUE keeps only the years already in `data` (1900 to 2008)
  data <- merge(data, temp[,c("Year", "Frequency")], by ="Year", all.x=TRUE)

  # Use the search term as the column name
  colname <- paste(term, sep="")

  # Rename the newly added Frequency column to the search term
  names(data)[names(data)=="Frequency"] <- colname

}

data_long <- reshape(data, 
                     varying = c("international order", "international institutions", "international regimes"), 
                     v.names = "Frequency",
                     timevar = "search_term", 
                     times = c("international order", "international institutions", "international regimes"), 
                     direction = "long")

library(ggplot2)

p <- ggplot(data_long, aes(x=Year, y=Frequency, group=search_term))

p +  geom_line(aes(colour = search_term))

As you'll see, the trends look very different. In the Google version, there's a surge in the use of the phrase "international institutions" after WWII; in the R version, there's a nearly identical surge, but in the use of a different phrase, "international order." That term then more or less flatlines in the Google version but continues to climb in the R version. The curve for "international regimes" is approximately right, but not exactly, and it maps to about the same y-axis scale as it does on the Google version, while the others appear to be on very different scales.

All in all, there are enough similarities to make me suspect that I'm more or less on the right track, but the overall pictures are dramatically different. I've tried varying all the ngramr parameters that I can find, but no combination I've tried produces a graph that looks like Google's. Any help is appreciated, and apologies in advance if this is a me problem.

seancarmody commented 2 years ago

Let me take a look and get back to you...

seancarmody commented 2 years ago

It looks as though something unusual is happening as a result of the inverted commas. If you have a look at the Google ngram viewer page linked to below, you'll see the same result as the ngramr code generates.

https://books.google.com/ngrams/graph?content=international+institutions%2Cinternational+order%2Cinternational+regimes&year_start=1900&year_end=2019&corpus=26&smoothing=2

Note that this chart is case sensitive, so will not include the variants with the i's capitalised.
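
To make the difference between the two viewer links easier to see, here is a minimal sketch that builds both URLs from the same phrases. The helper build_ngram_url is hypothetical (it is not part of ngramr), and the query parameters are simply copied from the links in this thread; the only difference between the two results is whether each phrase is wrapped in %22, the encoded inverted comma.

```r
# Hypothetical helper, not part of ngramr: rebuild the viewer URLs from this thread.
build_ngram_url <- function(phrases, quoted = FALSE,
                            year_start = 1900, year_end = 2019,
                            corpus = 26, smoothing = 2) {
  # Optionally wrap each phrase in inverted commas, as in the original link.
  if (quoted) phrases <- paste0('"', phrases, '"')
  # Percent-encode each phrase, then use "+" for spaces as the thread's URLs do.
  encoded <- vapply(phrases, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  encoded <- gsub("%20", "+", encoded, fixed = TRUE)
  paste0("https://books.google.com/ngrams/graph?content=",
         paste(encoded, collapse = "%2C"),
         "&year_start=", year_start, "&year_end=", year_end,
         "&corpus=", corpus, "&smoothing=", smoothing)
}

terms <- c("international institutions", "international order", "international regimes")
plain_url  <- build_ngram_url(terms)                 # phrases as typed
quoted_url <- build_ngram_url(terms, quoted = TRUE)  # phrases wrapped in %22
print(plain_url)
print(quoted_url)
```

Opening the two printed URLs side by side shows the two charts being compared here.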

I'm not sure exactly what is happening with the ngram chart you've created directly in the Google viewer, but I note there are warning messages displayed, including

Replaced "international order" with " international order " to match how we processed the books.

Also, the frequencies in the chart with inverted commas are far lower (by two orders of magnitude) than in the chart without them, so the quoted search appears to be missing a lot of cases. I would therefore suggest that the results you are getting from the ngramr code are in fact more accurate. To ensure that you get a case-insensitive search, you can use the parameters case_ins=TRUE and aggregate=TRUE (without the latter, case variants such as 'international institutions' and 'International institutions' are returned as separate series).
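
As a rough illustration of what combining case variants means, the sketch below sums made-up frequencies for two spellings of one phrase, year by year. It only shows the idea on toy data using base R; it is not ngramr's internal code, and the Phrase column name and all values are assumptions for the example.

```r
# Toy data with invented frequencies for two case variants of one phrase.
# Year and Frequency mirror the columns used earlier in this thread;
# the Phrase column and every value here are assumptions for illustration.
toy <- data.frame(
  Year      = c(1950, 1950, 1951, 1951),
  Phrase    = c("international institutions", "International institutions",
                "international institutions", "International institutions"),
  Frequency = c(2e-06, 5e-07, 3e-06, 1e-06),
  stringsAsFactors = FALSE
)

# Map every case variant onto one lower-case key, then sum per year and key.
toy$Key <- tolower(toy$Phrase)
combined <- aggregate(Frequency ~ Year + Key, data = toy, FUN = sum)
combined <- combined[order(combined$Year), ]
print(combined)
```

Without the summing step, each spelling would be plotted as its own line.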

As an aside, the sample code is a little long-winded and you can instead use something like this:

ggram(c("international order", "international institutions", "international regimes"),
      year_start=1900, year_end=2019, smoothing=2, case_ins=TRUE, aggregate=TRUE)

or

data_long <- ngram(c("international order", "international institutions", "international regimes"), 
                   year_start=1900, year_end=2019, smoothing=2, case_ins=TRUE, aggregate=TRUE)
ggram(data_long)

While that doesn't completely clarify what is going on, with any luck this is enough to keep you going. Let me know how you go.

bfbraum commented 2 years ago

Ah, fantastic, thanks so much. I had no idea that the ngram interface was so fragile. Noted for future reference, and thanks for a really cool and easy-to-use package (easier than the ngram viewer itself, it turns out).

seancarmody commented 2 years ago

No problem. Happy to help! I hadn't realised this particular peculiarity myself. I'm also conscious that the fragility of the interface can translate to fragility of the package since it just scrapes calls to the web page.

This comparison highlights more clearly the difference between searches with and without inverted commas:

https://books.google.com/ngrams/graph?content=%22international+institutions%22%2Cinternational+institutions&year_start=1800&year_end=2019&corpus=26&smoothing=3&direct_url=t1%3B%2C%22%20international%20institutions%20%22%3B%2Cc0%3B.t1%3B%2Cinternational%20institutions%3B%2Cc0