
:package: R package to easily web scrape Glassdoor company reviews. Write up of demo:
https://mguideng.github.io/2019-02-27-scrape-glassdoor-gdscrapeR/

gdscrapeR: scrape Glassdoor company reviews in R


ABOUT

gdscrapeR is an R package that scrapes company reviews from Glassdoor using a single function: get_reviews. It returns the review text as a data frame, which can then be prepped for text analytics learning projects.

INSTALL & LOAD

The latest version from GitHub:

install.packages("devtools")
devtools::install_github("mguideng/gdscrapeR")

library(gdscrapeR)

USAGE

Example

The URL to scrape the awesome SpaceX company will be: www.glassdoor.com/Reviews/SpaceX-Reviews-E40371.htm.

[Image: spacex-url]

Function

Pass the company number to the get_reviews function. The company number is a string representing a company's unique ID number. To find it, navigate to the company's Glassdoor reviews web page and look in the URL for the characters between "Reviews-" and ".htm" (usually an "E" followed by digits).

# Create data frame of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- get_reviews(companyNum = "E40371")

This will scrape the following variables: Date, Summary, Rating, Title, Pros, Cons, and Helpful.

Result

[Image: spacex-results]
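The company number can also be pulled out of the URL with a regular expression instead of being read off by eye. This is a minimal sketch using base R; the variable names `url` and `company_num` are illustrative and not part of gdscrapeR, and the pattern assumes the URL has the same shape as the SpaceX example.

```r
# URL from the SpaceX example; other Glassdoor review URLs are assumed to follow the same shape
url <- "www.glassdoor.com/Reviews/SpaceX-Reviews-E40371.htm"

# Capture the characters between "Reviews-" and ".htm"
company_num <- sub(".*Reviews-(.+)\\.htm.*", "\\1", url)

# The captured ID can then be passed on (requires gdscrapeR and a network connection):
# df <- get_reviews(companyNum = company_num)
```

The scraping call is left commented out so that the snippet runs without the package installed.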

PREP FOR TEXT ANALYTICS

RegEx & Export

Use regular expressions to clean and extract additional variables and then export:

#### REGEX ####
# Package
library(stringr)    # pattern matching functions

# Add: PriKey (uniquely identify rows 1 to N, sorted from first to last review by date)
df$rev.pk <- as.numeric(rownames(df))

# Extract: Year, Status, Position, Location 
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))

# str_extract returns a character vector (str_match would return a one-column matrix)
df$rev.stat <- str_extract(df$rev.title, ".+?(?= Employee -)")

df$rev.pos <- str_replace_all(df$rev.title, ".* Employee - |\\sin .*|\\s$", "")

df$rev.loc <- sub(".*\\sin ", "", df$rev.title)
# Titles without a location leave the status text (or an empty string) behind; flag those rows
df$rev.loc <- ifelse(grepl("Former Employee|Current Employee|^\\s*$", df$rev.loc),
                     "Not Given", df$rev.loc)

# Clean: Pros, Cons, Helpful
df$rev.pros <- gsub("&amp;", "&", df$rev.pros)

df$rev.cons <- gsub("&amp;", "&", df$rev.cons)

df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))

#### EXPORT ####
write.csv(df, "df-results.csv", row.names = FALSE)
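To see what these patterns do, here is a small sketch that runs them on made-up rows. The sample values, the "Helpful (12)" format, and the `toy` data frame are assumptions for illustration, not real Glassdoor output. The sketch uses base R equivalents (regexpr with perl = TRUE, gsub) in place of the stringr calls so that it runs without extra packages.

```r
# Made-up rows in the shape the patterns expect (not real Glassdoor data)
toy <- data.frame(
  rev.date  = c("Feb 27, 2019", "Jan 5, 2018"),
  rev.title = c("Current Employee - Propulsion Engineer in Hawthorne, CA",
                "Former Employee - Technician"),
  rev.helpf = c("Helpful (12)", "Helpful (3)"),
  stringsAsFactors = FALSE
)

# Year: drop everything up to and including the ", " before the year
toy$rev.year <- as.numeric(sub(".*, ", "", toy$rev.date))

# Status: lazy match up to " Employee -" (perl = TRUE enables the lookahead).
# regmatches() drops non-matching rows, so this form assumes every title matches.
toy$rev.stat <- regmatches(
  toy$rev.title,
  regexpr(".+?(?= Employee -)", toy$rev.title, perl = TRUE)
)

# Position: strip the status prefix, the " in <location>" suffix, and a trailing space
toy$rev.pos <- gsub(".* Employee - |\\sin .*|\\s$", "", toy$rev.title)

# Location: keep what follows " in "; titles without a location are flagged
toy$rev.loc <- sub(".*\\sin ", "", toy$rev.title)
toy$rev.loc <- ifelse(grepl("Former Employee|Current Employee|^\\s*$", toy$rev.loc),
                      "Not Given", toy$rev.loc)

# Helpful: keep only the digits and convert to a number
toy$rev.helpf <- as.numeric(gsub("\\D", "", toy$rev.helpf))

print(toy)
```

The second row has no " in " in its title, so it shows how a missing location ends up as "Not Given".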

Exploration ideas

gdscrapeR was made for learning purposes. Analyze the unstructured text, extract relevant information, and transform it into useful insights.
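One possible starting point is a simple word count over the review text. The sketch below uses base R only and a made-up `pros` vector standing in for df$rev.pros; it is an illustration of the idea, not a feature of gdscrapeR.

```r
# Made-up review text standing in for df$rev.pros
pros <- c("Great mission and smart people",
          "Smart people, great benefits")

# Lowercase, split on anything that is not a letter, and drop empty tokens
words <- unlist(strsplit(tolower(pros), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each term and rank from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts))
```

From here, removing common stop words or comparing counts between Pros and Cons would be natural next steps.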

If you find this package useful, feel free to star :star: it. Thanks for visiting :heart: .

NOTES