kwartler / text_mining

This repo contains data from Ted Kwartler's "Text Mining in Practice With R" book.

Error function name headline.cleanv (page 188) #5

Closed ingetic closed 6 years ago

ingetic commented 6 years ago

On page 188,

clean.train<-headline.cleanv(train.headlines$headline)

must be

clean.train<-headline.clean(train.headlines$headline)

kwartler commented 6 years ago

I will take a look and have something in a day or so.

kwartler commented 6 years ago

The following code works. Please make sure the function is called headline.clean, not headline.cleanv, which is a typo. Also note that I added x<-gsub("[^[:graph:]]", " ",x) above the tolower() call, because the data set contains some special characters.

For example, the error invalid input 'Pokémon Go is getting a buddy system' in 'utf8towcs' can be avoided with the gsub() cleaning step.

# Libs
require(RCurl)
require(tm)
require(caret) # provides createDataPartition(), used below

# Options
options(stringsAsFactors = F)

# Data
headlines<-read.csv(text=getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/all_3k_headlines.csv"))

train<-createDataPartition(headlines$y,p=0.5,list=F)
train.headlines<-headlines[train,]
test.headlines<-headlines[-train,]

# Custom Function
headline.clean<-function(x){
  x<-gsub("[^[:graph:]]", " ",x) 
  x<-tolower(x)
  x<-removeWords(x,stopwords('en'))
  x<-removePunctuation(x)
  x<-stripWhitespace(x)
  return(x)
}

clean.train<-headline.clean(train.headlines$headline)
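To see what the gsub() step does on its own, here is a minimal sketch in base R. The sample string is made up for illustration and is not taken from the data set. The pattern replaces every character outside the [:graph:] class (for instance tabs, newlines and spaces) with a single space. How accented or mis-encoded characters are classified can depend on the locale and on the string's declared encoding, so this sketch sticks to control characters.

```r
# Hypothetical sample input (not from all_3k_headlines.csv)
x <- "breaking\tnews:\nstory"

# Replace anything that is not a visible (graph) character with a space
cleaned <- gsub("[^[:graph:]]", " ", x)
print(cleaned)
```

The tab and the newline each become one space, while letters and punctuation are left untouched for the later steps of headline.clean().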

Here is a snippet to review the results:

> head(clean.train)
[1] "mom sentenced 6 years prison putting feces sick sons iv"               
[2] " shocking jerry springer episode ever"                                 
[3] " man spots reckless driver stops m shocked omg"                        
[4] "alert diet pepsi removes aspartame replaces equally dangerous chemical"
[5] "13 trump supporters think joking second amendment"                     
[6] "want stop gun violence end war drugs"
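
Items [2] and [3] in the output above start with a space. One plausible reading, assuming the original headlines began with a stop word, is that removeWords() deleted the word but left the space after it, and stripWhitespace() collapses repeated whitespace rather than trimming the ends. A minimal sketch of an optional extra step, using base R's trimws() on two strings copied from the output (this step is not part of the book's function):

```r
# Two strings copied from the head(clean.train) output above
x <- c(" shocking jerry springer episode ever",
       "want stop gun violence end war drugs")

# trimws() is base R; it removes leading and trailing whitespace
trimmed <- trimws(x)
print(trimmed)
```

If the leading spaces matter for later steps, a call like x<-trimws(x) could be placed just before return(x) in headline.clean().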