ingetic closed this 6 years ago
I will take a look and have something in a day or so.
The script below works. Please make sure the call is to headline.clean and not headline.cleanv, which is a typo. Also note that I added x<-gsub("[^[:graph:]]", " ",x) above the tolower() call, because the data set contains some special characters.
For example, the error: invalid input 'Pokémon Go is getting a buddy system' in 'utf8towcs'
can be avoided with the gsub() cleaning step.
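As a quick aside, the gsub() step can be tried in isolation. This is a minimal base-R sketch on a made-up string (the variable names x and cleaned are only for illustration). It shows that anything outside the [:graph:] class, such as tabs and newlines, is replaced by a space. How non-ASCII characters like the é in Pokémon are classified depends on the locale and the encoding of the input, so the sketch deliberately sticks to ASCII.

```r
# Hypothetical input: ASCII text containing a tab and a newline
x <- "Tab\there\nnewline there"

# Replace every character that is NOT in the [:graph:] class
# (visible characters: letters, digits, punctuation) with a space
cleaned <- gsub("[^[:graph:]]", " ", x)

print(cleaned)
```

Visible characters pass through unchanged, so the later tolower() and removePunctuation() steps still see the original words.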
# Libs
require(RCurl)
require(tm)
require(caret) # provides createDataPartition()
# Options
options(stringsAsFactors = F)
# Data
headlines<-read.csv(text=getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/all_3k_headlines.csv"))
train<-createDataPartition(headlines$y,p=0.5,list=F)
train.headlines<-headlines[train,]
test.headlines<-headlines[-train,]
# Custom Function
headline.clean<-function(x){
  x<-gsub("[^[:graph:]]", " ",x) # replace non-printing/special characters with a space
  x<-tolower(x)
  x<-removeWords(x,stopwords('en'))
  x<-removePunctuation(x)
  x<-stripWhitespace(x)
  return(x)
}
clean.train<-headline.clean(train.headlines$headline)
Here is a snippet to review the results:
> head(clean.train)
[1] "mom sentenced 6 years prison putting feces sick sons iv"
[2] " shocking jerry springer episode ever"
[3] " man spots reckless driver stops m shocked omg"
[4] "alert diet pepsi removes aspartame replaces equally dangerous chemical"
[5] "13 trump supporters think joking second amendment"
[6] "want stop gun violence end war drugs"
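Entries [2] and [3] above start with a space, presumably because a stop word was removed from the start of the headline and stripWhitespace() collapses runs of whitespace to a single space rather than dropping them. If that matters downstream, one option (an optional extra, not part of the original function) is base R's trimws(). A minimal sketch, with the two strings copied from the output above:

```r
# Two cleaned headlines copied from the output above (note the leading space)
sample.clean <- c(" shocking jerry springer episode ever",
                  "want stop gun violence end war drugs")

# trimws() drops leading and trailing whitespace; interior spaces are kept
trimmed <- trimws(sample.clean)

print(trimmed)
```

The same call could be added as a last step inside headline.clean, just before return(x).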
On page 188,
clean.train<-headline.cleanv(train.headlines$headline)
must be
clean.train<-headline.clean(train.headlines$headline)