match.matrix (p.188) - Issue #2 - kwartler/text_mining

Looks like an erratum on my part. Thanks for sharing.

Here is a cleaner version that works.

RCurl is only needed to pull the data directly from GitHub. If you have the file locally, you can skip it.


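# Libs (library() stops with an error if a package is missing; require() only warns)
library(RCurl)  # only needed for getURL() below
library(tm)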

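Make sure text columns are read in as strings rather than factors. From R 4.0.0 onward this is already the default, so the option mainly matters on older versions.

# Options
options(stringsAsFactors = FALSE)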

Fetch the data from GitHub, or just point read.csv() at the file if you have it locally.

# Data
headlines<-read.csv(text=getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/all_3k_headlines.csv"))
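For anyone unsure what the text= argument is doing: read.csv() can parse CSV content held in a character string, which is what getURL() hands back. A minimal sketch with a made-up two-row sample (the rows are hypothetical; only the headline column name comes from the code in this post):

```r
# Hypothetical sample standing in for the text that getURL() would return
csv_text <- 'headline\n"Fed raises rates, markets react"\n"Local team wins title"'

# text= parses the string as if it were the contents of a file
sample.headlines <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(sample.headlines)
```

The quoted fields keep the embedded comma inside a single column.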

Here is a generic cleaning function, you can adjust as needed.

# Custom Function
headline.clean<-function(x){
  x<-tolower(x)
  x<-removeWords(x,stopwords('en'))
  x<-removePunctuation(x)
  x<-stripWhitespace(x)
  return(x)
}

When importing from raw git, sometimes there are special characters. You could add this function to the cleaning function if needed instead but I kept it separate for this simple demonstration.

# Some special characters can cause issues (could be part of the clean function)
headlines$headline<-gsub("[^[:graph:]]", " ",headlines$headline)
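To see what that gsub() call does, here is a small base R sketch on made-up strings. The [:graph:] class covers visible characters, so anything else (tabs, newlines, and ordinary spaces too) is swapped for a single space:

```r
# Hypothetical headlines containing a tab and a trailing newline
raw <- c("Fed\traises rates\n", "Markets rally")

# Replace every non-visible character with a space
cleaned <- gsub("[^[:graph:]]", " ", raw)
cleaned
```

Any doubled or trailing spaces this leaves behind are collapsed later by stripWhitespace() in the cleaning function.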

Apply the function to clean all the data. In a real-world scenario you would run this on any new text as it comes in, but here I apply it to the entire corpus before partitioning.

# Apply cleaning function
clean.train<-headline.clean(headlines$headline)

I used sample() but you can use any partitioning scheme, e.g. from caret like in the book.

# Quick Partitioning
train<-sample(1:nrow(headlines),2500, replace=F)
train.headlines<-clean.train[train]
test.headlines<-clean.train[-train]
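If you want the same split every time you rerun the script, one option (not in the code above) is to set a seed first. A minimal sketch on stand-in data; the seed value and the docs vector are arbitrary placeholders:

```r
set.seed(1234)                     # any fixed seed makes the split repeatable
docs <- paste("headline", 1:10)    # hypothetical stand-in for the cleaned text

# Draw 7 row positions for training; negative indexing gives the rest
idx <- sample(seq_along(docs), 7, replace = FALSE)
train.docs <- docs[idx]
test.docs  <- docs[-idx]

length(train.docs)
length(test.docs)
```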

Remember to change your source; getSources() lists the available ones. Also, in newer versions of tm, readTabular() was deprecated in favor of the data frame source, DataframeSource().

# Make a VCorpus
train.corp<-VCorpus(VectorSource(train.headlines))

# Construct a DTM
train.dtm<-DocumentTermMatrix(train.corp)
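If your text sits in a data frame rather than a plain vector, a sketch of the data frame source route looks like this. It assumes a recent tm, where DataframeSource() expects the first two columns to be named doc_id and text; the two rows are made up:

```r
library(tm)

# Hypothetical data; doc_id and text must be the first two columns
df <- data.frame(doc_id = c("h1", "h2"),
                 text   = c("fed raises rates", "markets rally"),
                 stringsAsFactors = FALSE)

df.corp <- VCorpus(DataframeSource(df))
df.dtm  <- DocumentTermMatrix(df.corp)
dim(df.dtm)
```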

Here is the revised matchMatrix() function. This time I made it more straightforward, and it MUST have an original DTM to work from. The wgt parameter is a string naming the tm weighting function (e.g. 'weightTf' or 'weightTfIdf'). It defaults to term frequency but needs to match the weighting of the original DTM. The function takes a vector of new text, the original DTM, and the weight.


matchMatrix<-function(textVec,originalDTM,wgt='weightTf'){

  # One last cleaning to make sure it works (iconv is vectorized, so no sapply needed)
  newTxt <- iconv(as.character(textVec), to = "UTF-8", sub = "byte")

  # Make a Test Set Corpus
  newCorpus <- VCorpus(VectorSource(newTxt))

  # Make the Test Set DTM (control needs a named weighting function, not a bare string)
  ctrl <- list(weighting = match.fun(wgt))
  mat <- DocumentTermMatrix(newCorpus, control = ctrl)

  # Find differing terms
  emptyTerms<-setdiff(colnames(originalDTM) ,colnames(mat))

  # Check wgt (tm stores the acronym "tf-idf" as the second element of the attribute)
  if (attr(originalDTM, "weighting")[2] == "tf-idf"){
    weight <- 0.000000001
  } else {
    weight <- 0
  }

  # Construct empty cols
  emptyMat <- matrix(weight, nrow = nrow(mat), ncol = length(emptyTerms))

  # Add names
  colnames(emptyMat) <- emptyTerms
  rownames(emptyMat) <- rownames(mat)

  # Find common terms
  commonTerms<-colnames(mat)[colnames(mat) %in% colnames(originalDTM)]

  # Append the original data
  joinDTM<-cbind(emptyMat,mat[,commonTerms])
  # weighting expects a tm weight function, so look it up from its name
  joinDTM<-as.DocumentTermMatrix(joinDTM,weighting = match.fun(wgt))

  # Re-order to match the column order of the original DTM exactly
  joinDTM<-joinDTM[,colnames(originalDTM)]

  # Response
  return(joinDTM)
}
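Since wgt has to agree with the original DTM, it can help to look at how tm labels the weighting on a matrix. A small sketch on a made-up two-document corpus, assuming tm is installed:

```r
library(tm)

# Hypothetical mini corpus
corp <- VCorpus(VectorSource(c("fed raises rates",
                               "markets rally rates steady")))

dtm.tf    <- DocumentTermMatrix(corp)  # default weighting is term frequency
dtm.tfidf <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))

# The attribute is a length-2 character vector: full name, then acronym
attr(dtm.tf, "weighting")
attr(dtm.tfidf, "weighting")
```

The second element is the acronym that the weighting check inside the function looks at.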

Here you apply the function with the needed info.

testDTM<-matchMatrix(textVec=test.headlines, originalDTM = train.dtm,wgt='weightTf')

You can check the column names and dimensions below.

# Check
head(train.dtm$dimnames$Terms)
head(testDTM$dimnames$Terms)

dim(train.dtm)
dim(testDTM)

In this example the new DTM has 500 documents and the original has 2500. The number of terms and the column names themselves should be identical. This way you can prepare a new set of text for modeling and analysis.
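If the matching logic is hard to follow on DTM objects, here is the same idea on plain base R matrices. This is only an illustrative sketch with made-up counts and names, not the function itself: pad the terms the new text lacks, drop the terms the training set never saw, then put the columns in the training order.

```r
# Hypothetical term counts: rows are documents, columns are terms
train.m <- matrix(c(1, 0, 2,
                    0, 1, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("d1", "d2"), c("fed", "markets", "rates")))
new.m <- matrix(c(1, 3,
                  0, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("n1", "n2"), c("rates", "stocks")))

# Terms in the training matrix but not in the new text get zero-filled columns
missing.terms <- setdiff(colnames(train.m), colnames(new.m))
pad <- matrix(0, nrow = nrow(new.m), ncol = length(missing.terms),
              dimnames = list(rownames(new.m), missing.terms))

# Terms only in the new text ("stocks") are dropped; shared terms are kept
common.terms <- intersect(colnames(new.m), colnames(train.m))
aligned <- cbind(pad, new.m[, common.terms, drop = FALSE])

# Put the columns in the training order
aligned <- aligned[, colnames(train.m), drop = FALSE]
aligned
```

The result has one row per new document and exactly the training columns, which is what a model fit on the training matrix expects.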
