quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

tokens not removing anything #91

Closed Monduiz closed 7 years ago

Monduiz commented 7 years ago

I am using tokens() and it appears to remove none of the things it should. I am using the current GitHub version of quanteda. The resulting table still has all the numbers, hyphens, etc. I am not sure what I am doing wrong. Thanks for any insights. I didn't include the file or the code for the bad words, but dfm didn't remove them either.

Data:

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./data/data.zip")
unzip("./data/data.zip", exdir = "./data")

library(tidyverse)
library(quanteda)
library(data.table)

blogs <- read_lines("./data/final/en_US/en_US.blogs.txt")
sampblogs <- sample(blogs, length(blogs) * 0.3)
corpus_data <- corpus(sampblogs)

unigrams <- tokens(corpus_data, what = "word",
                   remove_symbols = TRUE, remove_numbers = TRUE,
                   remove_punct = TRUE, remove_twitter = TRUE,
                   remove_url = TRUE, remove_separators = TRUE,
                   ngrams = 1, concatenator = " ")
# profanities is a character vector of bad words (its definition is not included here)
unigrams <- dfm(unigrams, remove = profanities)
unig_dt <- data.table(ngram = featnames(unigrams), count = colSums(unigrams), key = "ngram")

Monduiz commented 7 years ago

Wrong repo!