johnmyleswhite / ML_for_Hackers

Code accompanying the book "Machine Learning for Hackers"
http://shop.oreilly.com/product/0636920018483.do

Chapter 3 - Error executing get.msg() #4

Open erwtokritos opened 12 years ago

erwtokritos commented 12 years ago

Hello guys,

Great book :-) Right now I am in the 3rd chapter (e-mail classification). I am executing the R commands one by one and I am having a problem getting the list of spam documents (page 81). The command is: all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))

and the error I get is: Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

Any clue? Thank you very much

cesarblum commented 12 years ago

I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483

johnmyleswhite commented 12 years ago

Sorry about the lag on this, all. We'll look into it more this weekend and report back.

drewconway commented 12 years ago

I am having trouble replicating the error. The current version of the code in the repository reads as follows:

# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))

It runs fine for me on OS X and Ubuntu. So perhaps the issue is the use of paste rather than file.path, or an operating system issue. The paste function does appear in the text of the book, which should be fixed in future editions.

cesarblum commented 12 years ago

I still get the errors when using file.path. These are the errors I get:

    Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)
    In addition: Warning messages:
    1: In readLines(con) : invalid input found on input connection 'data/spam//00006.5ab5620d3d7c6c0db76234556a16f6c1'
    2: In readLines(con) : invalid input found on input connection 'data/spam//00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
    3: In readLines(con) : invalid input found on input connection 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
    4: In readLines(con) : incomplete final line found on 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
    5: In readLines(con) : invalid input found on input connection 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
    6: In readLines(con) : incomplete final line found on 'data/spam//00035.7ce3307b56dd90453027a6630179282e'

The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book.

I'm working on OS X with R 2.15.0.
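
One way to narrow this down is to check, file by file, whether readLines returns a blank line at all. get.msg indexes from which(text == "")[1] + 1, and that value is NA when no blank line was read, which may be what happens when reading stops early at invalid input. A minimal sketch, assuming a hypothetical helper named find.bad.files (it is not part of the book's code) that takes the encoding as a parameter:

```r
# Hypothetical helper (not from the book or the repository): returns the names
# of the files in `dir.path` for which get.msg's header/body split cannot work.
find.bad.files <- function(dir.path, encoding = "native.enc") {
  paths <- file.path(dir.path, dir(dir.path))
  paths <- paths[!file.info(paths)$isdir]  # skip sub-directories
  is.bad <- vapply(paths, function(p) {
    con <- file(p, open = "rt", encoding = encoding)
    on.exit(close(con))
    text <- suppressWarnings(readLines(con))
    first.blank <- which(text == "")[1]
    # Bad if there is no blank line, or nothing follows the first blank line.
    is.na(first.blank) || first.blank >= length(text)
  }, logical(1))
  basename(paths[is.bad])
}
```

Calling find.bad.files("data/spam/", encoding = "latin1") and again with the default encoding would show whether the set of failing files depends on the encoding argument.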

johnmyleswhite commented 12 years ago

What operating system and version of R are you using?

-- John


cesarblum commented 12 years ago

I'm using OS X Lion with R 2.15.0 (installed from MacPorts).

hanfeisun commented 12 years ago

I also have this error.

foxet commented 12 years ago

That's because of the data files, not the code. Open and check data/spam/000*, which is not an email but a file list.

quasiben commented 12 years ago

@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter to also exclude files which begin with '0000.':

spam.docs <- spam.docs[which( !str_detect(spam.docs,"^0000.") & spam.docs != 'cmds' )]
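
One caveat with the pattern above: in a regular expression an unescaped dot matches any character, so "^0000." also matches names such as 00001.… through 00009.…, which may drop real messages. A minimal base-R sketch with the dot escaped, using shortened, made-up file names purely for illustration:

```r
# Hypothetical, shortened file names for illustration only.
spam.docs <- c("cmds", "0000.7b1b73cf", "00001.7848dde1", "00031.a78bb452")

# Unescaped dot: "00001.7848dde1" matches too, because "." matches the "1".
matches.unescaped <- grepl("^0000.", spam.docs)

# Escaped dot: only names that literally start with "0000." are excluded.
keep <- spam.docs != "cmds" & !grepl("^0000\\.", spam.docs)
spam.docs <- spam.docs[keep]
```

This uses grepl from base R instead of str_detect, so it does not need the stringr package.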

adayone commented 11 years ago

It's an encoding problem. readLines should work whether or not the file is an email. Using con <- file(path, open="rt") instead of con <- file(path, open="rt", encoding="utf-8") will work.

ceekr commented 11 years ago

The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function:

    get.msg <- function(path) {
      con <- file(path, open="rt", encoding="native.enc")
      text <- readLines(con)
      # The message always begins after the first full line break
      msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
      close(con)
      return(paste(msg, collapse="\n"))
    }

I have changed encoding to "utf-8", "latin1" and nothing happens. Same error.

Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

I also applied the suggestions by foxet and quasiben. The fact is my spam folder does not have this file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all.

What am I missing, folks?

adayone commented 11 years ago

Do not set the "encoding" parameter; just use

con <- file(path, open="rt")


ceekr commented 11 years ago

Alright, I did this now: (Removed the encoding parameter)

    get.msg <- function(path) {
      con <- file(path, open="rt")
      text <- readLines(con)
      # The message always begins after the first full line break
      msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
      close(con)
      return(paste(text, collapse="\n"))
    }

Ran the whole bunch again. The outcome:

            Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole bunch:

           spam.path <- "datasets/spam/"
           easyham.path <- "datasets/easy_ham/"
           hardham.path <- "datasets/hard_ham/"

           get.msg <- function(path) {
                    con <- file(path, open="rt")
                    text <- readLines(con)
                    # The message always begins after the first full line break
                    msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
                    close(con)
                    return(paste(text, collapse="\n"))
            }

            spam.docs <- dir(spam.path)
            spam.docs <- spam.docs[which(spam.docs!="cmds")]
            spam.docs <- paste(spam.path, spam.docs, sep="")
            all.spam.msgs <- sapply(spam.docs, get.msg)  # This is the line that throws the above error

adayone commented 11 years ago

You should check whether length(text) > 1.
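
A minimal sketch of such a check, assuming a hypothetical variant named get.msg.safe; the name and the choice to return NA for unusable files are assumptions, not the book's code. It also covers the case where a file has no blank line at all, in which which(text == "")[1] is NA and seq() fails:

```r
# Hypothetical guarded variant of get.msg (not the book's code).
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  text <- readLines(con)
  close(con)  # close right away, before the indexing below can fail
  first.blank <- which(text == "")[1]
  # No blank line, or nothing after it: there is no message body to extract.
  if (is.na(first.blank) || first.blank >= length(text)) {
    return(NA_character_)
  }
  # The message begins after the first blank line.
  msg <- text[seq(first.blank + 1, length(text), 1)]
  paste(msg, collapse = "\n")
}
```

With this variant, files that are not emails come back as NA and can be dropped afterwards, for example with all.spam[!is.na(all.spam)].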


ceekr commented 11 years ago

Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:

             Warning message: closing unused connection 3 (datasets/spam/desktop.ini) 

I presume this is because the underlying code failed to close all the file connections? It does not happen all the time, though.
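
The warning is consistent with get.msg stopping at the seq() call before it reaches close(con), which leaves the connection open until R cleans it up later. The path in the message also suggests that desktop.ini, which is not an email, is being read from the spam folder. A minimal sketch of registering the close with on.exit so that it runs even when a later line fails; read.then.fail is a hypothetical function used only to simulate the failure:

```r
# Hypothetical function for illustration: opens a connection, then fails.
read.then.fail <- function(path) {
  con <- file(path, open = "rt")
  on.exit(close(con))  # runs on normal return and on error
  text <- readLines(con, warn = FALSE)
  stop("simulated failure after reading ", length(text), " line(s)")
}

tmp <- tempfile()
writeLines("not an email", tmp)

before <- nrow(showConnections())
result <- try(read.then.fail(tmp), silent = TRUE)
after <- nrow(showConnections())
# `after` equals `before`: the failed call left no connection open.
```

The same on.exit(close(con)) line could be placed right after the file() call in get.msg.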

jamesbconner commented 11 years ago

Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information.

Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2

almartin82 commented 11 years ago

Can confirm that I am seeing a similar issue to others above: `Error in seq.default(which(text == "")[1] + 1, length(text), 1) : wrong sign in 'by' argument`

Solved by dropping the encoding on con in get.msg. R 3.0.0 on Windows 7, 64 bit.

y1239051 commented 11 years ago

I have a problem with the following code:

    get.msg <- function(path) {
      con <- file(path, open = "rt", encoding = "latin1")
      text <- readLines(con)
      # The message always begins after the first full line break
      msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
      close(con)
      return(paste(msg, collapse = "\n"))
    }

What can I do? Please, somebody help me!

y1239051 commented 11 years ago

I want to add that if I do not use the encoding parameter, it works, but when I enter spam.tdm <- get.tdm(all.spam)

the error output is as follows: Error in tolower(txt) : invalid multibyte string 1

Does anyone have the same situation? Please help me!

Thanks

Donnie-Liu commented 10 years ago

I have the same issue as y1239051. My system is Win7, 32-bit, R version 3.0.2, RStudio version 0.98.490. However, it seems OK on my old XP system. Also, the command "spam.tdm <- get.tdm(all.spam)" took so long that I aborted it. I will try again.

Donnie-Liu commented 10 years ago

Oops! I tried the XP system again and got the same error!

Donnie-Liu commented 10 years ago

I found a solution following these steps:

  1. Remove "encoding='latin1'" in function get.msg()
  2. In function get.tdm(), add doc.corpus <- tm_map(doc.corpus, function(x) iconv(x, to='UTF-8', sub='byte')) before doc.dtm <- TermDocumentMatrix(doc.corpus, control)

This solution made the program run normally, but the results are a little different.

    head(spam.df[with(spam.df,order(-occurrence)),])
            term frequency     density occurrence
    7471   email       813 0.005859586      0.566
    18382 please       425 0.003063129      0.508
    14339   list       409 0.002947811      0.444
    26848   will       828 0.005967697      0.422
    2831    body       379 0.002731591      0.408
    9124    free       539 0.003884769      0.390
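
The same iconv idea can be tried on the character vector before it is handed to Corpus(), which avoids calling tm_map altogether. A minimal sketch in base R; passing from = "UTF-8" explicitly is an assumption made here so the behavior does not depend on the locale (the call in step 2 omits it), and the input strings are made up for illustration:

```r
# Illustrative input: the second string contains a byte (\xe9) that is not valid UTF-8.
docs <- c("plain ascii text", "caf\xe9 menu")

# Replace bytes that are invalid in UTF-8 with an escaped form instead of failing.
clean <- iconv(docs, from = "UTF-8", to = "UTF-8", sub = "byte")

# tolower() no longer has an invalid multibyte string to complain about.
lowered <- tolower(clean)
```

In the chapter's code this would correspond to cleaning all.spam in the same way before calling get.tdm(all.spam).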

laocan commented 10 years ago

@y1239051 after I changed the function 'get.msg' to {... con <- file(path, open = "rt") ...} and deleted the wrongly encoded words (just one sentence) in the file "00136.faa39d8e816c70f23b4bb8758d8a74f0", the command: all.spam <- sapply(spam.docs,

jnjcc commented 10 years ago

For those of you who still have this problem, I'd suggest removing the "open" parameter from the file function. It worked for me on R 3.0.3, Win7 x64, and didn't break anything on R 3.1.1, Ubuntu 12.04.

okamipride commented 9 years ago

After I changed the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0' " at the end of the command. Does anyone know what's wrong here?

bluesilence commented 9 years ago

Hi Donnie @Donnie-Liu,

I tested your solution; however, your change to get.tdm causes this error:

Error: inherits(doc, "TextDocument") is not TRUE

Could you paste the full text of your get.tdm definition?

IbrahimZamit commented 8 years ago

Same thing here, @okamipride. What is the solution to this warning?
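
This warning generally means that the last line of the file is not terminated by a newline; readLines still returns that line, so it is usually harmless here. A minimal sketch with a temporary file created just for illustration, showing that warn = FALSE silences it:

```r
# Write a small file whose last line has no trailing newline.
tmp <- tempfile()
cat("Subject: test\n\nlast line without newline", file = tmp)

# Default behavior: readLines warns; here the warning message is captured instead.
with.warning <- tryCatch(readLines(tmp), warning = function(w) conditionMessage(w))

# warn = FALSE suppresses that warning and still returns every line.
lines <- readLines(tmp, warn = FALSE)
```

In get.msg the equivalent change would be text <- readLines(con, warn = FALSE).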

divyanshofficials commented 6 years ago

    library(tm)
    library(ggplot2)

    # defining paths
    spam.path <- "data/spam/"
    spam2.path <- "data/spam_2/"
    easyham.path <- "data/easy_ham/"
    easyham2.path <- "data/easy_ham_2/"
    hardham.path <- "data/hard_ham/"
    hardham2.path <- "data/hard_ham_2/"

    # creating get.msg function
    get.msg <- function(path) {
      con <- file(path, open="rt", encoding="native.enc")
      text <- readLines(con)
      # The message always begins after the first full line break
      msg <- text[seq(which(text=="")[1]+1, length(text), 1)]
      close(con)
      return(paste(msg, collapse="\n"))
    }

    # creating spam training dataset
    spam.docs <- dir(spam.path)
    spam.docs <- spam.docs[which(spam.docs!="cmds")]
    all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep="")))

    get.tdm <- function(doc.vec) {
      doc.corpus <- Corpus(VectorSource(doc.vec))
      control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=2)
      doc.dtm <- TermDocumentMatrix(doc.corpus, control)
      return(doc.dtm)
    }
    spam.tdm <- get.tdm(all.spam)

    spam.matrix <- as.matrix(spam.tdm)
    spam.counts <- rowSums(spam.matrix)
    spam.df <- data.frame(cbind(names(spam.counts), as.numeric(spam.counts)), stringsAsFactors=FALSE)
    names(spam.df) <- c("term", "frequency")
    spam.df$frequency <- as.numeric(spam.df$frequency)
    spam.occurrence <- sapply(1:nrow(spam.matrix), function(i) {
      length(which(spam.matrix[i,] > 0)) / ncol(spam.matrix)
    })
    spam.density <- spam.df$frequency / sum(spam.df$frequency)
    spam.df <- transform(spam.df, density=spam.density, occurrence=spam.occurrence)

    # creating easyham.df (get.tdm is the function defined above)
    easyham.docs <- dir(easyham.path)
    easyham.docs <- easyham.docs[which(easyham.docs!="cmds")]
    all.easyham <- sapply(easyham.docs, function(p) get.msg(paste(easyham.path, p, sep="")))[1:500]
    easyham.tdm <- get.tdm(all.easyham)

    easyham.matrix <- as.matrix(easyham.tdm)
    easyham.counts <- rowSums(easyham.matrix)
    easyham.df <- data.frame(cbind(names(easyham.counts), as.numeric(easyham.counts)), stringsAsFactors=FALSE)
    names(easyham.df) <- c("term", "frequency")
    easyham.df$frequency <- as.numeric(easyham.df$frequency)
    # Divide by the number of easy ham documents (the original post divided by ncol(spam.matrix))
    easyham.occurrence <- sapply(1:nrow(easyham.matrix), function(i) {
      length(which(easyham.matrix[i,] > 0)) / ncol(easyham.matrix)
    })
    easyham.density <- easyham.df$frequency / sum(easyham.df$frequency)
    easyham.df <- transform(easyham.df, density=easyham.density, occurrence=easyham.occurrence)

    # creating the classifier
    classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
      msg <- get.msg(path)
      msg.tdm <- get.tdm(msg)
      msg.freq <- rowSums(as.matrix(msg.tdm))
      # Find intersections of words
      msg.match <- intersect(names(msg.freq), training.df$term)
      if(length(msg.match) < 1) {
        return(prior * c^(length(msg.freq)))
      } else {
        match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
        return(prior * prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
      }
    }

    # Testing the classifier
    hardham.docs <- dir(hardham.path)
    hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
    hardham.spamtest <- sapply(hardham.docs, function(p) classify.email(paste(hardham.path, p, sep=""), training.df=spam.df))
    hardham.hamtest <- sapply(hardham.docs, function(p) classify.email(paste(hardham.path, p, sep=""), training.df=easyham.df))
    hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
    summary(hardham.res)

Use this code for chapter 3. I wrote the code for easyham.df, which is not given in the book, so you can use this complete code, including the part that creates the easyham data. The encoding is changed from "latin1" to "native.enc". Also, a file in the spam folder is corrupted, which is causing the errors, so the better alternative is to delete that file and then run the code.

Delete this file: spam/00002.d94f1b97e48ed3b553b3508d116e6a09. Also, as written in the book, use only the first 500 sample mails from the easyham folder for better results.

I hope you find this solution genuine and good enough.