koalaverse / homlr

Supplementary material for Hands-On Machine Learning with R, an applied book covering the fundamentals of machine learning with R.
https://koalaverse.github.io/homlr
Creative Commons Attribution Share Alike 4.0 International

Code for Chapter 8 not working #50

Open RaymondBalise opened 3 years ago

RaymondBalise commented 3 years ago

I attempted to use the read_mnist() function from dslabs and it returned this error:

Error in readBin(conn, "integer", n = prod(dim), size = 1, signed = FALSE) : 
  cannot read from connection
In addition: Warning message:
In readBin(conn, "integer", n = prod(dim), size = 1, signed = FALSE) :
  URL 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz': Timeout of 60 seconds was reached

It looks like http://yann.lecun.com/exdb/mnist/ is no longer live, but with a little help from the Brave browser I found an old snapshot of the site on the Wayback Machine and downloaded the files.
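
If you want to script that download step, here is a minimal sketch. The Wayback Machine snapshot URL is illustrative (substitute whichever capture you find); the four file names are the standard MNIST IDX archives.

# Sketch only: point base_url at whichever Wayback Machine capture
# of yann.lecun.com/exdb/mnist/ you locate
base_url <- "https://web.archive.org/web/2021/http://yann.lecun.com/exdb/mnist/"
files <- c("train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
           "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz")
for (fn in files) {
  download.file(paste0(base_url, fn), destfile = fn, mode = "wb")
}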

I modified the function to read the data from the local copies:

read_mnist_local <- function() {
    mnist <- list(train = list(images = c(), labels = c()),
                  test = list(images = c(), labels = c()))
    for (ttt in c("train", "t10k")) {
        # Read the image file from a local copy instead of the dead URL
        fn <- paste0(ttt, "-images-idx3-ubyte.gz")
        conn <- gzcon(file(fn, "rb"))
        # The IDX magic number encodes the data type and dimension count
        magic <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        typ <- bitwAnd(bitwShiftR(magic, 8), 255)
        ndm <- bitwAnd(magic, 255)
        dims <- readBin(conn, "integer", n = ndm, size = 4, endian = "big")
        data <- readBin(conn, "integer", n = prod(dims), size = 1,
                        signed = FALSE)
        # dslabs labels the t10k split "test"
        tt <- if (ttt == "t10k") "test" else ttt
        # One image per row: dims[1] images of dims[2] * dims[3] pixels
        mnist[[tt]][["images"]] <- matrix(data, nrow = dims[1], byrow = TRUE)
        close(conn)
        # Read the matching label file, also from a local copy
        fn <- paste0(ttt, "-labels-idx1-ubyte.gz")
        conn <- gzcon(file(fn, "rb"))
        magic <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        nlb <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        mnist[[tt]][["labels"]] <- readBin(conn, "integer", n = nlb,
                                           size = 1, signed = FALSE)
        close(conn)
    }
    mnist
}

# Import the MNIST data (train and test) from the local files
# mnist <- dslabs::read_mnist()
mnist <- read_mnist_local()
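
A quick sanity check (my addition; the expected counts are just the standard MNIST split sizes) confirms the result matches what dslabs::read_mnist() used to return:

# The standard MNIST split: 60,000 training and 10,000 test images,
# each flattened to 784 (28 x 28) pixels
stopifnot(
  all(dim(mnist$train$images) == c(60000, 784)),
  length(mnist$train$labels) == 60000,
  all(dim(mnist$test$images) == c(10000, 784)),
  length(mnist$test$labels) == 10000
)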

… and all is well. I don't know what the proper fix is (other than hosting the files somewhere), but I figured I should share this in the hope that it helps others.

bradleyboehmke commented 3 years ago

@RaymondBalise, here is another, slightly simpler approach that uses the MNIST data set provided by Keras. We were going to use this data set initially but decided against it, since Keras supplies the images as a 3D array rather than the 2D format provided by dslabs::read_mnist().

# Import MNIST data from Keras. This will import the data
# as a 3D array
mnist <- keras::dataset_mnist()

# Get our feature dimensions
mnist_train_dim <- dim(mnist$train$x)
train_nobs <- mnist_train_dim[1]
train_nfeat <- mnist_train_dim[2]*mnist_train_dim[3]

# Identify our sampled index
set.seed(123)
index <- sample(train_nobs, size = 10000)

# Convert features to 2D array, then to a dataframe
mnist_x_2d <- array(mnist$train$x, dim = c(train_nobs, train_nfeat))
mnist_x <- data.frame(mnist_x_2d)[index, ]

# Extract the response and convert it to a factor
mnist_y <- factor(mnist$train$y)[index]
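
If later code expects the list structure that dslabs::read_mnist() returns, a minimal sketch of rebuilding that layout from the Keras arrays follows (my addition, reusing the 2D reshape above; note that R's column-major reshape orders pixels differently than dslabs, which is harmless for model fitting but matters if you plot a digit from a row).

# Sketch: mimic the dslabs::read_mnist() list layout from the Keras arrays
test_nobs <- dim(mnist$test$x)[1]
mnist_dslabs_style <- list(
  train = list(
    images = array(mnist$train$x, dim = c(train_nobs, train_nfeat)),
    labels = mnist$train$y
  ),
  test = list(
    images = array(mnist$test$x, dim = c(test_nobs, train_nfeat)),
    labels = mnist$test$y
  )
)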