rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource
MIT License

typo in chapter 12 #58

Closed: novel-yet-trivial closed this issue 6 years ago

novel-yet-trivial commented 6 years ago

In this file, the code loads the names as

labels_path = os.path.join(path, 
                           '%s-labels-idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images-idx3-ubyte' % kind)

However, the file in the linked .gz archive is named with a period, not a hyphen, so it should be

labels_path = os.path.join(path, 
                           '%s-labels.idx1-ubyte' % kind)
images_path = os.path.join(path, 
                           '%s-images.idx3-ubyte' % kind)

https://www.reddit.com/r/learnpython/comments/6qc9t1/path_to_existing_file_in_root_folder_not_found_on/

rasbt commented 6 years ago

Thanks for the note, but I think the current spelling of the file is correct. For instance, if you go to the original resource for this dataset, http://yann.lecun.com/exdb/mnist/, you see that the files are spelled as follows:

train-images-idx3-ubyte.gz:  training set images (9912422 bytes) 
train-labels-idx1-ubyte.gz:  training set labels (28881 bytes) 
t10k-images-idx3-ubyte.gz:   test set images (1648877 bytes) 
t10k-labels-idx1-ubyte.gz:   test set labels (4542 bytes)

Then, when I unzip the files, the file names stay the same except that the ".gz" suffix is removed, i.e.,

train-images-idx3-ubyte
train-labels-idx1-ubyte
t10k-images-idx3-ubyte
t10k-labels-idx1-ubyte

Did you maybe manually relabel the files by accident?

novel-yet-trivial commented 6 years ago

Hmmm. When I decompress with the gzip command line utility as per your instructions, then the file names are correct. However if I decompress with the Gnome Archive Manager (GUI), the file names have a period in them.
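For anyone who hits the same mismatch, here is a minimal sketch of a lookup that accepts either spelling. `find_mnist_file` and its `part` argument are hypothetical names for illustration, not part of the book's code, and the period variant is assumed to differ only in the separator before `idx1`/`idx3`, as described above.

```python
import os


def find_mnist_file(path, kind, part):
    """Return the path of an unpacked MNIST file under either spelling.

    `part` is 'labels-idx1' or 'images-idx3'. Hypothetical helper,
    not taken from the book's notebook.
    """
    candidates = [
        # Name produced by the gzip command line tool
        '%s-%s-ubyte' % (kind, part),
        # Name reported for some GUI archive tools (period instead of hyphen)
        '%s-%s-ubyte' % (kind, part.replace('-', '.')),
    ]
    for name in candidates:
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            return full_path
    raise FileNotFoundError(
        'none of %r found in %r' % (candidates, path))
```

With this, `labels_path = find_mnist_file(path, kind, 'labels-idx1')` would work whichever tool unpacked the archive.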

novel-yet-trivial commented 6 years ago

I suggest you forget the decompression step and just directly access the compressed files from python, thereby circumventing the filename problem:

import gzip
#...
def load_mnist(path, kind='train'):
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte.gz' % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', 
                                 lbpath.read(8))

rasbt commented 6 years ago

That's good to know; wouldn't have thought that certain tools might be renaming the files. Regarding the code example above, it doesn't work (I think I experimented a lot with loading it directly via gzip back then, but I couldn't get it to work). Will do some more experiments and upload a fixed version -- I like that idea, thanks!

rasbt commented 6 years ago

Turns out the following does the trick:

import os
import struct
import numpy as np
import gzip

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte.gz' % kind)

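    with gzip.open(labels_path, 'rb') as lbpath:
        # Skip the 8-byte header (magic number, number of items)
        lbpath.read(8)
        buffer = lbpath.read()
        labels = np.frombuffer(buffer, dtype=np.uint8)

    with gzip.open(images_path, 'rb') as imgpath:
        # Skip the 16-byte header (magic number, number of images, rows, columns)
        imgpath.read(16)
        buffer = imgpath.read()
        # One row per image, 28 * 28 = 784 pixel values each
        images = np.frombuffer(buffer, 
                               dtype=np.uint8).reshape(
            len(labels), 784)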

    return images, labels

Will add it to the code notebook shortly.
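A possible variant, in case it is useful: instead of skipping the headers, they can be parsed with `struct.unpack` and checked, so that a wrong or truncated file fails loudly. This is only a sketch. `read_idx_labels` and `read_idx_images` are hypothetical names, and the magic numbers 2049 (labels) and 2051 (images) are the ones listed in the file format description on the MNIST page.

```python
import gzip
import struct

import numpy as np


def read_idx_labels(labels_path):
    """Read a gzip-compressed IDX1 label file and validate its header."""
    with gzip.open(labels_path, 'rb') as f:
        # Header: magic number and item count, big-endian unsigned 32-bit
        magic, n = struct.unpack('>II', f.read(8))
        if magic != 2049:
            raise ValueError('unexpected magic number %d' % magic)
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    if len(labels) != n:
        raise ValueError('header says %d labels, found %d'
                         % (n, len(labels)))
    return labels


def read_idx_images(images_path):
    """Read a gzip-compressed IDX3 image file into an (n, rows*cols) array."""
    with gzip.open(images_path, 'rb') as f:
        # Header: magic number, image count, rows, columns
        magic, n, rows, cols = struct.unpack('>IIII', f.read(16))
        if magic != 2051:
            raise ValueError('unexpected magic number %d' % magic)
        # reshape raises ValueError if the payload size does not match
        images = np.frombuffer(f.read(), dtype=np.uint8)
        images = images.reshape(n, rows * cols)
    return images
```

Reading the shape from the header also avoids hard-coding 784, at the cost of a few extra lines.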

rasbt commented 6 years ago

Should be fixed now in the Ch12 notebook!