rasbt / machine-learning-book

Code Repository for Machine Learning with PyTorch and Scikit-Learn
https://sebastianraschka.com/books/#machine-learning-with-pytorch-and-scikit-learn
MIT License

Ch8: improving the code for preprocessing the movie dataset into a more convenient format #74

Closed mzakariaCERN closed 2 years ago

mzakariaCERN commented 2 years ago

When running the code in a Jupyter notebook, there were 2 issues:

  1. the progress bar didn't show (this is fixed by setting the stream to 2 in ProgBar)
  2. you get a warning that "append" is deprecated and that we should transition to "concat"

The code below fixes both issues. Let me know if you want me to make a pull request :)


import pyprind
import pandas as pd
import os
import sys

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
# pbar = pyprind.ProgBar(50000, stream=sys.stdout)
# stream=2 sends the progress bar to stderr
pbar = pyprind.ProgBar(50000, stream=2)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
            df = pd.concat([df,x], ignore_index=True)
            pbar.update()
#df.columns = ['review', 'sentiment']

rasbt commented 2 years ago

Thanks for the note. Just updated it: I added the stream as a comment because sys.stdout works for me. It may be an operating system or Jupyter version thing.

Regarding the deprecation warning, I also updated that one for the more recent pandas version.
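
For reference, a minimal sketch of one common alternative to calling concat once per file: collect the rows in a plain list and build the DataFrame a single time at the end. This is not necessarily what was committed to the repository. The load_reviews helper and the tiny demo directory are hypothetical stand-ins so the snippet runs without the real aclImdb data, and the progress bar is left out to keep it short.

```python
import os
import tempfile

import pandas as pd

LABELS = {'pos': 1, 'neg': 0}


def load_reviews(basepath):
    """Read every review under basepath/{test,train}/{pos,neg} into one DataFrame.

    Hypothetical helper for illustration; not part of the book's code.
    """
    rows = []
    for split in ('test', 'train'):
        for label in ('pos', 'neg'):
            path = os.path.join(basepath, split, label)
            for name in sorted(os.listdir(path)):
                with open(os.path.join(path, name),
                          'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), LABELS[label]))
    # Build the DataFrame once instead of concatenating per file
    return pd.DataFrame(rows, columns=['review', 'sentiment'])


# Hypothetical stand-in for the unzipped dataset: one tiny file per folder
demo_dir = tempfile.mkdtemp()
for split in ('test', 'train'):
    for label in ('pos', 'neg'):
        folder = os.path.join(demo_dir, split, label)
        os.makedirs(folder)
        with open(os.path.join(folder, '0_1.txt'),
                  'w', encoding='utf-8') as f:
            f.write(f'{split} {label} review')

demo_df = load_reviews(demo_dir)
print(demo_df.shape)
```

With the real dataset, the call would be load_reviews('aclImdb'). Appending to a list avoids copying the growing DataFrame on every iteration, which is what happens when concat is called inside the loop.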