rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource
MIT License
12.24k stars 4.4k forks

AttributeError: Can't get attribute 'tokenizer_porter' on <module '__main__' (built-in)> #50

Closed BingKong1988 closed 7 years ago

BingKong1988 commented 7 years ago

Trying to run gs_lr_tfidf.fit(X_train, y_train), I got the AttributeError above.

Running in a Jupyter notebook on Python 3.5.

rasbt commented 7 years ago

Hi, was this an issue that occurred in the Chapter 08 code notebook (https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb) in the section "Training a logistic regression model for document classification"? I just tried to rerun the notebook but couldn't reproduce the problem. Could you maybe provide more information about how you ran the code?

The software version numbers the notebook was last executed with are:

CPython 3.5.2
IPython 5.1.0

numpy 1.11.1
pandas 0.18.1
matplotlib 1.5.1
sklearn 0.18
nltk 3.2.1

Now, I ran it with the latest versions and it also seems to be fine:

CPython 3.6.1
IPython 6.0.0

numpy 1.13.0
pandas 0.20.1
matplotlib 2.0.2
sklearn 0.18.1
nltk 3.2.4

Could you maybe try updating your packages, if it's not too much hassle?

BingKong1988 commented 7 years ago

Hi Sebastian, thanks for your reply. I have updated all packages and rerun the code, which is basically a copy of your notebook from the section "Training a logistic regression model for document classification". After running gs_lr_tfidf.fit(X_train, y_train), it gets stuck. Since I changed verbose from 1 to 10, I got the following message:

Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "C:\Users\Bing\Anaconda3\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "C:\Users\Bing\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bing\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "C:\Users\Bing\Anaconda3\lib\site-packages\sklearn\externals\joblib\pool.py", line 362, in get
    return recv()
  File "C:\Users\Bing\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'tokenizer_porter' on <module '__main__' (built-in)>

(The same traceback then repeats for each spawned worker process, SpawnPoolWorker-1 through SpawnPoolWorker-16, with the missing attribute alternating between 'tokenizer' and 'tokenizer_porter'.)

rasbt commented 7 years ago

Oh, I think I know what might be going on. Based on the scikit-learn mailing list, several people have had issues with multiprocessing on Windows in the past. So one thing you could try is setting n_jobs=1 instead of n_jobs=-1. Please let me know if this solves the problem; I will then add a note to the notebook to warn other readers as well.

PS: To make it finish faster, you may want to reduce the size of the parameter grid, for instance using the following:

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [None],
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l2'],
               'clf__C': [10.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [None],
               'vect__tokenizer': [tokenizer],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l2'],
               'clf__C': [10.0,]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1)
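(Editor's aside, not from the thread: a sketch of why the error happens and of a second common workaround. On Windows, multiprocessing uses the "spawn" start method, so worker processes re-import modules instead of inheriting the parent's memory; a function defined interactively in a notebook lives only in that session's '__main__' and cannot be re-imported, hence the pickling AttributeError. Moving the tokenizer into a real module on disk fixes that. The file name my_tokenizers.py below is a hypothetical example.)

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Source for a hypothetical helper module. A module-level function like
# this is pickled *by reference* (module name + function name), which is
# exactly what joblib/multiprocessing need to reconstruct it in a worker.
module_source = '''
def tokenizer(text):
    """Plain whitespace tokenizer, as in the chapter 8 notebook."""
    return text.split()
'''

# Write the module to a temporary directory and make it importable.
tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, "my_tokenizers.py").write_text(module_source)
sys.path.insert(0, tmpdir)

my_tokenizers = importlib.import_module("my_tokenizers")

# Round-trip the function through pickle, mimicking what happens when
# GridSearchCV ships the parameter grid to a spawned worker process.
restored = pickle.loads(pickle.dumps(my_tokenizers.tokenizer))
print(restored("runners like running"))  # ['runners', 'like', 'running']
```

The same round-trip fails with "Can't get attribute ... on <module '__main__'>" when the function is defined only inside a notebook cell, because the spawned worker has no file to re-import the name from.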
BingKong1988 commented 7 years ago

Now it is working perfectly. Thank you!

rasbt commented 7 years ago

Glad to hear that it works now! I added a note to the notebooks in case other people have the same problem.

akhilcj90 commented 1 year ago

I'm getting AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__'>. Why does this error occur?

Walkmao commented 1 year ago

Hello, your email has been received; I will reply to you as soon as possible.