jakevdp / PythonDataScienceHandbook

Python Data Science Handbook: full text in Jupyter Notebooks
http://jakevdp.github.io/PythonDataScienceHandbook
MIT License

make_pipeline vs. atomic #242

Open jvenepal opened 4 years ago

jvenepal commented 4 years ago

Hello Jake,

In chapter 05.05-Naive-Bayes, there is the following set of commands:

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)

They work fine. But when I try to split them into individual commands, I run into an error with model.predict():

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
trainData = vec.fit_transform(train.data)
testData = vec.fit_transform(test.data)
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(trainData, train.target)
testLables = model.predict(testData)

model.predict(testData) errors out. The error is:
ValueError: dimension mismatch

Do you know what I am doing wrong? FYI, here are the lengths of my train/test data:

In[12]: print(len(train.data), len(train.target), len(test.data), len(test.target))
Out[12]: 4528 4528 3012 3012

If the difference in length between train.data and test.data is the cause, I wonder why make_pipeline didn't run into the same problem.
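
A note on the likely cause, offered as a sketch and not a confirmed diagnosis: the number of documents in the train and test sets does not need to match, but the number of feature columns does. Calling fit_transform on the test text makes the vectorizer learn a new vocabulary, so the test matrix probably ends up with a different width than the one the classifier was trained on. A pipeline's predict() only calls transform on its intermediate steps, which would explain why make_pipeline is unaffected. The example below uses a tiny made-up corpus (train_docs, train_target, and test_docs are hypothetical stand-ins for train.data, train.target, and test.data) so it runs without downloading the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train.data, train.target, test.data
train_docs = ["the rocket reached orbit", "the church held a service",
              "nasa launched a satellite", "the priest read a sermon"]
train_target = [0, 1, 0, 1]
test_docs = ["a satellite in orbit", "a sermon at the church on sunday"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learn vocabulary and IDF weights from training text
X_test = vec.transform(test_docs)        # reuse the learned vocabulary; no refitting

# For comparison: refitting on the test text builds a different vocabulary,
# so the resulting matrix has a different number of columns.
X_test_refit = TfidfVectorizer().fit_transform(test_docs)

model = MultinomialNB()
model.fit(X_train, train_target)
manual_labels = model.predict(X_test)

# The pipeline performs the same fit_transform / transform split internally.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_docs, train_target)
pipeline_labels = pipeline.predict(test_docs)

print("train columns:", X_train.shape[1])
print("test columns (transform):", X_test.shape[1])
print("test columns (fit_transform):", X_test_refit.shape[1])
print("manual and pipeline agree:", list(manual_labels) == list(pipeline_labels))
```

Applied to the snippet in the question, this would mean replacing vec.fit_transform(test.data) with vec.transform(test.data) and leaving the other lines unchanged.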