reiinakano / xcessiv

A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling in Python.
http://xcessiv.readthedocs.io
Apache License 2.0

Issues with TfidfVectorizer #44

Closed bbowler86 closed 7 years ago

bbowler86 commented 7 years ago

Hey, great tool.

I have a problem, though, when trying to use a TfidfVectorizer for text classification. When I create a Single Base Learner, I get the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

The type of the X variable is a numpy.ndarray, but if I don't convert X to an array, I get the error message:

TypeError: Singleton array array(<92820x194 sparse matrix of type '<class 'numpy.float64'>' with 92820 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.

I chose the preset learner setting scikit-learn Random Forest as the Base Learner Type.

import os
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def extract_main_dataset():
    # pandas data frame with the columns Classification, FeatureVector
    # ie:
    # 0, 'This is the feature vector'
    # 1, 'This is another feature vector' 
    # 2, 'This is yet another feature vector' 
    # 1, 'This is the last feature vector example' 
    with open('feature_vector.pik', 'rb') as rf:
        feature_vector = pickle.load(rf)

    y = np.array(feature_vector.Classification.values)
    title_rf_vectorizer = TfidfVectorizer(ngram_range=(2, 9),
                                          sublinear_tf=True,
                                          use_idf=True,
                                          strip_accents='ascii')

    title_rf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=8)
    X = title_rf_vectorizer.fit_transform(feature_vector["Classification"]).toarray()
    return X, y

bbowler86 commented 7 years ago

I solved my own problem. I just changed the second-to-last line from:

X = title_rf_vectorizer.fit_transform(feature_vector["Classification"]).toarray()

to

X = title_rf_vectorizer.fit_transform(feature_vector["FeatureVector"]).toarray()

Sorry about that. You can consider this issue closed.
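
For anyone who lands here with a similar error, here is a minimal sketch of the corrected extraction with an early shape check. It vectorizes the text column ("FeatureVector") and takes labels from "Classification", as in the fix above. The function name `build_dataset`, its parameters, and the toy data frame are hypothetical stand-ins for the pickled data, not part of xcessiv.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def build_dataset(frame, text_column="FeatureVector", label_column="Classification"):
    """Return (X, y) as NumPy arrays built from a data frame (hypothetical helper)."""
    # Labels come from the label column.
    y = np.asarray(frame[label_column].values)

    # Features come from the text column, using the vectorizer settings from the issue.
    vectorizer = TfidfVectorizer(ngram_range=(2, 9),
                                 sublinear_tf=True,
                                 use_idf=True,
                                 strip_accents='ascii')
    X = vectorizer.fit_transform(frame[text_column]).toarray()

    # Fail early if the number of feature rows and labels disagree.
    if X.shape[0] != y.shape[0]:
        raise ValueError("X has %d rows but y has %d labels" % (X.shape[0], y.shape[0]))
    return X, y


# Toy data standing in for the pickled data frame described in the issue.
toy = pd.DataFrame({
    "Classification": [0, 1, 2, 1],
    "FeatureVector": [
        "This is the feature vector",
        "This is another feature vector",
        "This is yet another feature vector",
        "This is the last feature vector example",
    ],
})

X, y = build_dataset(toy)
print(type(X).__name__, X.shape[0], y.shape)
```

Passing the column names as arguments makes it harder to mix up the label and text columns silently, which is the mistake that was fixed above.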