Issues with TfidfVectorizer

Hey, great tool.

I have a problem, though, when I try to use a TfidfVectorizer for text classification. When I create a Single Base Learner, I get the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

At that point X is a numpy.ndarray. If I don't convert X to an array, I get this error instead:

TypeError: Singleton array array(<92820x194 sparse matrix of type '<class 'numpy.float64'>' with 92820 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.
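A minimal sketch of what I think is behind the second error, assuming the sparse matrix gets wrapped with np.array somewhere before its length is checked (I have not confirmed where that happens in the tool). The 3x3 matrix here is a made-up stand-in for the real TF-IDF output:

```python
import numpy as np
from scipy import sparse

# Made-up stand-in for the TF-IDF output: a small CSR sparse matrix.
X_sparse = sparse.csr_matrix(np.eye(3))

# Wrapping a SciPy sparse matrix with np.array does not densify it.
# NumPy stores the whole matrix as a single object, giving a 0-d array.
wrapped = np.array(X_sparse)
print(wrapped.shape, wrapped.dtype)

# Explicit conversion produces a regular 2-D dense array.
dense = X_sparse.toarray()
print(dense.shape)
```

The 0-d object array matches the "Singleton array array(<... sparse matrix ...>, dtype=object)" shown in the message, which is why I call .toarray() in the code further down.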

I chose the preset learner setting "scikit-learn Random Forest" as the Base Learner Type.

import os
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def extract_main_dataset():
    # pandas data frame with the columns Classification, FeatureVector
    # ie:
    # 0, 'This is the feature vector'
    # 1, 'This is another feature vector' 
    # 2, 'This is yet another feature vector' 
    # 1, 'This is the last feature vector example' 
    with open('feature_vector.pik', 'rb') as rf:
        feature_vector = pickle.load(rf)

    y = np.array(feature_vector.Classification.values)
    title_rf_vectorizer = TfidfVectorizer(ngram_range=(2, 9),
                                          sublinear_tf=True,
                                          use_idf=True,
                                          strip_accents='ascii')

    title_rf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=8)
    # Vectorize the text column (FeatureVector), not the label column
    X = title_rf_vectorizer.fit_transform(feature_vector["FeatureVector"]).toarray()
    return X, y
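Since the pickle file is not attached, here is a self-contained sketch of the same extraction using the four example rows from the comments. The toy DataFrame is made up for illustration, and it assumes the text lives in the FeatureVector column and the labels in Classification:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the pickled DataFrame (hypothetical data).
toy = pd.DataFrame({
    "Classification": [0, 1, 2, 1],
    "FeatureVector": [
        "This is the feature vector",
        "This is another feature vector",
        "This is yet another feature vector",
        "This is the last feature vector example",
    ],
})

vectorizer = TfidfVectorizer(ngram_range=(2, 9),
                             sublinear_tf=True,
                             use_idf=True,
                             strip_accents='ascii')

# Dense 2-D feature matrix: one row per document.
X = vectorizer.fit_transform(toy["FeatureVector"]).toarray()
# 1-D label array: one entry per document.
y = toy["Classification"].to_numpy()

print(X.shape, y.shape)
```

With this toy data X and y have the same number of rows, so the sketch only shows the shapes I expect going in; it does not reproduce the concatenation error itself.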

reiinakano / xcessiv

Issues with TfidfVectorizer #44