Closed ghost closed 6 years ago
Hi, The 'Type' column does not only contain 1, but the values 1, 2, 3, 5, 6, 7. It is the column containing the label each entry belongs to.
So in the code you'll have to replace 'Type' in y_col = 'Type' with the name of the column in your dataset that contains the class labels.
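For anyone hitting the same ValueError, a quick check of the label column before training can save some time. The following is only a minimal sketch: the toy dataframe, the label values "control" and "treated", and the helper name check_label_column are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Eta6_0": [0.1, 0.4, 0.3, 0.9],
    "Eta8.1": [1.2, 0.7, 0.5, 1.1],
})

# Assigning a constant gives every row the same label, i.e. a single class.
df["Type"] = "1"
single_class_count = df["Type"].nunique()
print(single_class_count)

# Hypothetical per-row labels; a classifier needs at least two distinct values.
df["Type"] = ["control", "treated", "control", "treated"]
class_counts = df["Type"].value_counts()
print(class_counts)


def check_label_column(frame, y_col):
    """Raise early if y_col is missing or holds fewer than two classes."""
    if y_col not in frame.columns:
        raise KeyError(f"label column {y_col!r} not found in dataframe")
    n_classes = frame[y_col].nunique()
    if n_classes < 2:
        raise ValueError(
            f"{y_col!r} has only {n_classes} class(es); need at least 2"
        )
    return n_classes


n = check_label_column(df, "Type")
```

Calling a check like this right before the train/test split turns both the "only one class" case and a missing label column into an immediate, readable error.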
Hi, thanks for the reply. I just realised where my problem was, it has to do with the way in which my class label column was generated. Thanks for looking out.
I came across your Jupyter notebook and was pleased to find solutions to a problem that had been giving me headaches, namely the classification of data from a dataframe whose columns hold numeric attributes. I have data that is similar to yours and I modified your code for my dataset, but it's not working. Your data has a column labelled "Type", which is just an array of ones.
Whenever I run your code on my dataset, I get the following error: ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: '1'
Do you know why this error comes up in my case when it doesn't in yours? I also tried the code from your webpage, which differs from the one here on GitHub in the following line:
website code: mask = np.random.rand(len(df)) &lt; ratio (an error comes up because lt is not defined anywhere in the code)
GitHub code: mask = np.random.rand(len(df)) < ratio
When I run the code that's given on your website and make the above change (removing &lt; and adding <), the error changes to KeyError: "Type"
Do you know how I can solve this? Thanks in advance for the help.
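The "lt is not defined" error is consistent with the comparison operator having been HTML-escaped on the web page, so that Python reads the line as a bitwise AND with a variable named lt. Assuming that is what happened, the standard library can undo the escaping in copied code. This is just a small sketch, and the escaped line is an assumed reconstruction of what the page showed:

```python
import html

# Line as it might appear when HTML-escaped text is copied from a web page.
escaped_line = "mask = np.random.rand(len(df)) &lt; ratio"

# html.unescape turns entities such as &lt; &gt; &amp; back into < > &.
fixed_line = html.unescape(escaped_line)
print(fixed_line)
```

Running copied code through html.unescape once is harmless for text that contains no entities, since such text is returned unchanged.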
Here is my code for the dataframe preprocessing diffreport.txt
import warnings; warnings.simplefilter("ignore")

# importing important libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import csv

df = pd.read_csv("diffreport.csv", sep= ",")

d1 = df.drop("name", axis = 1)
d2 = d1.drop("isotopes", axis = 1)
d3 = d2.drop("adduct", axis = 1)
d4 = d3.drop("tstat", axis = 1)
d5 = d4.drop("pvalue", axis = 1)
d6 = d5.drop("fold", axis = 1)
d7 = d6.drop(d6.columns[0], axis = 1)
d8 = d7.drop("npeaks", axis = 1)
d9 = d8.drop("Eta6", axis = 1)
d10 = d9.drop("Eta8", axis = 1)
columns = ['Eta6_0', 'Eta6_2', 'Eta6_3', 'Eta8.1', 'Eta82', 'Eta83']
df1 = pd.DataFrame(d10, columns = columns)
df1['Type'] = "1"
The rest of my code is similar to yours, but I have pasted it below for clarity:

import time
import pandas as pd
import numpy as np
import pickle

# Some modules for plotting and visualizing
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# And some Machine Learning modules from scikit-learn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
    # "AdaBoost": AdaBoostClassifier(),
}
def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 5, verbose = True):
def label_encode(df, list_columns): """ This method one-hot encodes all column, specified in list_columns
def expand_columns(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}is{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)
def get_train_test(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]

    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    X_train = df_train[x_cols].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test
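To see what the random mask in get_train_test does, it can help to run it on a tiny frame. The sketch below is illustrative only: the toy data and the fixed seed are assumptions, and it uses NumPy's default_rng generator instead of np.random.rand so that the split is repeatable.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame (values are made up).
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "Type": [1, 2] * 5,
})

rng = np.random.default_rng(0)      # fixed seed so the split is repeatable
ratio = 0.7
mask = rng.random(len(df)) < ratio  # True -> row goes to the training set

df_train = df[mask]
df_test = df[~mask]

# Every row lands in exactly one of the two sets; the training share is
# only approximately equal to ratio because each row is drawn independently.
print(len(df_train), len(df_test))

# With few rows, one set can end up missing a class entirely, so it is
# worth checking which labels actually made it into the training set.
print(sorted(df_train["Type"].unique()))
```

Because the split is random per row, a small or heavily imbalanced dataset can produce a training set with a single class even when the full label column has several.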
def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]
def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0,len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8,8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [abs value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()
# Classification
y_col_glass = 'Type'
x_cols_glass = list(df1.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df1, y_col_glass, x_cols_glass, train_test_ratio)

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8)
display_dict_models(dict_models)
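The body of batch_classify is not included in the paste above. Judging only from its signature and from the keys that display_dict_models reads ('train_score', 'test_score', 'train_time'), it might look roughly like the sketch below. This is an assumption, not the notebook's actual code: the classifiers parameter is added here so the example is self-contained, and MajorityClassifier is a hypothetical stand-in for any model with fit and score methods (scikit-learn estimators expose both).

```python
import time
from collections import Counter


def batch_classify(X_train, Y_train, X_test, Y_test, classifiers,
                   no_classifiers=5, verbose=True):
    """Fit up to no_classifiers models and record scores and fit time.

    classifiers maps a display name to any object exposing
    fit(X, y) and score(X, y), e.g. a scikit-learn estimator.
    """
    dict_models = {}
    for name, clf in list(classifiers.items())[:no_classifiers]:
        t_start = time.perf_counter()
        clf.fit(X_train, Y_train)
        train_time = time.perf_counter() - t_start
        dict_models[name] = {
            "model": clf,
            "train_score": clf.score(X_train, Y_train),
            "test_score": clf.score(X_test, Y_test),
            "train_time": train_time,
        }
        if verbose:
            print(f"trained {name} in {train_time:.2f} s")
    return dict_models


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        # Fraction of labels equal to the majority label seen during fit.
        return sum(1 for label in y if label == self.label_) / len(y)


models = batch_classify(
    X_train=[[0.1], [0.2], [0.3]], Y_train=[1, 1, 2],
    X_test=[[0.4], [0.5]], Y_test=[1, 2],
    classifiers={"Majority": MajorityClassifier()},
    verbose=False,
)
print(models["Majority"]["test_score"])
```

Passing the dictionary of classifiers in explicitly, rather than reading a global, makes it easy to try the loop on a single cheap model first when debugging a data problem such as the single-class error.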