pylablanche / gcForest

Python implementation of deep forest method : gcForest
MIT License

ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). #2

Closed chibohe closed 6 years ago

chibohe commented 7 years ago

The data is from UCI. Here is the link: http://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29 Here is my code.

data_dir = '../census_income.data'

df = pd.read_table(data_dir,sep=',',header=-1)

df[41][df[41]==' 50000+.']=1
df[41][df[41]==' - 50000.']=0

y_tag = 41
pos_value = 1
neg_value = 0
y = df[y_tag].values
y = y.astype(float32)
del df[y_tag]

for c in df.columns:
    if df[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df[c].values))
        df[c] = lbl.transform(list(df[c].values))

mmsc = MinMaxScaler()
for i in df.columns:
    df[i] = mmsc.fit_transform(df[i])

df = df.astype(float32)

df = df.fillna(df.median(axis=0))

X = df.values

X_train, X_test, y_train, y_test =train_test_split(np.nan_to_num(X),y,test_size = 0.3,random_state=123)

gcf_param={'shape_1X': X.shape[1],
'window':[1],
'n_mgsRFtree':30,
'stride':1,
'cascade_test_size':0.2,
'n_cascadeRF':2,
'n_cascadeRFtree':101,
'cascade_layer':100,
'min_samples_mgs':0.1,
'min_samples_cascade':0.05,
'tolerance':0.0,
'n_jobs':1
}

gcf=gcForest(**gcf_param)

start_time=datetime.datetime.now()

gcf.fit(X_train, y_train)

end_time = datetime.datetime.now()

cost_time = end_time-start_time

cost_time = int(cost_time.seconds)

The error is raised at 'gcf.fit(X_train,y_train)', but there is no NA or inf in the data, so I wonder where the problem is.
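A quick way to test the claim that the data holds no NA or inf is to count the offending entries right before calling fit. The sketch below is only an illustration: `report_non_finite` and `X_demo` are hypothetical names, not part of gcForest or scikit-learn. It also counts values that are finite in float64 but too large for float32, since scikit-learn's tree-based estimators cast their input to float32.

```python
import numpy as np

def report_non_finite(X):
    """Count NaN, infinite and float32-overflowing entries in a numeric array."""
    X = np.asarray(X, dtype=np.float64)
    float32_max = np.finfo(np.float32).max
    return {
        "nan": int(np.isnan(X).sum()),
        "inf": int(np.isinf(X).sum()),
        # Finite in float64, but would become inf once cast to float32.
        "too_large_for_float32": int(
            (np.isfinite(X) & (np.abs(X) > float32_max)).sum()
        ),
    }

# Small made-up array with one example of each problem.
X_demo = np.array([[1.0, np.nan], [np.inf, 1e39], [2.0, 3.0]])
print(report_non_finite(X_demo))
```

Calling the same helper on X_train just before gcf.fit would show whether any of the three counts is non-zero.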

pylablanche commented 7 years ago

@chibohe I just wanted to let you know that I haven't had much time to look at your issue lately but I'll be able to look at it next week. Have you been able to solve it or is it still a problem?

pylablanche commented 7 years ago

Hey @chibohe

I have tried to reproduce your results, and here is my code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from GCForest import gcForest

lbl = LabelEncoder()
mmsc = MinMaxScaler()

df = pd.read_table('adult.data',sep=',',header=-1)

df[14].replace({' >50K':1, ' <=50K':0}, inplace=True)

y_tag = 14
y = df[y_tag].values
y = y.astype('float32')
del df[y_tag]

for c in df.columns:
    if df[c].dtype == 'object':
        lbl.fit(list(df[c].values))
        df[c] = lbl.transform(list(df[c].values))

########### WARNINGS and deprecation Below!!!!
for i in df.columns:
    df[i] = mmsc.fit_transform(df[i])
###########

#df = df.astype(float32)
df = df.fillna(df.median(axis=0))
X = df.values
X_train, X_test, y_train, y_test = train_test_split(np.nan_to_num(X),y,test_size = 0.3,random_state=123)

gcf_param={'shape_1X': X.shape[1],
'window':[1],
'n_mgsRFtree':30,
'stride':1,
'cascade_test_size':0.2,
'n_cascadeRF':2,
'n_cascadeRFtree':101,
'cascade_layer':100,
'min_samples_mgs':0.1,
'min_samples_cascade':0.05,
'tolerance':0.0,
'n_jobs':1
}

gcf=gcForest(**gcf_param)
gcf.fit(X_train, y_train)

And it runs smoothly without any error.

That makes me think it is either a problem of outdated libraries or a hardware problem, i.e. running out of memory or something similar. How much memory do you have available?

chibohe commented 7 years ago

Thank you for your reply. Now it runs smoothly on my PC. Maybe I didn't fill the NA values before :)

pylablanche commented 7 years ago

@chibohe Glad I could help. Honestly, I can't really spot any problem in your code, except maybe here:

df[41][df[41]==' 50000+.']=1
df[41][df[41]==' - 50000.']=0

but I'm not even sure. Feel free to contact me again if you face any more difficulties!
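For reference, the two lines quoted above use chained indexing, which pandas may apply to a temporary copy instead of the original frame. A minimal sketch of a single-step alternative follows, using a small made-up frame in place of the census data; whether this was the actual cause of the error is not established in this thread.

```python
import pandas as pd

# Toy stand-in for the label column of the census data.
df = pd.DataFrame({41: [" 50000+.", " - 50000.", " 50000+."]})

# Map both labels in one assignment instead of writing through df[41][mask].
df[41] = df[41].map({" 50000+.": 1, " - 50000.": 0})

# Labels missing from the mapping would become NaN, so check before training.
assert df[41].notna().all()
print(df[41].tolist())
```

The final check matters because a label with unexpected whitespace would silently turn into NaN and end up in y.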

maysam19 commented 6 years ago

@pylablanche

Hello,

I am having the same issue and wonder if you can help me with my code. I am getting the same error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Any idea what could be the issue? I really appreciate your help. My full code is below:


maysam19 commented 6 years ago

import os
import numpy as np
import pandas as pd
import pickle
import quandl
from datetime import datetime
import seaborn as sns
import plotly.offline as py
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import quandl
lags = 5
start_test = pd.to_datetime('2017-03-01')
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.svm import SVC as SVC
from math import sqrt

# Since it’s a classifier, we need to create classes for each line: 1 if the future went up today, -1 if it went down or stayed the same.

def computeClassification(actual):
    if(actual > 0):
        return 1
    else:
        return -1

#DATA IMPORTING

#quandl.ApiConfig.api_key = os.environ["daQg3mGnaMeP2JDH5swh"]
quandl.ApiConfig.api_key = "daQg3mGnaMeP2JDH5swh"

# pull btc/usd rate from quandl off of bitstamp exchange
df = quandl.get(['BCHARTS/BITSTAMPUSD.4'], start_date = "2011-09-13", end_date = "2017-12-18")  ######### REMEMBER to have it at .4 if nyguyens, and remove if grgs

df.rename(columns={'BCHARTS/BITSTAMPUSD - Close': 'Close'}, inplace=True)

print(df.head())

# SIGNALS
df['Stdev'] = df['Close'].rolling(window=90).std() # calculate rolling 90 day std
df['SMA'] = df['Close'].rolling(50).mean()  # calculate 50 day SMA

# calculate daily returns
df['returns'] = np.log(df['Close'] / df['Close'].shift(1))
df['returns'].fillna(0)
df['returns_1'] = df['returns'].fillna(0)
df['returns_2'] = df['returns_1'].replace([np.inf, -np.inf], np.nan)
df['returns_final'] = df['returns_2'].fillna(0)
print(df['returns_final'])

ts = df
ts.index = pd.to_datetime(ts.index)
tslag = ts.copy()

for i in range(0, lags):
    tslag["Lag_" + str(i + 1)] = tslag["Close"].shift(i + 1)
tslag["returns_final"] = tslag["Close"].pct_change()

# Create the lagged percentage returns columns
for i in range(0, lags):
    tslag["Lag_" + str(i + 1)] = tslag["Lag_" + str(i + 1)].pct_change()
tslag.fillna(0, inplace=True)

tslag["Direction"] = np.sign(tslag["returns_final"])
# Use the prior two days of returns as predictor values, with direction as the response
X = tslag[["Lag_1", "Lag_2"]]
y = tslag["Direction"]

# Create training and test sets
X_train = X[X.index < start_test]
X_test = X[X.index >= start_test]
y_train = y[y.index < start_test]
y_test = y[y.index >= start_test]

# Create prediction DataFrame
pred = pd.DataFrame(index=y_test.index)

svc = SVC()  # import SVC  # import RFC
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)  # predict y based on x_test

# pred = (1.0 + y_pred * y_test)/2.0
pred = (1.0 + (y_pred == y_test)) / 2.0
hit_rate = np.mean(pred)
print('SVC {:.4f}'.format(hit_rate))
print(pred)

# CALCULATIONS FOR MEAN SQUARED ERROR (MSE)

# Create a decision tree regressor and fit it to the training set
regressor = SVC()
regressor.fit(X_train, y_train)

# Evaluate the model: evaluate performance of the model (mean squared error shown below)
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_pred, regressor.predict(X_test))
print("MSE: %.4f" % mse)

df['strategy'] = pred * df['returns_final']  # however cumulative performance of the strategy
df[['returns', 'strategy']].ix[lags:].cumsum().apply(np.exp).plot(
    figsize=(10, 6))

plt.show()

maysam19 commented 6 years ago

Sorry, I resent the code in proper format so you can view it easily. Thanks.

maysam19 commented 6 years ago

Never mind, you can disregard this.

I figured it out, thanks.

pylablanche commented 6 years ago

@maysam19, great to hear that (I was about to look at your problem).

If I may ask, what was the problem, and at what line in your code?

kingfengji commented 6 years ago

I believe the code has something to do with quantitative trading, i.e. classifying a sequence of time series data into buy or sell (since he imported the quandl module and computed technical indicators such as SMAs).

pylablanche commented 6 years ago

@kingfengji That was my guess too! The classification tag made it pretty obvious. I was more curious to know if it was an error like missing values or wrong data type. :)
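One way this kind of error can arise in return calculations is a zero price: the percentage change from zero is infinite, and fillna(0) only removes NaN, not inf. The sketch below uses a made-up price series and is an assumption about a possible cause, since the actual fix was never reported in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices containing a zero (e.g. a day without trades).
close = pd.Series([10.0, 0.0, 12.0, 13.0])

returns = close.pct_change()   # the step from 0.0 to 12.0 yields inf

filled = returns.fillna(0)     # NaN is gone, inf is still there
cleaned = returns.replace([np.inf, -np.inf], np.nan).fillna(0)

print(np.isinf(filled).any(), np.isfinite(cleaned).all())
```

Applying the replace step to every lagged column, not just to the returns column, would be needed for all features to be finite.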

kingfengji commented 6 years ago

@pylablanche haha... you are probably right, I think. @maysam19 for this kind of data, class imbalance is an important issue that needs to be taken care of (most of the time the stock price won't rise sharply), so you need to set sample weights for the base estimators to take the imbalance into account.
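As an illustration of the sample-weight idea, here is a minimal sketch with a plain scikit-learn random forest on synthetic data. Whether this gcForest implementation lets you pass weights to its base estimators is not stated in the thread, so the example deliberately stays with scikit-learn; the data and the 10% positive rate are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced two-class data (roughly 10% positives).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = np.where(rng.uniform(size=200) < 0.1, 1, -1)

# "balanced" gives each class the same total weight.
weights = compute_sample_weight("balanced", y)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
```

Passing class_weight="balanced" to RandomForestClassifier is an alternative that avoids computing the weights by hand.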

saeed344 commented 4 years ago

Hello,

I am having the same issue and wondering if you can help me with code. I am getting the same error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I am using the gcForest classifier. The error occurs at the end of the code. I simply pass a rows × columns array (rows = number of samples, columns = features). When I pass the data to the gcForest classifier, I get this error. Any idea what could be the issue? I really appreciate your help. My full code is below:

import sys
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

from gcForest.GCForest import gcForest

from GCForest import gcForest
from sklearn.preprocessing import scale,StandardScaler

from load_data import load_data

from baseline import TitleFinder

from baseline_author import AuthorFinder

from constants import *

# Step 01 : Load the dataset :

iRec = 'SVM_CBR_bestfirst_Hspeice.csv'
df = pd.read_csv(iRec, header=None).fillna(0) # Using pandas

___

data=scale(df)
from sklearn import preprocessing
data=preprocessing.normalize(data)
label1=np.ones((495,1)) #Value can be changed
label2=np.zeros((495,1))
label=np.append(label1,label2)
X=data
y=label

# Step 02 : Divide features (X) and classes (y) :

___

def main():

X = data.iloc[:, :-1].values

#y = D.iloc[:, -1].values
#X, y, tfidf = load_data()

# Number of Features
print('Using ", NUM_FEATURES, "Features based on tf-idf')

# feature selection to make the problem tractable for gcforest

fs = SelectKBest(k=31)

X = fs.fit_transform(X,y)

#X = np.asarray(X.toarry())

X=np.asarray(X)

#X, _, y, _ = train_test_split(X, y, train_size=0.3, random_state=1330, stratify=y)

possibleNumTrees = [50, 100, 200,250,300,350,400,450]
possibleNumForests = [2, 4, 6,8,10,12,14]

bestAccuracy = -float("inf")
bestNumTrees = 0
bestNumForests = 0

folds = StratifiedKFold(n_splits=5, shuffle=False, random_state=None)#random_state=1330)
for numForests in possibleNumForests:
    for numTrees in possibleNumTrees:
        #print('Now testing numForests=%d, numTrees=%d" % (numForests, numTrees))
        scores = []
        for train_index, test_index in folds.split(X, y):
            model = gcForest(shape_1X=30,window=5, n_cascadeRF=numForests, n_cascadeRFtree=numTrees, n_jobs=-1)
            X_train, X_test = X[train_index, :], X[test_index, :]
            y_train, y_test = y[train_index], y[test_index]
            #X_train.fillna(X_train.mean())
            #X_test.fillna(X_test.mean())
            #X_train = train_df.fillna(method='ffill').values
            #X_test = train_df.fillna(method='ffill').values
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            scores.append(accuracy_score(y_test, y_pred))
        print('Cross validation scores:', scores)
        accuracy = np.mean(scores)

        if accuracy > bestAccuracy:
            bestAccuracy = accuracy
            bestNumTrees = numTrees
            bestNumForests = numForests

print("Best Accuracy = ", bestAccuracy)
print("best Num Forests =", bestNumForests)
print("Best Num Trees =", bestNumTrees)

if __name__ == '__main__':
    main()

Error:

File "C:\Users\saeed\Miniconda3\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Process finished with exit code 1

saeed344 commented 4 years ago

@pylablanche please check my issue. The data is attached to this message.

- SVM_CBR_bestfirst_Hspeice.xlsx

ishikaahuja commented 4 years ago

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import KFold

data=pd.read_csv(r"C:\Users\ISHIKA\4th\household_power_consumption\household_power_consumption.txt",delimiter=";")

print(data.head())

data.drop(["Date","Time"],axis=1,inplace=True)

cols=data.columns
data[cols]=data[cols].replace(["?"],[None])
data=data.fillna(data.mean(axis=1))

data.replace(["?"],[data.mean()],inplace=True,axis=1)

print(data.head())

x=data.drop(['Global_active_power'],axis=1)

y=data[['Global_active_power']]

x_train,x_test,y_train,y_test=tts(x,y,train_size=0.7,random_state=200)
x_train=x_train.to_numpy()
x_test=x_test.to_numpy()
y_train=y_test.to_numpy()
y_test=y_test.to_numpy()
from sklearn.preprocessing import StandardScaler

x_train=x_train.to_numpy()

x=x.astype(float)

scaler=StandardScaler()
scaler.fit(x_train)
scaler.fit(y_train)
scaler.fit(x_test)
scaler.fit(y_test)
x_train=scaler.transform(x_train)
y_train=scaler.transform(y_train)
x_test=scaler.transform(x_test)
y_test=scaler.transform(y_test)

t=Normalizer()

x=t.transform(x)

np.nan_to_num(x_train)
np.nan_to_num(x_test)

from sklearn.preprocessing import PolynomialFeatures
pr=PolynomialFeatures(degree=4,include_bias=True)
x_poly=pr.fit_transform(x_train)
pr.fit(x_poly,y_train)
print("checklist2")
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_poly,y_train)
plt.scatter(x,y,color="Red")
plt.plot(x_test,lr.predict(pr.fit_transform(x_test)),color="black")
plt.show()

ishikaahuja commented 4 years ago

Please help me. I have the same error, even though there are no huge values in my data set.
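Reading the code above, two steps may leave NaN in the arrays; this is a guess from the code, not a confirmed diagnosis. First, DataFrame.fillna with a Series matches the Series index against column labels, so row means (axis=1) generally fill nothing. Second, np.nan_to_num returns a new array, so calling it without assigning the result has no effect. A minimal sketch on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Toy frame using "?" as the missing-value marker, as in the dataset above.
data = pd.DataFrame({"a": ["1.0", "?", "3.0"], "b": ["4.0", "5.0", "?"]})

# Convert every column to numbers; anything unparsable ("?") becomes NaN.
data = data.apply(pd.to_numeric, errors="coerce")

# Fill each column with its own mean (axis=0), which matches column labels.
data = data.fillna(data.mean(axis=0))

x = data.to_numpy()
# np.nan_to_num does not modify x in place by default; keep the returned array.
x = np.nan_to_num(x)

assert np.isfinite(x).all()
print(x)
```

The same pattern (assign the result, then assert that everything is finite) can be applied to x_train and x_test before fitting.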