google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Train on Pandas DataFrame with header=None does not work #97

Closed TonyCongqianWang closed 1 week ago

TonyCongqianWang commented 3 weeks ago

I used pandas to import a CSV with no header, so all column names are autogenerated integers. Using label="0" results in ValueError: Column '0' is required but was not found in the data. Available columns: [0, 1, 2 ... Using label=0 instead results in ValueError: Constructing the learner requires a non-empty label. The problem also occurs whenever column names are numbers instead of strings.
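For context, here is a minimal sketch of why the string label "0" is not found. It assumes a small made-up CSV held in memory, not the reporter's actual file. With header=None, pandas labels the columns with the integers 0, 1, 2, ..., so a lookup by the string "0" fails.

```python
import io

import pandas as pd

# Made-up header-less CSV content, used only for illustration.
csv_text = "1,0.1,0.2\n0,0.3,0.4\n"

# header=None tells pandas that the first row is data, not column names.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# The autogenerated column labels are integers, not strings.
columns = df.columns.tolist()
has_string_label = "0" in df.columns
has_int_label = 0 in df.columns

print(columns)           # [0, 1, 2]
print(has_string_label)  # False
print(has_int_label)     # True
```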

achoum commented 3 weeks ago

Thanks for the alert. We will work on improving header-less support for pandas DataFrames. In the meantime, if your dataset has no column names, you can feed it as a single multi-dimensional feature using numpy. Here is an example:

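import numpy as np
import ydf

# 100 examples with 5 unnamed numerical features each.
X = np.random.uniform(size=(100, 5))
# Random binary labels.
y = np.random.uniform(size=100) >= 0.5
# The whole 2D array is passed as one multi-dimensional feature named "features".
model = ydf.RandomForestLearner(label="label").train({"features": X, "label": y})
model.input_features()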

Using to_numpy, you can train YDF models on header-less pandas dataframes by turning them into numpy arrays.

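import pandas as pd
import ydf

# Convert the header-less dataframes to numpy arrays.
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]]).to_numpy()
# Take the first column to get a 1D label array.
y = pd.DataFrame([1, 2]).to_numpy()[:, 0]

model = ydf.RandomForestLearner(label="label").train({"features": X, "label": y})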
model.input_features()

TonyCongqianWang commented 3 weeks ago

Thanks for the quick reply! The solution I used was to rename the columns with names = {0: "y", 1: "feature_0", 2: "feature_1" .... } followed by df = df.rename(columns=names), which also worked fine.
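The renaming workaround can be sketched as follows. This is an illustration under assumptions, not the reporter's exact code: the helper name_columns and the sample data are hypothetical, and the label is assumed to be in column 0. Note that DataFrame.rename needs the columns= keyword here, since a mapping passed positionally is applied to the row index.

```python
import pandas as pd

# Made-up header-less data: pandas assigns integer column labels 0, 1, 2.
df = pd.DataFrame([[1, 0.1, 0.2], [0, 0.3, 0.4], [1, 0.5, 0.6]])


def name_columns(frame, label_index=0, label_name="y"):
    """Hypothetical helper: return a copy of `frame` with string column names."""
    mapping = {}
    feature_id = 0
    for col in frame.columns:
        if col == label_index:
            mapping[col] = label_name
        else:
            mapping[col] = f"feature_{feature_id}"
            feature_id += 1
    # columns= is required; a positional mapping would rename the row index.
    return frame.rename(columns=mapping)


named = name_columns(df)
print(list(named.columns))  # ['y', 'feature_0', 'feature_1']

# With string column names, the dataframe can then be passed to a learner, e.g.:
# model = ydf.RandomForestLearner(label="y").train(named)
```

The original dataframe is left unchanged because rename returns a new object by default.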

achoum commented 1 week ago

Solved in the 0.5.0 release.