google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

porting an example from tensorflow #86

Closed prashant-saxena closed 1 month ago

prashant-saxena commented 3 months ago

Hello,

I'm trying to port a simple project that detects silence (very low noise) in audio files from TensorFlow to ydf. The input data is a single numpy array of shape (1500, 20): 1500 samples of Mel Frequency Cepstrum Coefficients (MFCC), with 20 floats in each sample.

How do I train a model on this data using ydf? Later, I would like to generate a prediction for a single MFCC array of 20 floats.

Thanks

rstz commented 3 months ago

Hi, you can train directly on multi-dimensional numpy data as explained in the documentation: https://ydf.readthedocs.io/en/latest/tutorial/multidimensional_feature

The short version, using random data, is:

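import numpy as np
import ydf

num_examples = 10000
num_rows = 20  # number of values per example (e.g. 20 MFCC coefficients)

# Random features of shape (num_examples, num_rows) and random binary labels.
train_data = np.random.uniform(size=(num_examples, num_rows))
train_label = np.random.randint(0, 2, size=(num_examples))

# A dict of numpy arrays; "features" is one multi-dimensional feature.
train_ds = {"features": train_data, "label": train_label}

model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# A single example must still be 2-D: shape (1, num_rows).
test_data = {"features": np.random.uniform(size=(1, num_rows))}

model.predict(test_data)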
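
For the second part of the question (predicting on a single MFCC array of 20 floats), the vector has to be reshaped into a one-row batch before it is passed to the model. A minimal sketch, assuming the feature is named "features" as above; the helper name `as_single_example` is hypothetical and not part of ydf:

```python
import numpy as np

NUM_COEFFS = 20  # MFCC coefficients per sample, as stated in the question


def as_single_example(mfcc, feature_name="features"):
    """Wrap one MFCC vector of shape (20,) as a one-row dataset dict.

    Hypothetical helper for illustration; it only reshapes the input.
    """
    row = np.asarray(mfcc).reshape(1, -1)  # (20,) -> (1, 20)
    if row.shape[1] != NUM_COEFFS:
        raise ValueError(f"expected {NUM_COEFFS} values, got {row.shape[1]}")
    return {feature_name: row}


# Stand-in for one real MFCC vector.
single = as_single_example(np.random.uniform(size=NUM_COEFFS))
print(single["features"].shape)
```

With a trained model, the result can then be passed as `model.predict(single)`.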

prashant-saxena commented 3 months ago

Hi, thanks for the tip. I tried what you suggested, but the predictions look like random values between 0.0 and 1.0, so they are not useful at all.

prashant-saxena commented 3 months ago

OK, here is the test. Extract the files (train.npy, test.npy) from the attached zip file and run:

import numpy as np
import ydf

train_data = np.load('train.npy')
train_label = np.random.randint(0, 2, size=(train_data.shape[0]))

print(train_data.shape)

train_ds = {"features": train_data, "label": train_label}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
test_data = {"features": np.load('test.npy')}

predictions = model.predict(test_data)
print(predictions)

For the same data, TensorFlow's predictions are 99% correct but ydf's predictions look random to me. Am I missing something here? ydf.zip

achoum commented 2 months ago

This notebook shows how to train a model on this dataset and make predictions with a Random Forest and a Gradient Boosted Trees model. The notebook also runs a cross-validation to evaluate the quality of predictions on this small dataset.

The model's self-evaluation (model.describe(); out-of-bag accuracy of 53%) and the cross-validation (learner.cross_validation(train_ds); accuracy=50%, AUC=51%) both show that the input features are essentially uncorrelated with the labels.

You mention that "TensorFlow's predictions are 99% correct". Are you sure you are using the same dataset? If so, are you sure you are not evaluating on the training dataset?