serengil / chefboost

A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting, Random Forest and Adaboost w/categorical features support for Python
https://www.youtube.com/watch?v=Z93qE5eb6eg&list=PLsS_1RYmYQQHp_xZObt76dpacY543GrJD&index=3
MIT License
453 stars, 101 forks

findDecision incorrect? #25

Closed: rjgarciar closed this issue 9 months ago

rjgarciar commented 2 years ago

I have a CSV with precomputed cosine distances between the face embeddings of the people images in my dataset, like this:

       Person1     Person2  Idx1  Idx2  Distance Decision
0   Aaron Paul  Aaron Paul     0     1    0.3245      Yes
1   Aaron Paul  Aaron Paul     0     2    0.2281      Yes
2   Aaron Paul  Aaron Paul     0     3    0.4737      Yes
3   Aaron Paul  Aaron Paul     0     4    0.4103      Yes
4   Aaron Paul  Aaron Paul     0     5    0.3236      Yes
5   Aaron Paul  Aaron Paul     0     6    0.3270      Yes
6   Aaron Paul  Aaron Paul     0     7    0.4873      Yes
7   Aaron Paul  Aaron Paul     0     8    0.3988      Yes
8   Aaron Paul  Aaron Paul     1     2    0.2357      Yes
9   Aaron Paul  Aaron Paul     1     3    0.2613      Yes
10  Aaron Paul  Aaron Paul     1     4    0.3827      Yes
11  Aaron Paul  Aaron Paul     1     5    0.2221      Yes
12  Aaron Paul  Aaron Paul     1     6    0.2183      Yes
13  Aaron Paul  Aaron Paul     1     7    0.4568      Yes
14  Aaron Paul  Aaron Paul     1     8    0.2391      Yes
15  Aaron Paul  Aaron Paul     2     3    0.4439      Yes
16  Aaron Paul  Aaron Paul     2     4    0.4086      Yes
17  Aaron Paul  Aaron Paul     2     5    0.2592      Yes
18  Aaron Paul  Aaron Paul     2     6    0.2863      Yes
19  Aaron Paul  Aaron Paul     2     7    0.4588      Yes

And I use this script to build the decision tree (findDecision):

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
    ##############################################################################
    # Read the CSV to determine the best threshold...
    df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
    print(df.head(20))

    df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
    df2 = df[df['Decision'] == "No"]['Distance'].copy()
    print(f"Count Yes: {df1.count()}")
    print(f"Average Yes: {round(df1.mean(), 4)}")
    print(f"Std. deviation Yes: {round(df1.std(), 4)}")
    print(f"Min Yes: {round(df1.min(), 4)}")
    print(f"Max Yes: {round(df1.max(), 4)}")
    print(f"Mode Yes: {round(df1.mode()[0], 4)}")

    print(f"Count No: {df2.count()}")
    print(f"Average No: {round(df2.mean(), 4)}")
    print(f"Std. deviation No: {round(df2.std(), 4)}")
    print(f"Min No: {round(df2.min(), 4)}")
    print(f"Max No: {round(df2.max(), 4)}")
    print(f"Mode No: {round(df2.mode()[0], 4)}")

    df1.plot.kde()
    df2.plot.kde()
    plt.legend(["Yes", "No"])
    plt.grid()
    plt.axhline(0,color='red')
    plt.axvline(0,color='red')
    plt.show()

    from chefboost import Chefboost as chef
    config = {'algorithm': 'C4.5'}

    tmp_df = df[['Distance', 'Decision']].copy()
    model = chef.fit(tmp_df, config)
    print(model)

The results I get are:

Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465

Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114

[INFO]:  8 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
-------------------------
finished in  135.35767483711243  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  99.81929981118321 % on  59901985  instances
Labels:  ['Yes' 'No']
Confusion matrix:  [[43, 1], [108242, 59793699]]
Precision:  97.7273 %, Recall:  0.0397 %, F1:  0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}
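
The metrics in the log above can be reproduced from the printed confusion matrix. The sketch below assumes the layout [[TP, FP], [FN, TN]] with 'Yes' as the positive class; that layout is inferred from the reported precision and recall, not taken from the chefboost documentation. It shows that only 44 instances end up on the 'Yes' side of the split.

```python
# Values copied from the log: Confusion matrix: [[43, 1], [108242, 59793699]]
# Assumed layout (inferred from the reported metrics): [[TP, FP], [FN, TN]]
tp, fp = 43, 1
fn, tn = 108242, 59793699

total = tp + fp + fn + tn
precision = tp / (tp + fp)          # share of predicted 'Yes' that are 'Yes'
recall = tp / (tp + fn)             # share of actual 'Yes' that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

print(f"Predicted Yes: {tp + fp} of {total} instances")
print(f"Precision: {100 * precision:.4f} %")
print(f"Recall:    {100 * recall:.4f} %")
print(f"F1:        {100 * f1:.4f} %")
print(f"Accuracy:  {100 * accuracy:.4f} %")
```

The high accuracy comes almost entirely from the majority 'No' class; the recall shows that nearly all 'Yes' pairs are misclassified by the 0.0 threshold.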

The plot is:

[Plot: ArcFace-cosine]

and outputs/rules/rules.py:

def findDecision(obj): #obj[0]: Distance
    # {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
    if obj[0]>0.0:
        return 'No'
    elif obj[0]<=0.0:
        return 'Yes'
    else: return 'Yes'

As you can see, it gives me a 0.0 threshold when it should be around 0.68.

Am I doing something wrong?

Regards

serengil commented 2 years ago

Can you share the data set?

serengil commented 2 years ago

It might be a rounding problem. The comment line in rules.py says "metric_value": 0.0191.

rjgarciar commented 2 years ago

Shouldn't it be around 0.5 looking at the plot?

The data set has these columns:

I know this is off topic and belongs in your deepface package, but it's the reason I was trying to establish a threshold: it is relatively common that:

Perhaps this is the reason for "metric_value": 0.0191...

Data set is available here

serengil commented 2 years ago

The data set is really large and I cannot download it. Could you subsample it and share it here again?

rjgarciar commented 2 years ago

I have uploaded faces_3.csv here, a 50% subsample of the original data.

alwaysmpe commented 10 months ago

For large datasets the code doesn't evaluate every possible partition of a continuous feature (presumably for performance reasons). Instead it uses the mean and the mean +/- 1 to 3 standard deviations as candidate thresholds. This subsampling is implemented in processContinuousFeatures.
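
To illustrate the difference, here is a minimal sketch that compares a coarse candidate set (mean + k * std for k = -3..3) with an exhaustive search over every distinct value, scoring each split with a C4.5-style gain ratio. It is not chefboost's code: the helper names (entropy, gain_ratio, best_threshold, coarse_candidates) are hypothetical, the exact candidate set chefboost uses is an assumption based on the description above, and the data is synthetic.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `values <= threshold`."""
    mask = values <= threshold
    n, n_left = len(labels), int(mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0  # degenerate split: everything falls on one side
    w_left = n_left / n
    w_right = 1.0 - w_left
    gain = (entropy(labels)
            - w_left * entropy(labels[mask])
            - w_right * entropy(labels[~mask]))
    split_info = -(w_left * np.log2(w_left) + w_right * np.log2(w_right))
    return float(gain / split_info)

def best_threshold(values, labels, candidates):
    """Return the candidate threshold with the highest gain ratio."""
    scores = [gain_ratio(values, labels, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

def coarse_candidates(values, max_k=3):
    """Assumed shortcut: mean + k * std for k = -max_k..max_k."""
    mean, std = values.mean(), values.std()
    return np.array([mean + k * std for k in range(-max_k, max_k + 1)])

# Synthetic, imbalanced data (NOT the data set from this issue).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.45, 0.15, 100),    # 'Yes' pairs
                         rng.normal(0.80, 0.11, 1000)])  # 'No' pairs
labels = np.array(["Yes"] * 100 + ["No"] * 1000)

coarse = best_threshold(values, labels, coarse_candidates(values))
exhaustive = best_threshold(values, labels, np.unique(values))
print("coarse threshold:    ", round(coarse, 4),
      "gain ratio:", round(gain_ratio(values, labels, coarse), 4))
print("exhaustive threshold:", round(exhaustive, 4),
      "gain ratio:", round(gain_ratio(values, labels, exhaustive), 4))
```

The exhaustive search can never score worse than the coarse one, because every coarse threshold produces a partition that some distinct data value also produces. Note too that gain ratio divides by the split information, which is small for very unbalanced splits, so an extreme candidate can score well. Whether that explains the 0.0 threshold reported in this issue is not confirmed in the thread.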