google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Could you help me understand how unbalanced data is treated in your code? #48

Closed Sandy4321 closed 1 year ago

Sandy4321 commented 1 year ago

Could you help me understand how unbalanced data is treated in your code? https://arxiv.org/pdf/2212.02934.pdf

rstz commented 1 year ago

Hi, can you please clarify your question?

Sandy4321 commented 1 year ago

When data is unbalanced, some labels occur far more often than others. For example, if the YES label count is 123 and the NO label count is 9876543, we have 123 + 9876543 = 9876666 samples (rows) overall. The prediction algorithm should then be designed with special treatment to get a high F1 score. For details, please refer to https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
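To make the concern concrete, here is a minimal sketch in plain Python (not YDF code) using the counts from the example above. It assumes a hypothetical baseline model that always predicts NO, and the helper `f1_score` is written here only for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive class, computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


num_yes = 123
num_no = 9_876_543
total = num_yes + num_no

# Hypothetical baseline: always predict NO.
# Every NO row is correct; every YES row is a false negative.
accuracy = num_no / total
f1_yes = f1_score(tp=0, fp=0, fn=num_yes)

print(f"total rows: {total}")
print(f"accuracy:   {accuracy:.6f}")
print(f"F1 (YES):   {f1_yes}")
```

The baseline gets almost every row right yet never finds a single YES, which is why accuracy alone is a poor target on data like this.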

To sum up, I do not see in your paper how this unbalanced data issue is addressed, but I hope you do have proper treatment for unbalanced data.

rstz commented 1 year ago

YDF supports example weights, which allow the user to re-weight the training examples using any of the methods explained in the article. The weights can be set manually or through a mapping. See the WeightDefinition proto for details.
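As a rough illustration of one common re-weighting scheme, the sketch below derives per-example weights from label counts with the "balanced" heuristic, weight = total / (num_classes * class_count). It is plain Python and does not call any YDF API; the function name `balanced_class_weights` and the tiny label list are assumptions made for this example. How the resulting weights are passed to YDF is described by the WeightDefinition proto and is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each label to total / (num_classes * count_of_that_label)."""
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


# Small illustrative dataset: 3 YES rows and 9 NO rows.
labels = ["YES"] * 3 + ["NO"] * 9

class_weights = balanced_class_weights(labels)

# One weight per training example, e.g. to store in a weight column.
example_weights = [class_weights[label] for label in labels]

# With this heuristic, each class contributes the same total weight.
yes_total = sum(w for w, l in zip(example_weights, labels) if l == "YES")
no_total = sum(w for w, l in zip(example_weights, labels) if l == "NO")

print(class_weights)
print(yes_total, no_total)
```

The dictionary `class_weights` is a label-to-weight mapping, while `example_weights` is the per-row form of the same information; which of the two is more convenient depends on how the weights are supplied to the learner.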

Sandy4321 commented 1 year ago

I see the comment: // "LinkedWeightDefinition" is a pre-processed version of "WeightDefinition"

Does this mean the weights are calculated automatically from the label ratio, and not only set manually as your answer suggests ("The weights can be set manually or through a mapping")?