google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0
447 stars, 49 forks

Could you help me understand how unbalanced data is treated in your code? #48

Closed: Sandy4321 closed this issue 12 months ago

Sandy4321 commented 1 year ago

Could you help me understand how unbalanced data is treated in your code? https://arxiv.org/pdf/2212.02934.pdf

rstz commented 1 year ago

Hi, can you please clarify your question?

Sandy4321 commented 1 year ago

When data is unbalanced, some labels occur far more often than others. For example, if the YES label count is 123 and the NO label count is 9876543, we have 123 + 9876543 = 9876666 samples (rows) overall. In that case the prediction algorithm should be designed with special treatment to achieve a high F1 score. For details, please refer to https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
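To make the numbers in this example concrete, here is a small sketch in plain Python that computes the total number of rows, the share of the minority class, and the majority-to-minority ratio. The variable names are illustrative only and are not part of YDF.

```python
# Label counts taken from the example above.
yes_count = 123        # minority class
no_count = 9_876_543   # majority class

# Total number of samples (rows).
total = yes_count + no_count

# Share of the data that belongs to the minority class.
minority_fraction = yes_count / total

# How many majority examples exist for each minority example.
imbalance_ratio = no_count / yes_count

print(f"total rows:        {total}")
print(f"minority fraction: {minority_fraction:.6%}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} to 1")
```

A ratio this large is why plain accuracy is uninformative here: a model that always predicts NO would be right on almost every row while finding none of the YES cases.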

To sum up, I do not see in your paper how this unbalanced data issue is addressed, but hopefully you do have proper unbalanced data treatment in any case.

rstz commented 12 months ago

YDF supports example weights, which let the user re-weight the training examples using any of the methods explained in the article. The weights can be set manually or through a mapping. See the WeightDefinition proto for details.
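As a rough illustration of re-weighting through a mapping, the sketch below derives one weight per class from the label counts and then expands it into one weight per example. It is plain Python and makes no YDF calls. The helper name `balanced_class_weights` and the "n / (k * count)" heuristic are assumptions chosen for illustration, not something YDF is confirmed to do. How the resulting weight column is wired into YDF (for example through the WeightDefinition proto mentioned above) is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a label -> weight mapping (hypothetical helper, not a YDF API).

    Each class gets weight n / (k * count), where n is the number of
    examples, k the number of classes, and count the class frequency,
    so every class contributes the same total weight.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Toy data: 9 majority examples and 1 minority example.
labels = ["NO"] * 9 + ["YES"] * 1

# Mapping from label value to weight.
class_weights = balanced_class_weights(labels)

# One weight per training example, ready to be stored as a weight column.
example_weights = [class_weights[y] for y in labels]

# Total weight carried by each class after re-weighting.
weight_per_class = {
    label: sum(w for y, w in zip(labels, example_weights) if y == label)
    for label in class_weights
}
print(class_weights)
print(weight_per_class)
```

With these weights the single YES example carries as much total weight as the nine NO examples combined, which is the effect the re-weighting methods in the article aim for.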

Sandy4321 commented 12 months ago

I see // "LinkedWeightDefinition" is a pre-processed version of "WeightDefinition"

Does this mean the weights are calculated automatically from the label ratio, and not only set manually as your answer suggests ("The weights can be set manually or through a mapping")?