harrysouthworth / gbm

Gradient boosted models

[Question] How are candidate splits for continuous variables determined? #24

Closed. patr1ckm closed this issue 9 years ago.

patr1ckm commented 10 years ago

Can someone shed some light on how gbm chooses candidate splits for continuous variables? My previous understanding was that continuous variables are discretized to generate candidate splits.

I've been reading through the C++ code and narrowed it down to CNodeSearch::IncorporateObs. I've included what looks like the relevant bit of the code below (with a few things commented out):

// Evaluate the current split
// the newest observation is still in the right child
dCurrentSplitValue = 0.5*(dLastXValue + dX);

if((dLastXValue != dX) &&
   (cCurrentLeftN >= cMinObsInNode) &&
   (cCurrentRightN >= cMinObsInNode) &&
   ((lMonotone==0) ||
    (lMonotone*(dCurrentRightSumZ*dCurrentLeftTotalW -
                dCurrentLeftSumZ*dCurrentRightTotalW) > 0)))
{
    dCurrentImprovement = // compute improvement

    if(dCurrentImprovement > dBestImprovement)
    {
        // update the splitting variable, split value, improvement;
        // and N, sumZ, and TotalW for left and right nodes.
    }
}

I understand that each predictor is sorted, and I think dX is the predictor value for the current observation and dLastXValue is the value from the previous observation.

From the code, it looks like the candidate split is just the average of the current observation (dX) and the previous one. This seems to indicate that a candidate split is generated for every observation, and that there will be n-1 candidate splits for each predictor. Is this correct?
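
To check my reading, here is a minimal standalone sketch of what I think the scan amounts to (my own illustration, not the actual gbm code; the candidateSplits name and signature are made up, and I've dropped the monotonicity check):

#include <cstddef>
#include <vector>

// Sketch only: given predictor values already sorted in ascending order,
// propose a split halfway between consecutive values whenever the value
// changes and both children would keep at least cMinObsInNode observations.
std::vector<double> candidateSplits(const std::vector<double>& xSorted,
                                    std::size_t cMinObsInNode)
{
    std::vector<double> splits;
    double dLastXValue = xSorted.empty() ? 0.0 : xSorted[0];

    for (std::size_t i = 1; i < xSorted.size(); ++i)
    {
        const double dX = xSorted[i];
        const std::size_t cCurrentLeftN  = i;                   // observations before position i
        const std::size_t cCurrentRightN = xSorted.size() - i;  // observations from position i on

        if ((dLastXValue != dX) &&
            (cCurrentLeftN >= cMinObsInNode) &&
            (cCurrentRightN >= cMinObsInNode))
        {
            splits.push_back(0.5*(dLastXValue + dX));  // same midpoint rule as in the excerpt
        }
        dLastXValue = dX;
    }
    return splits;
}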

Neil-Schneider commented 10 years ago

Correct, it will consider a split between every pair of consecutive unique values of a continuous variable. The actual split value is the average of the two consecutive observations. If the feature is truly continuous and every observation is unique, there will be n-1 candidate splits. If there are observations with the same value, however, the number of candidate splits is reduced, because it will only attempt to split around each unique value once.
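
As a toy illustration (again just a sketch, not gbm code): with ties, the candidates are the midpoints between consecutive unique values, so k unique values give k-1 candidates rather than n-1:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    // n = 6 observations, but only k = 4 unique values: 1, 2, 3, 5
    std::vector<double> x = {1.0, 2.0, 2.0, 3.0, 3.0, 5.0};

    // collapse ties: only one split is attempted around each unique value
    std::sort(x.begin(), x.end());
    x.erase(std::unique(x.begin(), x.end()), x.end());

    // midpoints between consecutive unique values: 1.5, 2.5, 4.0 (k-1 = 3 candidates)
    for (std::size_t i = 1; i < x.size(); ++i)
        std::printf("candidate split at %.1f\n", 0.5*(x[i-1] + x[i]));

    return 0;
}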

patr1ckm commented 10 years ago

Thanks!