applyforest not happy with gradient boosting via -gbt

glycerine commented 5 years ago

Seems like the -gbt gradient boosting might need a little love. I could get it to grow, but applyforest was unhappy (crashed; see below).

## first without -gbt, works:
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ growforest -train data/iris.data.fm -rfpred forest.sf -target C:Class -mTry 2 -nTrees 20 -leafSize 1 
Threads : 1
nTrees : 20
Loading data from: data/iris.data.fm
Target : C:Class
Non Target Features : 4
mTry : 2
non-missing cases: 150
leafSize : 1
nSamples : 150
Performing classification with 3 categories.
Total training time (seconds): 0.008251627
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv
Target is C:Class in feature 4
Error: 0.00666666666666671
Outputting label predicted actual tsv to pred.tsv

## now with -gbt, growforest works, but applyforest crashes:
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ growforest -train data/iris.data.fm -rfpred forest.sf -target C:Class -mTry 2 -nTrees 20 -leafSize 1 -gbt 0.0001
Threads : 1
nTrees : 20
Loading data from: data/iris.data.fm
Target : C:Class
Non Target Features : 4
mTry : 2
non-missing cases: 150
leafSize : 1
nSamples : 150
Performing classification with 3 categories.
Using Gradient Boosting Classification with postive class: True
Total training time (seconds): 0.002043676
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv
Target is C:Class in feature 4
Error: 1
Outputting label predicted actual tsv to pred.tsv
panic: interface conversion: CloudForest.VoteTallyer is *CloudForest.CatBallotBox, not *CloudForest.NumBallotBox

goroutine 1 [running]:
main.main()
    /Users/jaten/go/src/github.com/lytics/CloudForest/applyforest/applyforest.go:96 +0x14af
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ git log|head
commit 1af2b08f235b69b4393276f293055e1d89d1a555
Merge: 0e79e70 7075fe5
Author: Tim Kaye <tim.kaye@lytics.io>
Date:   Wed Nov 21 10:24:03 2018 -0800

    Merge pull request #18 from lytics/determinstic_ballotbox

    Ensure that predictions/tallies are always deterministic

commit 7075fe572867b7e6ac7ea9fc4fea4f9f603c6025
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ go version
go version go1.10.2 darwin/amd64

glycerine commented 5 years ago

I get all NaN predictions if I try using the -sum flag.

jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv -sum
Target is C:Class in feature 4
Error: 1
Outputting label predicted actual tsv to pred.tsv
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ head pred.tsv
1       NaN     Iris-setosa
2       NaN     Iris-setosa
3       NaN     Iris-setosa
4       NaN     Iris-setosa
5       NaN     Iris-setosa
6       NaN     Iris-setosa
7       NaN     Iris-setosa
8       NaN     Iris-setosa
9       NaN     Iris-setosa
10      NaN     Iris-setosa
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $

timkaye11 commented 5 years ago

To avoid the panic, I think we need to do something like this on linke 96 (where the panic happens):

https://github.com/lytics/CloudForest/blob/1af2b08f235b69b4393276f293055e1d89d1a555/applyforest/applyforest.go#L59-L69

glycerine commented 5 years ago

I toyed with that a little.

Could we pop out to the bigger picture, for a moment?

Does the requirement to have the user need to specify the type of ballot box depending on the type of target used seem a little odd to you?

What I mean is, I would usually expect to call a Predict() method, and the model -- which already knows what target was used during building -- would "do the right thing" -- would behave properly (polymorphically) for the type of Forest or ForestModel at hand.

Then the -mean and -sum flags would be rendered unnecessary.

timkaye11 commented 5 years ago

Does the requirement to have the user need to specify the type of ballot box depending on the type of target used seem a little odd to you?

Yeah I think that is unusual. Having the ballot box type be inferred sounds like a better approach. However, there isn't really a good way of determining the "Model type" from a Forest (ie there are no flags/fields on the forest struct that describe it) https://github.com/lytics/CloudForest/blob/1af2b08f235b69b4393276f293055e1d89d1a555/forest.go#L17-L22

So one possible solution would be to add a ModelType field (or something analogous) to the forest. This would eliminate the need for the -mean and -sum flags.

maybe something like this?

func predict(fm *FeatureMatrix, f *Forest) VoteTallyer {
    data := fm.Data[fm.Map[f.Target]]
    n := data.Length()
    var tallyBox VoteTallyer
    switch data.(type) {
    case *DenseNumFeature:
        tallyBox = NewNumBallotBox(n)
    case *DenseCatFeature:
        tallyBox = NewCatBallotBox(n)
    default:
                if f.Type == "gbm" { 
                    tallyBox = NewSumBallotBox(n)
                 }
        return nil
    }
}

glycerine commented 5 years ago

That all makes sense. Would not ideally a Predict() method would return not a VoteTallyer but the the set of predictions (Yhat(s)) corresponding to the fm data input?

func Predict(fm *FeatureMatrix, f *Forest) (yhat []float64) { ... }

The only tricky part is for classifiers rather than regressions. Then yhat is used to being []string, a slice of category strings predicted by the classifier, rather than []float64. But perhaps a map from category to encoding in whole numbers could be included as a part of the feature matrix... oh, wait, it is already there, as the Map map[string]int, I believe.

type FeatureMatrix struct {
    Data       []Feature
    Map        map[string]int
    CaseLabels []string
}

So would returning []float64 work for all types of trees, do you think?

timkaye11 commented 5 years ago

predict was a poor choice for the function name above. Instead something like

func getTallyerForForest(fm *FeatureMatrix, f *Forest) VoteTallyer { 
...
}

could be better.

If you wanted to have a Predict function that supports all model-types, it might make sense to return []string instead of a []float64, since all the ballot-boxes implement Tally(i int) (predicted string)

so something like:

func PredictStr(fm *FeatureMatrix, f *Forest) []string {
   tallyer := getTallyerForForest(fm, f)
   for _, tree := range f.Trees {
      tree.Vote(fm, tallyer)
   }
   var preds []string 
   for i :=0; i < fm.Data[0].Length(); i++ {
      preds = append(preds, tallyer.Tally(i))
   }
   return preds
}

glycerine commented 5 years ago

well, regressors will need the float64, and parsing strings into float64 is slow/expensive.

glycerine commented 5 years ago

I like your function. How about a simple "union" approach to the return value.

type Predictions struct {
  // only one of Real or Cat will be populated
  Real []float64
  Cat  []string
  IsReal bool // true for regression Mean and gradient boosted Sum talliers; else Cat is used for classifier output
}

then Predict returns the right field populated, the other with length 0.

func Predict(fm *FeatureMatrix, f *Forest) (*Predictions, error) { 
   tallyer := getTallyerForForest(fm, f)
   for _, tree := range f.Trees {
      tree.Vote(fm, tallyer)
   }

   if isCat {
      var preds []string 
      for i :=0; i < fm.Data[0].Length(); i++ {
         preds = append(preds, tallyer.Tally(i))
      }
      return &Predictions{Cat:preds}, nil
   }
   // the logic for the regression and gradient boosting goes here
    ...
    return &Predictions{Real: preds, IsCat:false}, nil
}

glycerine commented 5 years ago

incomplete, but this pull requests gives an idea of what I was thinking:

https://github.com/lytics/CloudForest/pull/25

lytics / CloudForest

applyforest not happy with gradient boosting via -gbt #24