ryanbressler / CloudForest

Ensembles of decision trees in go/golang.

Need balanced bagging, other strategies, for unbalanced data. #6

Closed ryanbressler closed 10 years ago

ryanbressler commented 10 years ago

The plan is to implement balanced sampling of cases with replacement at the bagging level as follows:

1. Sample which class to draw from (uniform distribution to ensure balance on average).
2. Draw a case from that class with replacement.
3. Repeat.

We already have cost weighted classification. Please comment or open issues with other strategies.

References:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/
http://www.biomedcentral.com/1471-2105/11/523
http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863

ryanbressler commented 10 years ago

I implemented simple balanced bagging with the -ballance option but haven't had time to test it heavily. It is implemented as:

build list of samples per category
loop nSamples times
    draw a category
    draw a sample from that category

Largely in this file: https://github.com/ryanbressler/CloudForest/blob/master/sampeling.go#L7

ryanbressler commented 10 years ago

We have a few different methods now.