namoopsoo commented 7 years ago

`lambda-ml.random-forest`

https://cloudkj.github.io/lambda-ml/lambda-ml.random-forest.html

example usage

(def data [[0 0 0] [0 1 1] [1 0 1] [1 1 0]])
(def fit
(let [n 1001
    min-split 2
    min-leaf 1
    max-features 2]
(-> (make-random-forest-classifier n min-split min-leaf max-features)
    (random-forest-fit data))))
(random-forest-predict fit (map butlast data))
;;=> (0 1 1 0)

namoopsoo commented 7 years ago

load , split and classify...

working with a simple dataset from https://github.com/namoopsoo/learn-citibike/issues/13

In [39]: simpledf.head()
Out[39]: 
start_postal_code  start_sublocality  start_neighborhood  start_day  \
0                  1                  2                  17          3   
1                  1                  2                  17          3   
2                  1                  2                  17          3   
3                  1                  2                  17          3   
4                  1                  2                  17          3   

start_hour   age  gender  end_neighborhood  
0           0  42.0       1                17  
1           7  17.0       1                17  
2           9  42.0       1                17  
3          14  23.0       1                17  
4          15  45.0       1                17

grab it... following some tips from https://www.packtpub.com/books/content/working-incanter-datasets on Incanter dataset practices
```
(def fname-medium
"datas/201510-citibike-tripdata.medium-simple.csv")
```

; get data as vector of records first.. (def simple-data (load-csv-data-to-maps fname-medium))

; then make an Incanter dataset... (def simple-dataset (incore/to-dataset simple-data))

; matrix-project.core=> (incore/col-names simple-dataset) ; [: :start_postal_code :start_sublocality :start_neighborhood :start_day :start_hour :age :gender :end_neighborhood]

; matrix-project.core=> (incore/nrow simple-dataset) ; 1035378

; fancy column selection, reminds me of pandas ; matrix-project.core=> (incore/ncol (incore/$ [:start_hour :age] simple-dataset)) ; 2

* also this is kind of impressive...
```clojure
matrix-project.core=> (incore/view simple-dataset)

=>

matrix operations...

Lets see if i can do some sampling...


(def simple-matrix (incore/to-matrix simple-dataset))

; matrix-project.core=> (def simple-matrix (incore/to-matrix simple-dataset)) ; #'matrix-project.core/simple-matrix

; selecting columns.. ; matrix-project.core=> (incore/$ [:start_postal_code :end_neighborhood] simple-dataset)

; selecting rows and columns.. matrix-project.core=> (incore/$ (range 5) [:start_postal_code :end_neighborhood] simple-dataset)

| :start_postal_code | :end_neighborhood | |--------------------+-------------------| | 1 | 17 | | 1 | 17 | | 1 | 17 | | 1 | 17 | | 1 | 17 |

* heh this is funny.. perhaps `pandas` as to `Incanter` as `numpy` is to `clatrix` ? Similar parallel evolution perhaps?
```clojure
; matrix-project.core=> (type simple-matrix)
; clatrix.core.Matrix

random sampling per https://www.packtpub.com/books/content/working-incanter-datasets by using a special Incanter operator :$fn,
```
(def random-half-dataset (incore/$where {:start_postal_code {incore/:$fn (fn [_] (< (rand) 0.5))}} simple-dataset))
```

(def random-half-dataset (incore/$where
{:start_postalcode {:incore/$fn (fn [] (< (rand) 0.5))}} simple-dataset))

; hmm ... getting weird errors..


#### above random sampling not quite working. for now let me just split the data in half lazily since i know how to select rows..
```clojure
; matrix-project.core=> (incore/nrow simple-dataset)
; 1035378

;  => 1035378 many rows.. 
; so first 80% will be training 

; matrix-project.core=> (* 1035378 0.8)
; 828302.4
; => (range 828302)  and (range 828302 1035378)
; Train
(time (def dumb-training-dataset 
    (incore/$  (range 828302) :all simple-dataset)))

; Holdout
; and last 20% , use as a holdout
(time (def dumb-holdout-dataset 
    (incore/$ (range 828302 1035378) :all simple-dataset)))

; Wow thats slow ^^^ 
; had to stop this because it was taking forever...

And most likely i was doing the random sampling using $fn as part of incanter.core but i think now it is more like a special keyword perhaps like the special :all keyword...
```
; perhaps like this ? ...
(def training-dataset (incore/$where  
 {:start_postal_code {:$fn (fn [_] (< (rand) 0.8))}} 
 simple-dataset))
```

; hmm looks like that worked... and it was nearly instantaneous... matrix-project.core=> (def training-dataset (incore/$where

_=> {:start_postalcode {:$fn (fn [] (< (rand) 0.8))}}

             #_=>      simple-dataset))

'matrix-project.core/training-dataset

matrix-project.core=> (incore/nrow training-dataset) 827963

; Although now i would need to take the remaining rows for the second dataset. (def training-indices (incore/$ :index training-dataset)) (def training-indices-set (set training-indices))

(def test-dataset (incore/$where
{:index {:$fn (fn [x] (= (conj training-indices-set x) training-indices-set))}} simple-dataset))

; hmm... this still isnt working... going to use python for the train test split for now...

; set membership ; (= (conj s :e) #{:a :b :c :d :e}) ; -> true


#### all together now..
* ..
```clojure
(def fname-medium-train
  "datas/201510-citibike-tripdata.medium-simple-train.csv")
(def fname-medium-holdout
  "datas/201510-citibike-tripdata.medium-simple-holdout.csv")

; get data as vector of records first..
(def simple-data-train (load-csv-data-to-maps fname-medium-train))
(def simple-data-holdout (load-csv-data-to-maps fname-medium-holdout))

; then make an Incanter dataset...
(def simple-dataset-train (incore/to-dataset simple-data-train))
(def simple-dataset-holdout (incore/to-dataset simple-data-holdout))

(def training-matrix (incore/to-matrix simple-dataset-train))
(def holdout-matrix (incore/to-matrix simple-dataset-holdout))

; classifier..
(time (def myclassifier (do-forest-train-classifier training-matrix)))
; matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
; "Elapsed time: 0.620564 msecs"
#'matrix-project.core/myclassifier

; ==> seems to be lazy heh..

(def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))

okay so small scale worked...


matrix-project.core=> (def myclassifier (do-forest-train-classifier (take 10 training-matrix)))
#'matrix-project.core/myclassifier
matrix-project.core=>

matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast (take 10 holdout-matrix))))

'matrix-project.core/outs

matrix-project.core=> outs (8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0) matrix-project.core=>

* large scale crapped out...
```clojure

matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
"Elapsed time: 0.620564 msecs"
#'matrix-project.core/myclassifier
matrix-project.core=> 

matrix-project.core=> 

matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))

CompilerException java.lang.StackOverflowError, compiling:(form-init210517701811841828.clj:1:11) 
matrix-project.core=>

Uh oh, https://examples.javacodegeeks.com/java-basics/exceptions/java-lang-stackoverflowerror-how-to-solve-stackoverflowerror/ , indicates ...

"the StackOverflowError in Java. This error is thrown to indicate that the application’s stack was exhausted, due to deep recursion."
This makes me think perhaps the random forest code was not written to handle large data..hmm..

namoopsoo commented 7 years ago

other notes

namoopsoo commented 7 years ago

Quick try of the decision tree classifier


 (time (def myclassifier (do-tree-train-classifier training-matrix)))   ; taking long time ..
; (time (def myclassifier (do-tree-train-classifier (take 10000 training-matrix))))

(def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))

; quick compare results...
(defn make-pairs [list1 list2] (partition 2 (interleave list1 list2)))
(def pairs (make-pairs outs (map last holdout-matrix)))
(count  (filter true? (map #(apply = %1) pairs)))

==>


matrix-project.core=> (time (def myclassifier (do-tree-train-classifier (take 10000 training-matrix))))
"Elapsed time: 129409.438842 msecs"
#'matrix-project.core/myclassifier
matrix-project.core=>

matrix-project.core=>

matrix-project.core=> (def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))

'matrix-project.core/outs

matrix-project.core=> (def pairs (make-pairs outs (map last holdout-matrix)))

'matrix-project.core/pairs

matrix-project.core=> (first pairs) (8.0 7.0) matrix-project.core=> (def comparisons (map #(apply = %1) pairs))

'matrix-project.core/comparisons

matrix-project.core=> (first comparisons) false matrix-project.core=> (def true-comparisons (filter true? comparisons))

'matrix-project.core/true-comparisons

matrix-project.core=> (count true-comparisons) 89244 matrix-project.core=> (count holdout-matrix) 207076 matrix-project.core=> (float (/ 89244 207076)) 0.4309722


#### Comments
* Okay, so on commit `bcfc435` , it looks like although a super rudimentary forest classifier was messing up, but a single tree classifier trained on `10000` , using the `max-features 2` option, has chosen the `end_neighborhood` `43%` of the time here.
* reminder on the holdout data...
```python
In [26]: holdout_df.shape
Out[26]: (207076, 8)

In [27]: holdout_df.head()
Out[27]: 
        start_postal_code  start_sublocality  start_neighborhood  start_day  \
782530                  2                  2                  16          5   
939491                  5                  2                  16          0   
211090                  9                  2                  16          1   
290972                 29                  2                  17          4   
29890                   9                  2                  17          0   

        start_hour   age  gender  end_neighborhood  
782530          10  39.0       1                16  
939491          21  59.0       1                16  
211090          17  48.0       1                17  
290972          16  50.0       1                17  
29890            7  44.0       2                16  

In [28]: holdout_df.end_neighborhood.value_counts()
Out[28]: 
17    92189
16    70703
20     7468
21     6504
22     5062
18     2748
15     2581
7      2303
11     1968
19     1960
2      1662
12     1652
10     1576
9      1388
6      1267
4      1235
14     1185
13     1121
5       978
3       543
8       404
1       294
23      285
Name: end_neighborhood, dtype: int64

not quite sure how to figure out what were the feature importances however... this didnt yield much

matrix-project.core=> (keys myclassifier)
(:cost :prediction :weighted :min-split :min-leaf :max-features :parameters)

namoopsoo / play-clj-ml

Milestone 3: make simple lambda-ml.random-forest classifier model #3

`lambda-ml.random-forest`

load , split and classify...

matrix operations...

_=> {:start_postalcode {:$fn (fn [] (< (rand) 0.8))}}

'matrix-project.core/training-dataset

'matrix-project.core/outs

any other libraries to try ?

spark

bigml

other notes

Quick try of the decision tree classifier

'matrix-project.core/outs

'matrix-project.core/pairs

'matrix-project.core/comparisons

'matrix-project.core/true-comparisons