namoopsoo / play-clj-ml

Messing around with machine learning in clojure

Milestone 3: make simple lambda-ml.random-forest classifier model #3

Open namoopsoo opened 7 years ago

namoopsoo commented 7 years ago

lambda-ml.random-forest

namoopsoo commented 7 years ago

load, split and classify...

```clojure
;; get data as a vector of records first..
(def simple-data (load-csv-data-to-maps fname-medium))

;; then make an Incanter dataset...
(def simple-dataset (incore/to-dataset simple-data))

matrix-project.core=> (incore/col-names simple-dataset)
;; [:start_postal_code :start_sublocality :start_neighborhood :start_day
;;  :start_hour :age :gender :end_neighborhood]

matrix-project.core=> (incore/nrow simple-dataset)
;; 1035378

;; fancy column selection, reminds me of pandas
matrix-project.core=> (incore/ncol (incore/$ [:start_hour :age] simple-dataset))
;; 2
```

* also this is kind of impressive...
```clojure
matrix-project.core=> (incore/view simple-dataset)
```
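The `load-csv-data-to-maps` helper used above isn't shown anywhere in this thread; a minimal sketch of what it might look like in plain Clojure (hypothetical implementation, assuming a header row and no quoted fields):

```clojure
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn load-csv-data-to-maps
  "Hypothetical sketch: read a simple CSV (header row, no quoted fields)
  into a vector of maps keyed by keywordized column names."
  [fname]
  (with-open [r (io/reader fname)]
    (let [[header & rows] (mapv #(str/split % #",") (line-seq r))
          ks (mapv keyword header)]
      ;; mapv forces realization before with-open closes the reader
      (mapv #(zipmap ks %) rows))))
```

A real version would want `clojure.data.csv` for proper quoting; this is just the shape of the thing.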

matrix operations...

```clojure
matrix-project.core=> (def simple-matrix (incore/to-matrix simple-dataset))
;; #'matrix-project.core/simple-matrix

;; selecting columns..
matrix-project.core=> (incore/$ [:start_postal_code :end_neighborhood] simple-dataset)

;; selecting rows and columns..
matrix-project.core=> (incore/$ (range 5) [:start_postal_code :end_neighborhood] simple-dataset)
```

| :start_postal_code | :end_neighborhood |
|--------------------+-------------------|
|                  1 |                17 |
|                  1 |                17 |
|                  1 |                17 |
|                  1 |                17 |
|                  1 |                17 |

* heh this is funny.. perhaps `pandas` is to `Incanter` as `numpy` is to `clatrix`? Similar parallel evolution perhaps?
```clojure
matrix-project.core=> (type simple-matrix)
;; clatrix.core.Matrix

(def random-half-dataset
  (incore/$where
    {:start_postalcode {:incore/$fn (fn [] (< (rand) 0.5))}}
    simple-dataset))

;; hmm ... getting weird errors..
;; (probably because the predicate key should be :$fn, not :incore/$fn,
;;  and $where passes the column value to the fn, so it needs one argument)
```


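A random split can be done without `$where` at all; a plain-Clojure sketch (assuming the data is available as a seq of row maps, e.g. `simple-data` from above):

```clojure
;; Randomly partition rows into two groups by independent coin flips.
;; Group sizes are only approximately n*p and n*(1-p).
(defn random-split [rows p]
  (let [{a true b false} (group-by (fn [_] (< (rand) p)) rows)]
    [a b]))

;; e.g. (let [[train holdout] (random-split simple-data 0.8)] ...)
```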
#### Above random sampling not quite working; for now let me just split the data in half lazily, since I know how to select rows..
```clojure
; matrix-project.core=> (incore/nrow simple-dataset)
; 1035378

;  => 1035378 many rows.. 
; so first 80% will be training 

; matrix-project.core=> (* 1035378 0.8)
; 828302.4
; => (range 828302)  and (range 828302 1035378)
; Train
(time (def dumb-training-dataset 
    (incore/$  (range 828302) :all simple-dataset)))

; Holdout
; and last 20% , use as a holdout
(time (def dumb-holdout-dataset 
    (incore/$ (range 828302 1035378) :all simple-dataset)))

; Wow thats slow ^^^ 
; had to stop this because it was taking forever...

;; hmm looks like that worked... and it was nearly instantaneous...
matrix-project.core=> (def training-dataset (incore/$where
                 #_=>   {:start_postalcode {:$fn (fn [] (< (rand) 0.8))}}
                 #_=>   simple-dataset))
#'matrix-project.core/training-dataset
matrix-project.core=> (incore/nrow training-dataset)
827963

;; Although now i would need to take the remaining rows for the second dataset.
(def training-indices (incore/$ :index training-dataset))
(def training-indices-set (set training-indices))

(def test-dataset
  (incore/$where
    {:index {:$fn (fn [x] (= (conj training-indices-set x) training-indices-set))}}
    simple-dataset))

;; hmm... this still isnt working... going to use python for the train test split for now...

;; set membership trick used above:
;; (= (conj s :e) #{:a :b :c :d :e})  ; -> true
```
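Since a Clojure set is itself a predicate, the complement selection attempted above can be done without `$where`; a deterministic index-based sketch in plain Clojure (assuming `rows` is a vector of records, none of which are nil/false):

```clojure
;; Shuffle the indices once, keep the first n*p for training,
;; and use set membership to route every row to exactly one side.
(defn index-split [rows p]
  (let [n         (count rows)
        train-idx (set (take (long (* n p)) (shuffle (range n))))]
    [(keep-indexed (fn [i r] (when (train-idx i) r)) rows)
     (keep-indexed (fn [i r] (when-not (train-idx i) r)) rows)]))
```

Unlike the coin-flip approach, this gives exact 80/20 sizes and the two halves are guaranteed disjoint.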


#### all together now..
```clojure
(def fname-medium-train
  "datas/201510-citibike-tripdata.medium-simple-train.csv")
(def fname-medium-holdout
  "datas/201510-citibike-tripdata.medium-simple-holdout.csv")

; get data as vector of records first..
(def simple-data-train (load-csv-data-to-maps fname-medium-train))
(def simple-data-holdout (load-csv-data-to-maps fname-medium-holdout))

; then make an Incanter dataset...
(def simple-dataset-train (incore/to-dataset simple-data-train))
(def simple-dataset-holdout (incore/to-dataset simple-data-holdout))

(def training-matrix (incore/to-matrix simple-dataset-train))
(def holdout-matrix (incore/to-matrix simple-dataset-holdout))

; classifier..
(time (def myclassifier (do-forest-train-classifier training-matrix)))
;; matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
;; "Elapsed time: 0.620564 msecs"
;; #'matrix-project.core/myclassifier

;; ==> seems to be lazy heh..

(def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))

matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast (take 10 holdout-matrix))))
#'matrix-project.core/outs
matrix-project.core=> outs
(8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0)
```

* large scale crapped out...
```clojure

matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
"Elapsed time: 0.620564 msecs"
#'matrix-project.core/myclassifier
matrix-project.core=> 

matrix-project.core=> 

matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))

CompilerException java.lang.StackOverflowError, compiling:(form-init210517701811841828.clj:1:11) 
matrix-project.core=>
```
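The sub-millisecond "Elapsed time" above is the clue: `time` only measured building a lazy value, and the real work (and the `StackOverflowError`) only happens when the result is realized. A small illustration of the same effect:

```clojure
;; Defining a lazy seq is nearly instant; realization pays the cost.
(time (def lazy-result (map inc (range 1000000))))  ; ~0 msecs
(time (count lazy-result))                          ; the work happens here
;; to force the work inside the first `time`, wrap the body in doall:
;; (time (def eager-result (doall (map inc (range 1000000)))))
```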
namoopsoo commented 7 years ago

any other libraries to try?

- spark
- bigml

other notes

namoopsoo commented 7 years ago

Quick try of the decision tree classifier


```clojure
(time (def myclassifier (do-tree-train-classifier training-matrix)))  ; taking long time ..
;; (time (def myclassifier (do-tree-train-classifier (take 10000 training-matrix))))

(def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))

;; quick compare results...
(defn make-pairs [list1 list2] (partition 2 (interleave list1 list2)))
(def pairs (make-pairs outs (map last holdout-matrix)))
(count (filter true? (map #(apply = %1) pairs)))
```

```clojure
matrix-project.core=> (def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))
#'matrix-project.core/outs
matrix-project.core=> (def pairs (make-pairs outs (map last holdout-matrix)))
#'matrix-project.core/pairs
matrix-project.core=> (first pairs)
(8.0 7.0)
matrix-project.core=> (def comparisons (map #(apply = %1) pairs))
#'matrix-project.core/comparisons
matrix-project.core=> (first comparisons)
false
matrix-project.core=> (def true-comparisons (filter true? comparisons))
#'matrix-project.core/true-comparisons
matrix-project.core=> (count true-comparisons)
89244
matrix-project.core=> (count holdout-matrix)
207076
matrix-project.core=> (float (/ 89244 207076))
0.4309722
```
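The interleave/partition comparison above amounts to an accuracy calculation; it can be collapsed into one pass with `map =` (sketch, assuming equal-length prediction and label seqs):

```clojure
;; Fraction of positions where predicted equals actual.
(defn accuracy [predicted actual]
  (/ (count (filter true? (map = predicted actual)))
     (double (count actual))))
```

On the numbers above this gives the same 89244/207076 ≈ 0.431.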


#### Comments
* Okay, so on commit `bcfc435`, it looks like although the super rudimentary forest classifier was messing up, a single tree classifier trained on `10000` rows, using the `max-features 2` option, chose the correct `end_neighborhood` `43%` of the time here.
* reminder on the holdout data...
```python
In [26]: holdout_df.shape
Out[26]: (207076, 8)

In [27]: holdout_df.head()
Out[27]: 
        start_postal_code  start_sublocality  start_neighborhood  start_day  \
782530                  2                  2                  16          5   
939491                  5                  2                  16          0   
211090                  9                  2                  16          1   
290972                 29                  2                  17          4   
29890                   9                  2                  17          0   

        start_hour   age  gender  end_neighborhood  
782530          10  39.0       1                16  
939491          21  59.0       1                16  
211090          17  48.0       1                17  
290972          16  50.0       1                17  
29890            7  44.0       2                16  

In [28]: holdout_df.end_neighborhood.value_counts()
Out[28]: 
17    92189
16    70703
20     7468
21     6504
22     5062
18     2748
15     2581
7      2303
11     1968
19     1960
2      1662
12     1652
10     1576
9      1388
6      1267
4      1235
14     1185
13     1121
5       978
3       543
8       404
1       294
23      285
Name: end_neighborhood, dtype: int64
```
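One note on that `43%`: from the value counts above, always predicting the most common `end_neighborhood` (class `17`) would already score `92189/207076`, so the tree is roughly at the majority-class baseline rather than beating it:

```clojure
;; Majority-class baseline on the holdout set (counts from the pandas output above).
(float (/ 92189 207076))  ; ≈ 0.445, vs the tree's 0.431
```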