Open namoopsoo opened 7 years ago
working with a simple dataset from https://github.com/namoopsoo/learn-citibike/issues/13
In [39]: simpledf.head()
Out[39]:
start_postal_code start_sublocality start_neighborhood start_day \
0 1 2 17 3
1 1 2 17 3
2 1 2 17 3
3 1 2 17 3
4 1 2 17 3
start_hour age gender end_neighborhood
0 0 42.0 1 17
1 7 17.0 1 17
2 9 42.0 1 17
3 14 23.0 1 17
4 15 45.0 1 17
(def fname-medium
"datas/201510-citibike-tripdata.medium-simple.csv")
; get data as vector of records first.. (def simple-data (load-csv-data-to-maps fname-medium))
; then make an Incanter dataset... (def simple-dataset (incore/to-dataset simple-data))
; matrix-project.core=> (incore/col-names simple-dataset) ; [: :start_postal_code :start_sublocality :start_neighborhood :start_day :start_hour :age :gender :end_neighborhood]
; matrix-project.core=> (incore/nrow simple-dataset) ; 1035378
; fancy column selection, reminds me of pandas ; matrix-project.core=> (incore/ncol (incore/$ [:start_hour :age] simple-dataset)) ; 2
* also this is kind of impressive...
```clojure
matrix-project.core=> (incore/view simple-dataset)
(def simple-matrix (incore/to-matrix simple-dataset))
; matrix-project.core=> (def simple-matrix (incore/to-matrix simple-dataset)) ; #'matrix-project.core/simple-matrix
; selecting columns.. ; matrix-project.core=> (incore/$ [:start_postal_code :end_neighborhood] simple-dataset)
; selecting rows and columns.. matrix-project.core=> (incore/$ (range 5) [:start_postal_code :end_neighborhood] simple-dataset)
| :start_postal_code | :end_neighborhood | |--------------------+-------------------| | 1 | 17 | | 1 | 17 | | 1 | 17 | | 1 | 17 | | 1 | 17 |
* heh this is funny.. perhaps `pandas` as to `Incanter` as `numpy` is to `clatrix` ? Similar parallel evolution perhaps?
```clojure
; matrix-project.core=> (type simple-matrix)
; clatrix.core.Matrix
:$fn
,
(def random-half-dataset (incore/$where {:start_postal_code {incore/:$fn (fn [_] (< (rand) 0.5))}} simple-dataset))
(def random-half-dataset (incore/$where
{:start_postalcode {:incore/$fn (fn [] (< (rand) 0.5))}}
simple-dataset))
; hmm ... getting weird errors..
#### above random sampling not quite working. for now let me just split the data in half lazily since i know how to select rows..
```clojure
; matrix-project.core=> (incore/nrow simple-dataset)
; 1035378
; => 1035378 many rows..
; so first 80% will be training
; matrix-project.core=> (* 1035378 0.8)
; 828302.4
; => (range 828302) and (range 828302 1035378)
; Train
(time (def dumb-training-dataset
(incore/$ (range 828302) :all simple-dataset)))
; Holdout
; and last 20% , use as a holdout
(time (def dumb-holdout-dataset
(incore/$ (range 828302 1035378) :all simple-dataset)))
; Wow thats slow ^^^
; had to stop this because it was taking forever...
$fn
as part of incanter.core
but i think now it is more like a special keyword perhaps like the special :all
keyword...
; perhaps like this ? ...
(def training-dataset (incore/$where
{:start_postal_code {:$fn (fn [_] (< (rand) 0.8))}}
simple-dataset))
; hmm looks like that worked... and it was nearly instantaneous... matrix-project.core=> (def training-dataset (incore/$where
#_=> simple-dataset))
matrix-project.core=> (incore/nrow training-dataset) 827963
; Although now i would need to take the remaining rows for the second dataset. (def training-indices (incore/$ :index training-dataset)) (def training-indices-set (set training-indices))
(def test-dataset (incore/$where
{:index {:$fn (fn [x] (= (conj training-indices-set x) training-indices-set))}}
simple-dataset))
; hmm... this still isnt working... going to use python for the train test split for now...
; set membership ; (= (conj s :e) #{:a :b :c :d :e}) ; -> true
#### all together now..
* ..
```clojure
(def fname-medium-train
"datas/201510-citibike-tripdata.medium-simple-train.csv")
(def fname-medium-holdout
"datas/201510-citibike-tripdata.medium-simple-holdout.csv")
; get data as vector of records first..
(def simple-data-train (load-csv-data-to-maps fname-medium-train))
(def simple-data-holdout (load-csv-data-to-maps fname-medium-holdout))
; then make an Incanter dataset...
(def simple-dataset-train (incore/to-dataset simple-data-train))
(def simple-dataset-holdout (incore/to-dataset simple-data-holdout))
(def training-matrix (incore/to-matrix simple-dataset-train))
(def holdout-matrix (incore/to-matrix simple-dataset-holdout))
; classifier..
(time (def myclassifier (do-forest-train-classifier training-matrix)))
; matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
; "Elapsed time: 0.620564 msecs"
#'matrix-project.core/myclassifier
; ==> seems to be lazy heh..
(def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))
matrix-project.core=> (def myclassifier (do-forest-train-classifier (take 10 training-matrix)))
#'matrix-project.core/myclassifier
matrix-project.core=>
matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast (take 10 holdout-matrix))))
matrix-project.core=> outs (8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0) matrix-project.core=>
* large scale crapped out...
```clojure
matrix-project.core=> (time (def myclassifier (do-forest-train-classifier training-matrix)))
"Elapsed time: 0.620564 msecs"
#'matrix-project.core/myclassifier
matrix-project.core=>
matrix-project.core=>
matrix-project.core=> (def outs (random-forest-predict myclassifier (map butlast holdout-matrix)))
CompilerException java.lang.StackOverflowError, compiling:(form-init210517701811841828.clj:1:11)
matrix-project.core=>
"the StackOverflowError in Java. This error is thrown to indicate that the application’s stack was exhausted, due to deep recursion."
Infer , this looks most promising so far: https://github.com/aria42/infer (although old hmm?)
the one I started trying random forest with : https://github.com/cloudkj/lambda-ml
deep learning libraries ?
Amazon and Google ML services recommended ... https://juxt.pro/blog/posts/machine-learning-with-clojure.html
This is the Weka based library: https://github.com/antoniogarrote/clj-ml
another Weka based library looks like , https://github.com/plandes/clj-ml-model
Tight TF-IDF implementation https://gist.github.com/aamar/3851520
Incanter may have a bayes model but not sure ... http://incanter.org/docs/api/#member-incanter.stats-predict
Another List: https://github.com/josephmisiti/awesome-machine-learning#clojure
numenta's libs https://github.com/htm-community/clortex
another neural net type library: https://github.com/thinktopic/cortex
(time (def myclassifier (do-tree-train-classifier training-matrix))) ; taking long time ..
; (time (def myclassifier (do-tree-train-classifier (take 10000 training-matrix))))
(def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))
; quick compare results...
(defn make-pairs [list1 list2] (partition 2 (interleave list1 list2)))
(def pairs (make-pairs outs (map last holdout-matrix)))
(count (filter true? (map #(apply = %1) pairs)))
matrix-project.core=> (time (def myclassifier (do-tree-train-classifier (take 10000 training-matrix))))
"Elapsed time: 129409.438842 msecs"
#'matrix-project.core/myclassifier
matrix-project.core=>
matrix-project.core=>
matrix-project.core=> (def outs (decision-tree-predict myclassifier (map butlast holdout-matrix)))
matrix-project.core=> (def pairs (make-pairs outs (map last holdout-matrix)))
matrix-project.core=> (first pairs) (8.0 7.0) matrix-project.core=> (def comparisons (map #(apply = %1) pairs))
matrix-project.core=> (first comparisons) false matrix-project.core=> (def true-comparisons (filter true? comparisons))
matrix-project.core=> (count true-comparisons) 89244 matrix-project.core=> (count holdout-matrix) 207076 matrix-project.core=> (float (/ 89244 207076)) 0.4309722
#### Comments
* Okay, so on commit `bcfc435` , it looks like although a super rudimentary forest classifier was messing up, but a single tree classifier trained on `10000` , using the `max-features 2` option, has chosen the `end_neighborhood` `43%` of the time here.
* reminder on the holdout data...
```python
In [26]: holdout_df.shape
Out[26]: (207076, 8)
In [27]: holdout_df.head()
Out[27]:
start_postal_code start_sublocality start_neighborhood start_day \
782530 2 2 16 5
939491 5 2 16 0
211090 9 2 16 1
290972 29 2 17 4
29890 9 2 17 0
start_hour age gender end_neighborhood
782530 10 39.0 1 16
939491 21 59.0 1 16
211090 17 48.0 1 17
290972 16 50.0 1 17
29890 7 44.0 2 16
In [28]: holdout_df.end_neighborhood.value_counts()
Out[28]:
17 92189
16 70703
20 7468
21 6504
22 5062
18 2748
15 2581
7 2303
11 1968
19 1960
2 1662
12 1652
10 1576
9 1388
6 1267
4 1235
14 1185
13 1121
5 978
3 543
8 404
1 294
23 285
Name: end_neighborhood, dtype: int64
matrix-project.core=> (keys myclassifier)
(:cost :prediction :weighted :min-split :min-leaf :max-features :parameters)
lambda-ml.random-forest