milestone 1: get basic data annotated using earlier geo annotation

"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender" "171","10/1/2015 00:00:02","10/1/2015 00:02:54","388","W 26 St & 10 Ave","40.749717753","-74.002950346","494","W 26 St & 8 Ave","40.74734825","-73.99723551","24302","Subscriber","1973","1"

Basic data...

I chose to build a dataset from just the start sublocality and end sublocality of each trip (these are the start and end boroughs).
```
In [133]: simpledf = annotate_geo.make_dead_simple_df(next_df)

In [134]: simpledf.shape
Out[134]: (888989, 2)

In [135]: simpledf.head()
Out[135]:
   start_sublocality  end_sublocality
0                  2                2
1                  2                2
2                  2                2
3                  2                2
4                  2                2

In [138]: simpledf.start_sublocality.value_counts()
Out[138]:
2    790758
1     88920
3      9311
Name: start_sublocality, dtype: int64
```
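
The integer codes in the output above stand for boroughs. Here is a minimal sketch of that kind of encoding, written in plain Python rather than pandas. It assumes borough names as input and the Brooklyn/Manhattan/Queens codes used in this write-up. `encode_trip` and the toy trips are hypothetical and are not taken from `annotate_geo`.

```python
from collections import Counter

# Borough codes as used in this write-up.
BOROUGH_CODES = {'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3}


def encode_trip(start_borough, end_borough):
    """Hypothetical helper: turn a pair of borough names into integer codes."""
    return BOROUGH_CODES[start_borough], BOROUGH_CODES[end_borough]


# Made-up trips, only to exercise the helper.
toy_trips = [
    ('Manhattan', 'Manhattan'),
    ('Brooklyn', 'Manhattan'),
    ('Manhattan', 'Manhattan'),
    ('Queens', 'Queens'),
]
encoded = [encode_trip(start, end) for start, end in toy_trips]

# Rough analogue of value_counts() on the start column.
start_counts = Counter(start for start, _ in encoded)
print(encoded)
print(start_counts.most_common())
```

On a pandas column of borough names, the same idea can be expressed by passing the dict to `Series.map`.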

* Here, `make_dead_simple_df()` uses a simple borough-to-integer mapping:
```python
{'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3}
```

Looking at the connections made in this October month...
```
In [147]: simpledf['unit'] = [1]*simpledf.shape[0]

In [149]: gpby = simpledf.groupby(by=['start_sublocality', 'end_sublocality'])

In [150]: gpby.count()
Out[150]:
                                     unit
start_sublocality end_sublocality
1                 1                 67863
                  2                 17267
                  3                  3790
2                 1                 16972
                  2                771653
                  3                  2133
3                 1                  3595
                  2                  1921
                  3                  3795

In [151]: 771653/simpledf.shape[0]
Out[151]: 0.8680118651636859
```


* Based on the above, the data is heavily weighted toward `Manhattan`: 86.8% of all trips both start and end there.
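
The share arithmetic can be rechecked from the grouped counts alone. The sketch below copies the nine counts reported by `gpby.count()` into a plain dict and computes each pair's share of all trips, plus the combined within-borough share. `pair_counts` and `BOROUGH_NAMES` are illustrative names, and `BOROUGH_NAMES` is simply the inverse of the borough mapping given earlier.

```python
# Trip counts per (start, end) borough code, copied from gpby.count() above.
pair_counts = {
    (1, 1): 67863, (1, 2): 17267, (1, 3): 3790,
    (2, 1): 16972, (2, 2): 771653, (2, 3): 2133,
    (3, 1): 3595, (3, 2): 1921, (3, 3): 3795,
}
BOROUGH_NAMES = {1: 'Brooklyn', 2: 'Manhattan', 3: 'Queens'}

total = sum(pair_counts.values())  # should equal simpledf.shape[0]

# Share of all trips for each start/end pair, largest first.
shares = {pair: n / total for pair, n in pair_counts.items()}
for (start, end), share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{BOROUGH_NAMES[start]:>9} -> {BOROUGH_NAMES[end]:<9} {share:.4f}")

# Trips that stay inside one borough versus trips that cross boroughs.
within = sum(n for (start, end), n in pair_counts.items() if start == end)
print(f"within-borough share: {within / total:.4f}")
```

In pandas, the same table of counts should also be obtainable with `gpby.size()`, which would make the helper `unit` column unnecessary.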
