Closed namoopsoo closed 6 years ago
In [133]: simpledf = annotate_geo.make_dead_simple_df(next_df)
     ...:

In [134]: simpledf.shape
Out[134]: (888989, 2)

In [135]: simpledf.head()
Out[135]:
   start_sublocality  end_sublocality
0                  2                2
1                  2                2
2                  2                2
3                  2                2
4                  2                2

In [138]: simpledf.start_sublocality.value_counts()
Out[138]:
2    790758
1     88920
3      9311
Name: start_sublocality, dtype: int64
* Here, `make_dead_simple_df()` uses a simple borough-to-integer mapping:
```python
{'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3}
```

```python
In [147]: simpledf['unit'] = [1]*simpledf.shape[0]

In [149]: gpby = simpledf.groupby(by=['start_sublocality', 'end_sublocality'])

In [150]: gpby.count()
Out[150]:
                                     unit
start_sublocality end_sublocality
1                 1                 67863
                  2                 17267
                  3                  3790
2                 1                 16972
                  2                771653
                  3                  2133
3                 1                  3595
                  2                  1921
                  3                  3795

In [151]: 771653/simpledf.shape[0]
Out[151]: 0.8680118651636859
```
* So based on the above, the data is heavily weighted toward `Manhattan`: 86.8% of trips both start and end there.
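
For reference, the same kind of count and share can be sketched without the helper `unit` column. This is only an illustration on made-up toy rows, not the Citi Bike data; the column names `start_borough` and `end_borough` and the mapping step are assumptions standing in for whatever `make_dead_simple_df()` actually does.

```python
import pandas as pd

# Mapping copied from the dict shown in this issue.
BOROUGH_CODES = {'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3}

# Toy trips (made-up rows, NOT the real dataset).
trips = pd.DataFrame({
    'start_borough': ['Manhattan', 'Manhattan', 'Brooklyn', 'Manhattan', 'Queens'],
    'end_borough': ['Manhattan', 'Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan'],
})

# Hypothetical stand-in for make_dead_simple_df(): map names to integer codes.
simple = pd.DataFrame({
    'start_sublocality': trips['start_borough'].map(BOROUGH_CODES),
    'end_sublocality': trips['end_borough'].map(BOROUGH_CODES),
})

# Count trips per (start, end) pair; size() needs no extra 'unit' column.
pair_counts = simple.groupby(['start_sublocality', 'end_sublocality']).size()

# Fraction of all trips that falls into each pair.
pair_share = pair_counts / len(simple)

print(pair_share.sort_values(ascending=False))
```

Dividing the grouped counts by the total row count gives every pair's share at once, so the single within-Manhattan figure does not have to be computed by hand.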
Summary
`zipcode` and `borough`.
Tasks
`201510-citibike-tripdata.geotagged.csv` from `201510-citibike-tripdata.csv` using my earlier annotation code.
for later