root@9984a296c093:/opt/program# python train
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
Exception("unknown dtype<type 'float'>",)
> /opt/conda/lib/python2.7/site-packages/bikelearn/classify.py(78)build_label_encoders_from_df()
77 else:
---> 78 raise Exception, 'unknown dtype' + str(dtype)
79
ipdb> pp feature_encoding
['start_postal_code', 'start_sublocality', 'start_neighborhood', 'end_neighborhood']
ipdb> pp feature in feature_encoding
True
ipdb> pp feature
'start_postal_code'
ipdb> pp dtype
<type 'float'>
ipdb> pp dtype == float
True
ipdb> pp df[feature].head().values
array([11249., 11249., 11249., 11249., 11249.])
ipdb>
#### ok... changed that up
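Roughly the kind of change, sketched (hedged: this is not the actual `bikelearn.classify` source, and `build_label_encoder` is an illustrative name; the float case just normalizes values like `11249.0` to strings before fitting):

```python
from sklearn.preprocessing import LabelEncoder

def build_label_encoder(df, feature, dtype):
    # illustrative sketch, not the real build_label_encoders_from_df
    if dtype == str:
        values = df[feature].fillna('nan').astype(str)
    elif dtype == float:
        # e.g. postal codes parsed as 11249.0; normalize before encoding
        values = df[feature].fillna(-1).astype(int).astype(str)
    else:
        raise Exception('unknown dtype ' + str(dtype))
    return LabelEncoder().fit(values)
```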
#### also, from a quick run earlier...
```python
In [54]: clf = RandomForestClassifier(max_depth=2, random_state=0)
In [55]: clf.fit(X_train, y_train)
Out[55]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
In [57]: print clf.feature_importances_
[0.20919847 0.29880775 0.37548998 0.02124216 0.03291293 0.03416068
 0.02818804]
```
* ran `python train` in the container manually.
root@d19ff4f4aa5e:/opt/program# python train
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
root@d19ff4f4aa5e:/opt/program# ipython
...
...
In [1]: import cPickle
...
In [10]: modelfn = '/opt/ml/model/tree-foo-bundle.pkl'
In [11]: model = cPickle.load(open(modelfn))
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
In [12]: model
Out[12]:
{'bundle_name': 'tree-foo-bundle.pkl',
 'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
 'model_id': 'tree-foo'}

In [13]: model['clf'].feature_importances_
Out[13]:
array([0.34746472, 0.34359613, 0.22347966, 0.00633857, 0.01033549,
       0.03439006, 0.03439537])
#### interpreting those importances again..
```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood', 'start_day', 'start_hour', 'age', 'gender']
```
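Pairing the importances printed above with these column names, just to read them off (a quick sketch):

```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood',
        'start_day', 'start_hour', 'age', 'gender']
importances = [0.20919847, 0.29880775, 0.37548998, 0.02124216,
               0.03291293, 0.03416068, 0.02818804]
# sort features by importance, descending
for name, imp in sorted(zip(cols, importances), key=lambda p: -p[1]):
    print('{}: {:.3f}'.format(name, imp))
```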
* `start_day`, `start_hour` are not impacting the decision whatsoever.
* Trained on a 2015/10 set and did a test using a 2016/01 set,
(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml --rm citibike-learn-blah train
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
!missing stations: set(['Center Blvd\xc2\xa0& Borden Ave'])
(citilearnsage) $
(citilearnsage) $ ls -alrt local_test/test_dir/model/
total 184
drwxr-xr-x@ 5 michal staff 170 Jul 15 14:32 ..
-rw-r--r--@ 1 michal staff 6881 Jul 15 15:52 decision-tree-model.pkl
-rw-r--r--@ 1 michal staff 37030 Jul 29 15:15 tree-foo-bundle.2018-07-29T1915ZUTC.pkl
-rw-r--r--@ 1 michal staff 44124 Aug 12 13:12 tree-foo-bundle.2018-08-12T171255ZUTC.pkl
In [1]: fn = 'sagemaker/local_test/test_dir/model/tree-foo-bundle.2018-08-12T171255ZUT
...: C.pkl'
...:
In [2]: import cPickle
...: with open(fn) as fd:
...: bundle = cPickle.load(fd)
...: bundle
...:
Out[2]:
{'bundle_name': 'tree-foo-bundle',
'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False),
'clf_info': {'feature_importances': [('start_postal_code',
0.28489345066586647),
('start_sublocality', 0.27812701364478254),
('start_neighborhood', 0.24554879740470287),
('start_day', 0.022793741554315638),
('start_hour', 0.004790427541855301),
('age', 0.002772245039587746),
('gender', 0.0739331660713863),
('usertype', 0.08714115807750318)]},
'evaluation': {'validation_proportion_correct': 0.9759847738514625},
'features': {'dtypes': {'age': float,
'end_neighborhood': str,
'start_neighborhood': str,
'start_postal_code': str,
'start_sublocality': str,
'usertype': str},
'input': ['start_postal_code',
'start_sublocality',
'start_neighborhood',
'start_day',
'start_hour',
'age',
'gender',
'usertype'],
'output_label': 'end_neighborhood'},
'label_encoders': {'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()},
'model_id': 'tree-foo',
'timestamp': '2018-08-12T171255ZUTC',
'train_metadata': {'stations_df_fn': '/opt/ml/input/config/start_stations_103115.csv',
'trainset_fn': '/opt/ml/input/data/training/train.2018-07-28T210403.csv'}}
In [3]: import pandas as pd
...: holdout_df = pd.read_csv('/......./data/citibike/201601-citibike-tripdata.csv')
...: holdout_df.shape
...: import bikelearn.classify as blc
...:
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
In [6]: import bikelearn.settings as s
...:
In [8]: import os
In [9]: stations_fn = os.path.join(s.DATAS_DIR, 'start_stations_103115.fuller.
...: csv')
...: stations_df = pd.read_csv(stations_fn, index_col=0, dtype={'postal_cod
...: e': str})
...:
In [10]: %time y_predictions, y_test = blc.run_model_predict(bundle, holdout_df, stati
...: ons_df)
CPU times: user 1min 9s, sys: 1.4 s, total: 1min 10s
Wall time: 59.3 s
In [11]: y_predictions.shape, y_test.shape
Out[11]: ((501173,), (501173, 1))
In [12]: zipped = zip(y_predictions, [v[0] for v in y_test])
In [13]: zipped[:5]
Out[13]: [(8, 0), (8, 0), (8, 0), (8, 0), (8, 0)]
In [14]: correct = len([[x,y] for x,y in zipped if x == y])
...:
In [15]: proportion_correct = 1.0*correct/y_predictions.shape[0]
In [16]: proportion_correct
Out[16]: 0.011076015667244645
# docker run <image> serve
# docker run -v $(pwd)/test_dir:/opt/ml --rm ${image} train
docker run -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah
docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
#### oops... kept getting a "not found" `OSError` when running `serve`
* but it was because I forgot to uncomment the line to install `gunicorn`
Traceback (most recent call last):
File "./serve", line 71, in
#### try again...
(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
Starting the inference server with 4 workers.
[2018-09-09 15:17:28 +0000] [13] [INFO] Starting gunicorn 19.9.0
[2018-09-09 15:17:28 +0000] [13] [INFO] Listening at: unix:/tmp/gunicorn.sock (13)
[2018-09-09 15:17:28 +0000] [13] [INFO] Using worker: gevent
[2018-09-09 15:17:28 +0000] [17] [INFO] Booting worker with pid: 17
[2018-09-09 15:17:28 +0000] [18] [INFO] Booting worker with pid: 18
[2018-09-09 15:17:28 +0000] [20] [INFO] Booting worker with pid: 20
[2018-09-09 15:17:28 +0000] [22] [INFO] Booting worker with pid: 22
* ok nice, workers spawned now.
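Quick smoke test against the locally served container (`/ping` is the health check SageMaker itself uses, per Ref [1] below):

```python
import requests

# expect a 200 once the gunicorn workers are up
print(requests.get('http://localhost:8080/ping').status_code)
```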
### Ref
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
$(aws --profile myblahprofile ecr get-login --no-include-email --region us-east-1)
docker tag citibike-learn-blah:latest citibike-learn-blah:0.1.0
docker tag citibike-learn-blah:latest xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:0.1.0
# serve locally..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
# run interactively for local debugging..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah
* Got a `ValidationException` when creating a new custom SageMaker model,
# custom role,
arn:aws:iam::<account id>:role/mySageMakerFullRole
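For reference, a minimal boto3 sketch of that model-creation step (hedged: the image URI and role ARN are the placeholders from above, not real values):

```python
import boto3

# sketch: create the custom SageMaker model pointing at the ECR image;
# the ExecutionRoleArn is the custom role, presumably the fix for the ValidationException
sm = boto3.client('sagemaker', region_name='us-east-1')
sm.create_model(
    ModelName='citibike-learn-blah',
    PrimaryContainer={
        'Image': 'xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest',
    },
    ExecutionRoleArn='arn:aws:iam::<account id>:role/mySageMakerFullRole',
)
```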
* then hitting the API, the output is always `8`,
In [67]: for i in range(10):
...: data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
...: print data, call_api(data, url, headers).json()
...:
{'blah': {'start_station': 'Broadway & W 55 St', 'start_time': '10/8/2015 11:01:24', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Barclay St & Church St', 'start_time': '10/23/2015 11:55:30', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Broadway & W 58 St', 'start_time': '10/15/2015 19:28:09', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Berry St & N 8 St', 'start_time': '10/5/2015 11:34:35', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1973'}} {u'output': u'8\n'}
{'blah': {'start_station': 'W 13 St & 7 Ave', 'start_time': '10/17/2015 20:57:15', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1970'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Broadway & W 60 St', 'start_time': '10/14/2015 17:03:13', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1985'}} {u'output': u'8\n'}
{'blah': {'start_station': 'State St & Smith St', 'start_time': '10/17/2015 09:58:27', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1985'}} {u'output': u'8\n'}
{'blah': {'start_station': 'W 37 St & 10 Ave', 'start_time': '10/6/2015 18:02:11', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1973'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Stanton St & Chrystie St', 'start_time': '10/22/2015 15:23:09', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1989'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Norfolk St & Broome St', 'start_time': '10/7/2015 22:32:43', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1954'}} {u'output': u'8\n'}
* still `8` every time.
In [68]: outputs = []
In [69]: traindf.shape
Out[69]: (969821, 16)
In [71]: for _ in range(100):
...: i = random.randint(1111, 969820)
...: data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
...: # print data, call_api(data, url, headers).json()
...: out = call_api(data, url, headers)
...: try:
...: out_json = out.json()
...: except ValueError:
...: out_json = {'text': out.text}
...: except Exception as e:
...: out_json = {'text': out.text, 'e': str(e.message)}
...: outputs.append({'data': data, 'out_json': out_json, 'i': i})
...:
...:
In [72]: len(outputs)
Out[72]: 100
In [73]: outputs[0]
Out[73]:
{'data': {'blah': {'birth_year': '1993',
'rider_gender': '1',
'rider_type': 'Subscriber',
'start_station': 'Washington Pl & Broadway',
'start_time': '10/28/2015 13:09:23'}},
'i': 875240,
'out_json': {u'output': u'8\n'}}
In [75]: not8 = [x for x in outputs if x['out_json']['output'] != u'8\n']
In [76]: len(not8)
Out[76]: 0
In [79]: bundle['label_encoders']
Out[79]:
{'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()}
In [80]: bundle['label_encoders']['end_neighborhood']
Out[80]: LabelEncoder()
In [81]: vars(bundle['label_encoders']['end_neighborhood'])
Out[81]:
{'classes_': array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
'Long Island City', 'Williamsburg', 'nan'], dtype=object)}
In [82]: len(vars(bundle['label_encoders']['end_neighborhood'])['classes_'])
Out[82]: 9
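Only 9 classes (including the `'-1'` and `'nan'` sentinels) were fit, so any value the encoder never saw would make `transform` raise; a hedged sketch of a guard for that (`safe_label_transform` is my name, not a `bikelearn` function):

```python
def safe_label_transform(le, values):
    # map values the fitted LabelEncoder never saw to the '-1' missing class
    known = set(le.classes_)
    return le.transform([v if v in known else '-1' for v in values])
```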
(Pdb) len(stations_df.neighborhood.value_counts())
23
(Pdb) stations_df.neighborhood.value_counts()
Lower Manhattan 116
Midtown 113
Williamsburg 42
Bedford-Stuyvesant 34
Upper East Side 20
Upper West Side 17
Greenpoint 15
Downtown Brooklyn 14
Fort Greene 13
Long Island City 13
Clinton Hill 10
Brooklyn Heights 10
Lower East Side 6
Midtown East 6
Midtown West 3
Hell's Kitchen 3
Lincoln Square 3
Central Park 3
Boerum Hill 2
East Village 2
Dumbo 2
Alphabet City 1
Yorkville 1
Name: neighborhood, dtype: int64
(Pdb)
* also.. the stations df has `13/465` which are `nan`. Not ideal.
```python
(Pdb) pp stations_df.head()
station_name postal_code sublocality neighborhood state
0 W 26 St & 10 Ave 10001 Manhattan Midtown NY
1 E 39 St & 2 Ave 10016 Manhattan Midtown NY
2 8 Ave & W 52 St 10019 Manhattan Midtown NY
3 Sullivan St & Washington Sq 10012 Manhattan Lower Manhattan NY
4 Bedford Ave & Nassau Ave 11222 Brooklyn Greenpoint NY
(Pdb) pp stations_df[stations_df.neighborhood.isnull()].shape
(13, 5)
(Pdb) pp stations_df[stations_df.neighborhood.isnull()]
station_name postal_code sublocality neighborhood state
29 Clermont Ave & Lafayette Ave 11238 Brooklyn NaN NY
45 Montrose Ave & Bushwick Ave 11206 Brooklyn NaN NY
66 Nassau St & Navy St 11201 Brooklyn NaN NY
199 FDR Drive & E 35 St 10016 Manhattan NaN NY
203 Fulton St & Rockwell Pl 11201 Brooklyn NaN NY
230 York St & Jay St 11201 Brooklyn NaN NY
233 Carlton Ave & Flushing Ave 11205 Brooklyn NaN NY
328 Central Park West & W 68 St 10023 Manhattan NaN NY
336 Front St & Gold St 11201 Brooklyn NaN NY
392 Sands St & Navy St 11201 Brooklyn NaN NY
396 3 Ave & Schermerhorn St 11217 Brooklyn NaN NY
412 Clinton Ave & Flushing Ave 11205 Brooklyn NaN NY
440 Railroad Ave & Kay Ave 11249 Brooklyn NaN NY
```
In [98]: stations_df = bundle['train_metadata']['stations_df']
In [99]: stations_df.shape
Out[99]: (462, 6)
In [100]: stations_df.head()
Out[100]:
Unnamed: 0 station_name postal_code sublocality neighborhood state
0 0 W 26 St & 10 Ave NaN Manhattan NaN NY
1 1 E 39 St & 2 Ave NaN Manhattan NaN NY
2 2 8 Ave & W 52 St NaN Manhattan NaN NY
3 3 Sullivan St & Washington Sq NaN Manhattan NaN NY
4 4 Bedford Ave & Nassau Ave NaN Brooklyn NaN NY
In [101]: stations_df[stations_df.neighborhood.isnull()].shape
Out[101]: (421, 6)
* so it's not like the `nan` problem is fixed; lots of `nan`s in there though.
url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests

def call_api(data, url, headers):
    out = requests.post(url, json=data, headers=headers)
    return out

outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})
In [114]: Counter([x['out_json']['output'].strip() for x in outputs])
Out[114]: Counter({u'17': 100})

In [116]: with open('/Users/michal/Downloads/2018-10-21-comparingmodels/job-7/build/tree-foo-bundle.2018-10-21T205144ZUTC.pkl') as fd:
     ...:     winterbundle = cPickle.load(fd)
     ...:

In [124]: winterbundle['label_encoders']['end_neighborhood'].classes_
Out[124]:
array(['-1', 'Alphabet City', 'Bedford-Stuyvesant', 'Boerum Hill',
       'Brooklyn Heights', 'Central Park', 'Clinton Hill',
       'Downtown Brooklyn', 'Dumbo', 'East Village', 'Fort Greene',
       'Greenpoint', "Hell's Kitchen", 'Lincoln Square', 'Long Island City',
       'Lower East Side', 'Lower Manhattan', 'Midtown', 'Midtown East',
       'Midtown West', 'Murray Hill', 'Navy Yard', 'Upper East Side',
       'Upper West Side', 'Vinegar Hill', 'Williamsburg', 'Yorkville'],
      dtype=object)

In [125]: len(winterbundle['label_encoders']['end_neighborhood'].classes_)
Out[125]: 27

In [126]: winterbundle['evaluation']
Out[126]: {'validation_proportion_correct': 0.46358773861778346}

In [127]: simpledf.shape
Out[127]: (969384, 31)

In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}
In [156]: foo = call_api(data, url, headers)
* So yea, looks like that works. And ran it on the full dataset too, though still with the `nans` problem,
In [138]: %time y_predictionsmany, y_testmany = blc.run_model_predict(winterbundle, atraindf, fullerstations_df, True)
CPU times: user 2min 7s, sys: 4.41 s, total: 2min 11s
Wall time: 2min 9s

In [139]: len(y_predictionsmany), len(y_testmany)
Out[139]: (969384, 969384)

In [140]: %paste
def get_basic_proportion_correct(y_test, y_predictions):
    zipped = zip(y_test, y_predictions)
    correct = len([[x,y] for x,y in zipped if x == y])
    proportion_correct = 1.0*correct/y_test.shape[0]
    return proportion_correct

In [141]: get_basic_proportion_correct(y_predictionsmany, y_testmany)
Out[141]: 0.4634850585526479

In [142]: Counter(y_predictionsmany[:10])
Out[142]: Counter({16: 1, 17: 9})

In [143]: Counter(y_predictionsmany)
Out[143]: Counter({16: 123450, 17: 811252, 25: 34682})
* also, the actual label distribution in `y_testmany`,
```python
In [152]: print dict(Counter([x[0] for x in y_testmany]))
{1: 1231, 2: 9011, 3: 2798, 4: 7428, 5: 6826, 6: 6835, 7: 12503, 8: 4165, 9: 6277, 10: 9415, 11: 9750, 12: 7191, 13: 5080, 14: 5628, 15: 12146, 16: 322132, 17: 409018, 18: 12325, 19: 9641, 20: 2444, 21: 1467, 22: 39364, 23: 34895, 24: 629, 25: 29848, 26: 1337}
```
url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests
def call_api(data, url, headers):
    out = requests.post(url, json=data, headers=headers)
    return out
In [154]: %history 109
outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    # print data, call_api(data, url, headers).json()
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})
In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}
In [156]: foo = call_api(data, url, headers)
In [157]: foo
Out[157]: <Response [200]>
In [158]: print foo.json()
{u'output': u'17\n'}
In [160]: call_api(data, url, {})
Out[160]: <Response [200]>
In [161]: call_api(data, url, {}).json()
Out[161]: {u'output': u'17\n'}
> bikelearn/tests/test_metrics.py(22)test_basic()
-> pass
(Pdb) pp zip(y_test, sorted_outputs)
[(25, [17, 16, 25, 22, 23]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 25, 7, 22]),
(16, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 25]),
(2, [17, 16, 25, 7, 22]),
(25, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 25]),
(25, [17, 16, 25, 7, 22]),
(25, [17, 16, 22, 23, 25]),
(22, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 18])]
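What that pdb output is eyeballing is a top-k hit rate; a minimal sketch of that metric (my naming, not the `bikelearn.metrics_utils` API):

```python
def rank_k_accuracy(pairs, k=5):
    # pairs is the zip(y_test, sorted_outputs) structure printed above;
    # count rows whose true label appears among the top-k predictions
    hits = [1 for y_true, topk in pairs if y_true in topk[:k]]
    return 1.0 * len(hits) / len(pairs)
```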
* don't want `'nan'` calculated in the proportion correct data..
!missing stations: set(['MacDougal St & Prince St', 'Washington Square E', 'E 81 St & York Ave', 'Schermerhorn St & Court St', 'E 77 St & Park Ave', 'Center Blvd & Borden Ave', 'E 77 St & 3 Ave', 'E 71 St & 2 Ave', 'University Pl & E 8 St', 'Leonard St & Meeker Ave', 'PABT Valet', 'Broadway & Roebling St', 'E 80 St & 2 Ave', 'Union Ave & N 12 St', 'Monroe St & Tompkins Ave', 'E 40 St & 5 Ave', 'E 58 St & 1 Ave'])
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(14)do_validation()
13 classes = clf.classes_
---> 14 y_predict_proba = clf.predict_proba(X_validation)
15
ipdb> pp classes
array([1, 2, 3, 4, 5, 6, 7, 8])
ipdb> n
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(16)do_validation()
15
---> 16 metrics = gather_metrics(y_validation, y_predictions, y_predict_proba, classes)
17 return metrics
ipdb> from collections import Counter
ipdb> Counter(y_predictions)
Counter({8: 100235})
ipdb> Counter(y_validation)
Counter({8: 97977, 7: 862, 5: 364, 4: 357, 6: 255, 1: 174, 3: 130, 2: 116})
ipdb>
ipdb> pp skm.confusion_matrix(y_validation, y_predictions, classes)
array([[ 0, 0, 0, 0, 0, 0, 0, 174],
[ 0, 0, 0, 0, 0, 0, 0, 116],
[ 0, 0, 0, 0, 0, 0, 0, 130],
[ 0, 0, 0, 0, 0, 0, 0, 357],
[ 0, 0, 0, 0, 0, 0, 0, 364],
[ 0, 0, 0, 0, 0, 0, 0, 255],
[ 0, 0, 0, 0, 0, 0, 0, 862],
[ 0, 0, 0, 0, 0, 0, 0, 97977]])
ipdb>
* tried checking for `nan`, but doh, that doesn't work because that is pre-label encoding!
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(48)get_proportion_correct()
47 zipped = zip(y_validation, y_predictions_validation)
---> 48 correct = len([[x,y] for x,y in zipped if x in y and y != 'nan'])
49 proportion_correct = 1.0*correct/y_validation.shape[0]
ipdb> pp zipped[:3]
[(8, [8]), (8, [8]), (8, [8])]
ipdb>
#### here..
```python
ipdb> pp label_encoders
{'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()}
ipdb> pp label_encoders['end_neighborhood']
LabelEncoder()
ipdb> pp label_encoders['end_neighborhood'].classes_
array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
'Long Island City', 'Williamsburg', 'nan'], dtype=object)
ipdb>
assert not any(['nan' in le.classes_ for le in label_encoders.values()])
```
import bikelearn.settings as s
s.GOOGLE_GEO_API_KEY
address = "W 26 St & 10 Ave"
import bikelearn.get_station_geolocation_data as getgeo
address = "W 26 St & 10 Ave"
data = getgeo.get_geocoding_results(address)
help(getgeo.get_geocoding_results)
getgeo.get_geocoding_results??
import ipdb
getgeo.get_geocoding_results??
data = ipdb.runcall(getgeo.get_geocoding_results, address, request_type='geo')
## stations file
stations_json_filename = \
'datas/start_stations_103115.fuller.csv'
stationsdf = pd.read_csv(stations_json_filename, index_col=0)
foo = ipdb.runcall(getgeo.get_station_geoloc_data, stationsdf=stationsdf.iloc[:5])
In [84]: validsdf.levels.value_counts()
Out[84]:
4 428
2 301
3 95
1 56
0 3
Name: levels, dtype: int64
In [85]: validsdf['type'] = validsdf['address'].map(lambda x: 'geo' if x.endswith('NY'
...: ) else 'latlng')
In [86]: validsdf[validsdf['type'] == 'geo'].levels.value_counts()
Out[86]:
2 300
3 82
1 56
4 21
0 3
Name: levels, dtype: int64
In [87]: validsdf[validsdf['type'] == 'latlng'].levels.value_counts()
Out[87]:
4 407
3 13
2 1
Name: levels, dtype: int64
In [99]: getgeo.redis_client.hdel(s.GEO_RAW_RESULTS, *addresses_todelete)
Out[99]: 441
ipdb> pp geocoding_result
{u'error_message': u'You have exceeded your daily request quota for this API. If you did not set a custom daily request quota, verify your project has an active billing account: http://g.co/dev/maps-no-account',
u'results': [],
u'status': u'OVER_QUERY_LIMIT'}
`E 3 St & 1 Ave, NY`, for instance ...
ipdb> pp raw_results_list
[{u'address_components': [{u'long_name': u'East 3rd Street',
u'short_name': u'E 3rd St',
u'types': [u'route']},
{u'long_name': u'Manhattan',
u'short_name': u'Manhattan',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'New York',
u'short_name': u'New York',
u'types': [u'locality', u'political']},
{u'long_name': u'New York County',
u'short_name': u'New York County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country', u'political']}],
u'formatted_address': u'E 3rd St, New York, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.7267011,
u'lng': -73.9778606},
u'southwest': {u'lat': 40.7203954,
u'lng': -73.99186399999999}},
u'location': {u'lat': 40.7236548, u'lng': -73.9852944},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.7267011,
u'lng': -73.9778606},
u'southwest': {u'lat': 40.7203954,
u'lng': -73.99186399999999}}},
u'place_id': u'ChIJuWov94JZwokRy-aQnT-VklI',
u'types': [u'route']}]
stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])
commit 438e425482db1c105ae6a22b8248696d0f91dfef
Date: Mon Oct 2 17:56:15 2017 -0400
...
* `&` characters are, I think, treated as query string parameters. Grr...
stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])
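One way around that is to let `requests` percent-encode the `&` by passing the address through `params` (a hedged sketch; the endpoint is the standard Maps geocoding URL, and the API key is a placeholder):

```python
import requests

def geocode(address, api_key):
    # passing the address via params percent-encodes '&' as %26,
    # so it is not parsed as a query-string separator
    resp = requests.get(
        'https://maps.googleapis.com/maps/api/geocode/json',
        params={'address': address, 'key': api_key})
    return resp.json()

out = geocode('Clinton Ave & Flushing Ave, NY', 'MY_PLACEHOLDER_KEY')
```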
In [167]: getgeo.extract_lat_lng_from_response(out.json()['results'])
Out[167]: {u'lat': 40.7957399, u'lng': -73.93892129999999}
* `'Clinton Ave & Flushing Ave'`, for instance. Should determine how many are like this and then perhaps move a mile somewhere and try to get a closest-match neighborhood if possible.
stations = getgeo.extract_stations_from_files(filenames=[
    "201510-citibike-tripdata.csv", "201601-citibike-tripdata.csv",
    ...
    ])
{u'plus_code': {u'compound_code': u'M2XJ+44 New York, NY, USA',
u'global_code': u'87G8M2XJ+44'},
u'results': [{u'address_components': [{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Clinton Ave & Flushing Ave, United States',
u'geometry': {u'location': {u'lat': 40.69794,
u'lng': -73.96986849999999},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.6992889802915,
u'lng': -73.9685195197085},
u'southwest': {u'lat': 40.6965910197085,
u'lng': -73.9712174802915}}},
u'place_id': u'ChIJKwagFMZbwokRtjuK5VRvb-E',
u'plus_code': {u'compound_code': u'M2XJ+53 New York, United States',
u'global_code': u'87G8M2XJ+53'},
u'types': [u'establishment', u'point_of_interest']},
{u'address_components': [{u'long_name': u'164',
u'short_name': u'164',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'164 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'location': {u'lat': 40.6976577,
u'lng': -73.9699304},
u'location_type': u'ROOFTOP',
u'viewport': {u'northeast': {u'lat': 40.6990066802915,
u'lng': -73.9685814197085},
u'southwest': {u'lat': 40.6963087197085,
u'lng': -73.9712793802915}}},
u'place_id': u'ChIJ3TPWP8ZbwokR8dlF4H6Vr8Q',
u'plus_code': {u'compound_code': u'M2XJ+32 New York, United States',
u'global_code': u'87G8M2XJ+32'},
u'types': [u'street_address']},
{u'address_components': [{u'long_name': u'168',
u'short_name': u'168',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'168 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6976934,
u'lng': -73.96936649999999},
u'southwest': {u'lat': 40.697549,
u'lng': -73.969557}},
u'location': {u'lat': 40.6976345,
u'lng': -73.969481},
u'location_type': u'ROOFTOP',
u'viewport': {u'northeast': {u'lat': 40.6989701802915,
u'lng': -73.96811276970848},
u'southwest': {u'lat': 40.6962722197085,
u'lng': -73.9708107302915}}},
u'place_id': u'ChIJ43kQasZbwokRMEsGfYDYL6Y',
u'types': [u'premise']},
{u'address_components': [{u'long_name': u'99',
u'short_name': u'99',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'99 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'location': {u'lat': 40.6978344,
u'lng': -73.96973249999999},
u'location_type': u'RANGE_INTERPOLATED',
u'viewport': {u'northeast': {u'lat': 40.69918338029149,
u'lng': -73.96838351970848},
u'southwest': {u'lat': 40.69648541970849,
u'lng': -73.9710814802915}}},
u'place_id': u'Eig5OSBGbHVzaGluZyBBdmUsIEJyb29rbHluLCBOWSAxMTIwNSwgVVNBIhoSGAoUChIJtcpdFsZbwokRiFlwXEBWntUQYw',
u'types': [u'street_address']},
{u'address_components': [{u'long_name': u'Clinton Avenue',
u'short_name': u'Clinton Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'Clinton Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6978344,
u'lng': -73.96973249999999},
u'southwest': {u'lat': 40.6977702,
u'lng': -73.9697406}},
u'location': {u'lat': 40.6978023,
u'lng': -73.96973659999999},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.6991512802915,
u'lng': -73.96838756970848},
u'southwest': {u'lat': 40.6964533197085,
u'lng': -73.9710855302915}}},
u'place_id': u'ChIJu0sva8ZbwokRrMKu_ii2I2Y',
u'types': [u'route']},
{u'address_components': [{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.7066249,
u'lng': -73.948331},
u'southwest': {u'lat': 40.68741490000001,
u'lng': -73.980632}},
u'location': {u'lat': 40.6945036,
u'lng': -73.9565551},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.7066249,
u'lng': -73.948331},
u'southwest': {u'lat': 40.68741490000001,
u'lng': -73.980632}}},
u'place_id': u'ChIJLywE8sZbwokRiLapmmo79YU',
u'types': [u'postal_code']},
{u'address_components': [{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Kings County, Brooklyn, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}},
u'location': {u'lat': 40.6528762,
u'lng': -73.95949399999999},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}}},
u'place_id': u'ChIJOwE7_GTtwokRs75rhW4_I6M',
u'types': [u'administrative_area_level_2', u'political']},
{u'address_components': [{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Brooklyn, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}},
u'location': {u'lat': 40.6781784,
u'lng': -73.9441579},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}}},
u'place_id': u'ChIJCSF8lBZEwokRhngABHRcdoI',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'address_components': [{u'long_name': u'New York',
u'short_name': u'New York',
u'types': [u'locality',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'New York, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.9175771,
u'lng': -73.70027209999999},
u'southwest': {u'lat': 40.4773991,
u'lng': -74.25908989999999}},
u'location': {u'lat': 40.7127753,
u'lng': -74.0059728},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.9175771,
u'lng': -73.70027209999999},
u'southwest': {u'lat': 40.4773991,
u'lng': -74.25908989999999}}},
u'place_id': u'ChIJOwg_06VPwokRYv534QaPC8g',
u'types': [u'locality', u'political']},
{u'address_components': [{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'New York, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 45.015865,
u'lng': -71.777491},
u'southwest': {u'lat': 40.4773991,
u'lng': -79.7625901}},
u'location': {u'lat': 43.2994285,
u'lng': -74.21793260000001},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 45.015865,
u'lng': -71.777491},
u'southwest': {u'lat': 40.4773991,
u'lng': -79.7625901}}},
u'place_id': u'ChIJqaUj8fBLzEwRZ5UY3sHGz90',
u'types': [u'administrative_area_level_1', u'political']},
{u'address_components': [{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'United States',
u'geometry': {u'bounds': {u'northeast': {u'lat': 71.5388001,
u'lng': -66.885417},
u'southwest': {u'lat': 18.7763,
u'lng': 170.5957}},
u'location': {u'lat': 37.09024,
u'lng': -95.712891},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 71.5388001,
u'lng': -66.885417},
u'southwest': {u'lat': 18.7763,
u'lng': 170.5957}}},
u'place_id': u'ChIJCzYy5IS16lQRQrfeQ5K5Oxw',
u'types': [u'country', u'political']}],
u'status': u'OK'}
In [421]: sorted(dedupeddf['start station name'].value_counts().to_dict().items(), key
...: =lambda x: x[1])[-1]
Out[421]: ('Kent Ave & N 7 St', 3)
In [422]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St']
Out[422]:
start station name ... latlng
360 Kent Ave & N 7 St ... 40.7207255341,-73.9612591267
433 Kent Ave & N 7 St ... 40.72057658,-73.96150225
441 Kent Ave & N 7 St ... 40.720367753,-73.9616507292
[3 rows x 4 columns]
In [423]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St'].values
Out[423]:
array([['Kent Ave & N 7 St', 40.720725534125954, -73.9612591266632,
'40.7207255341,-73.9612591267'],
['Kent Ave & N 7 St', 40.72057658, -73.96150225,
'40.72057658,-73.96150225'],
['Kent Ave & N 7 St', 40.72036775298455, -73.96165072917937,
'40.720367753,-73.9616507292']], dtype=object)
%time stationsdf = getgeo.extract_stations_latlng_df_from_files(filenames=[
...
])
# Wall time: 1min 25s
# In [414]: stationsdf.shape
# Out[414]: (7127, 4)
dedupeddf = getgeo.some_stationdf_dedupe(stationsdf)
# In [431]: dedupeddf.shape
# Out[431]: (682, 4)
annotated_df = getgeo.annotate_station_df(dedupeddf)
newdf = pd.concat([dedupeddf, foodf], axis=1)
In [563]: handmade
Out[563]:
[{'neighborhood': 'Fort Greene', 'station': 'DeKalb Ave & Hudson Ave'},
{'neighborhood': u'NoMad',
'postal_code': u'10001',
'state': u'NY',
'station': 'Broadway & W 29 St',
'sublocality': u'Manhattan'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Gold St'},
{'neighborhood': 'Dumbo', 'station': 'York St & Jay St'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Flushing Ave & Carlton Ave'},
{'neighborhood': 'Columbia Street Waterfront District',
'station': 'Atlantic Ave & Furman St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Railroad Ave & Kay Ave'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Clinton Ave & Flushing Ave'},
{'neighborhood': 'Park Slope', 'station': 'Dean St & 4 Ave'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Nassau St & Navy St'},
{'neighborhood': 'Vinegar Hill', 'station': 'Front St & Gold St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': '7 Ave & Farragut St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Navy St'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Carlton Ave & Flushing Ave'},
{'neighborhood': 'Financial District', 'station': 'Peck Slip & Front St'},
{'neighborhood': 'Prospect Heights',
'station': 'Bike The Branches - Central Branch'},
{'neighborhood': 'Cobble Hill', 'station': 'Henry St & Degraw St'},
{'neighborhood': 'Park Slope', 'station': 'Union St & 4 Ave'},
{'neighborhood': 'Park Slope', 'station': 'Douglass St & 4 Ave'},
{'neighborhood': 'Prospect Park',
'station': 'Bike in Movie Night | Prospect Park Bandshell'},
{'neighborhood': 'Park Slope', 'station': 'West Drive & Prospect Park West'},
{'neighborhood': 'Park Slope', 'station': '4 Ave & 2 St'}]
handmadedf = handmadedf.rename(columns={'neighborhood': 'hand_neighborhood'})
moredf = newdf.merge(handmadedf, left_on='start station name', right_on='station', how='left')
# update those 22 hand made neighborhood labels.
moredf.ix[moredf[moredf.neighborhood.isnull()].index.tolist(),'neighborhood'] = moredf.ix[moredf[moredf.neighborhood.isnull()].index.tolist(), 'hand_neighborhood']
moredf.drop(labels=['sublocality_y', 'state_y', 'postal_code_y'],axis=1, inplace=True)
moredf.rename(columns={'sublocality_x':'sublocality','state_x':'state','postal_code_x': 'postal_code'},inplace=True)
moredf.drop(labels=['hand_neighborhood', 'station'],axis=1, inplace=True)
In [602]: fn = '/........./learn-citibike/datas/station
...: s/stations-2018-12-04-b.pkl'
In [603]: moredf.to_pickle(fn)
In [604]: previouslymissing = ['MacDougal St & Prince St', 'Washington Square E', 'E 8
...: 1 St & York Ave', 'Schermerhorn St & Court St', 'E 77 St & Park Ave', 'Cente
...: r Blvd & Borden Ave', 'E 77 St & 3 Ave', 'E 71 St & 2 Ave', 'University Pl &
...: E 8 St', 'Leonard St & Meeker Ave', 'PABT Valet', 'Broadway & Roebling St',
...: 'E 80 St & 2 Ave', 'Union Ave & N 12 St', 'Monroe St & Tompkins Ave', 'E 40
...: St & 5 Ave', 'E 58 St & 1 Ave']
In [605]: moredf['start station name'].value_counts().shape
Out[605]: (682,)
In [606]: moredf.shape
Out[606]: (682, 9)
In [607]: moredf[moredf['start station name'].isin(previouslymissing)].shape
Out[607]: (17, 9)
In [608]: len(previouslymissing)
Out[608]: 17
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot
clffnpdf = 'myfile.pdf'
dot_data = StringIO()
# First one
clf = bundle['clf'].estimators_[0]
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
# (note: newer pydot versions return a list here, i.e. graph_from_dot_data(...)[0])
graph.write_pdf(clffnpdf)
* Updated the `/invocations` endpoint to support labeled data in order to do batch model validation on new datasets. Put the 170MiB `201603-citibike-tripdata.csv` file into S3 and specified this as the input, and I got,
2018/12/09 21:38:02 [error] 9#9: *4 client intended to send too large body: 6291424 bytes, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", host: "169.254.255.131:8080"
* 25.2MiB, but how would that help if the above error seems to complain about 6291424 bytes, ~6MiB? 6MiB is some kind of a default; the `nginx.conf` uses `client_max_body_size 5m;`. Changed that to `8m` and I will retry. But that means I need to update my csv df hydrate function to be headerless.
* The 170MiB `201603-citibike-tripdata.csv` input file was split into batches of 8MiB, and the outputs are json of the form `{"confusion_matrix": [[0, 0,...],[],...]}`, so the output which Batch has saved is a concatenation of these,
{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}
$ ag --count 'confusion' ~/Downloads/201603-citibike-tripdata.csv.out
29
import json

def unpack_oneliners(concatenated):
    # '{"a": 1}{"b": 2}' style: split on the '}{' seams and re-wrap
    elements = concatenated[1:-1].split('}{')
    normalized = ['{' + x + '}' for x in elements]
    return [json.loads(x) for x in normalized]

def unpack_multiliners(concatenated):
    # newline-delimited json style
    elements = concatenated.split('\n')
    elements_no_empties = [x for x in elements if x != '']
    return [json.loads(x) for x in elements_no_empties]

def unpack_concatenated(concatenated):
    if '}{' in concatenated:
        return unpack_oneliners(concatenated)
    elif '\n' in concatenated:
        return unpack_multiliners(concatenated)
    raise Exception('Shouldnt be here')
def read_file(fn):
    with open(fn) as fd:
        return fd.read()
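An alternative sketch that handles both concatenation styles in one pass with `json.JSONDecoder.raw_decode`, in case a `'}{'` ever shows up inside a string value:

```python
import json

def unpack_concatenated_raw(concatenated):
    # repeatedly decode one JSON object, then skip any whitespace between objects
    decoder = json.JSONDecoder()
    idx, out = 0, []
    while idx < len(concatenated):
        obj, end = decoder.raw_decode(concatenated, idx)
        out.append(obj)
        idx = end
        while idx < len(concatenated) and concatenated[idx].isspace():
            idx += 1
    return out
```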
local_filenames = [
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201603-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201604-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201605-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201606-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201607-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201608-citibike-tripdata.csv.out',]
batch_outputs = [read_file(x) for x in local_filenames]
unpacked = [unpack_concatenated(x) for x in batch_outputs]
rank_proba_scores = [
[x.get("rank_k_proba_scores").get('10') for x in vec]
for vec in unpacked]
monthly_means = [np.mean(vec) for vec in rank_proba_scores]
rank_k_proba_scores = [x.get("rank_k_proba_scores") for x in unpack_concatenated(concatenated_data)]
len(rank_k_proba_scores)
[x.get('10') for x in rank_k_proba_scores]
Interestingly, this doesn't seem to be degrading. I should look deeper for possible scoring problems.
In [127]: zip([x.split('/')[-1] for x in local_filenames], monthly_means)
Out[127]:
[('201603-citibike-tripdata.csv.out', 0.6159115590305551),
('201604-citibike-tripdata.csv.out', 0.6086896594073963),
('201605-citibike-tripdata.csv.out', 0.5982699463742136),
('201606-citibike-tripdata.csv.out', 0.5963199010718998),
('201607-citibike-tripdata.csv.out', 0.6042312393820078),
('201608-citibike-tripdata.csv.out', 0.5854122761160239)]
I can also look at the spread factor of the confusion matrices in these outputs.
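A rough sketch of one such spread measure (my own definition, nothing standard): the fraction of all predictions landing in the single most-predicted class, per confusion matrix, where 1.0 means total collapse onto one label:

```python
import numpy as np

def prediction_concentration(conf_matrix):
    # column sums count how many predictions went to each class
    col_sums = np.asarray(conf_matrix).sum(axis=0)
    return float(col_sums.max()) / col_sums.sum()

# e.g. [prediction_concentration(y['confusion_matrix']) for y in unpacked[-1]]
```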
* Looking at `unpacked[-1]`, the last Batch output, `201608`,
In [149]: print [sorted(zip(range(64), xconfdf.sum()), key=lambda x:x[1])[-1] for xcon
...: fdf in [pd.DataFrame.from_records(y.get('confusion_matrix')) for y in unpack
...: ed[-1]]]
[(9, 28625), (9, 27594), (9, 27853), (9, 27147), (9, 27642), (9, 27510), (9, 27963), (9, 27361), (9, 26306), (9, 24850), (9, 23930), (9, 27256), (9, 27718), (9, 27702), (9, 27121), (9, 27274), (9, 27542), (9, 27119), (9, 26991), (9, 25190), (9, 24916), (9, 26978), (9, 27208), (9, 27420), (9, 26936), (9, 27435), (9, 27285), (9, 26947), (9, 26673), (9, 25079), (9, 24164), (9, 25477), (9, 27021), (9, 26625), (9, 27119), (9, 26689), (9, 26947), (9, 26757), (9, 26867), (9, 26578), (9, 24042), (9, 23567), (9, 23287), (9, 26644), (9, 26561), (9, 26938), (9, 26166), (9, 26878), (9, 6395)]
* What is that class `9`?
import cPickle
import bikelearn.classify as blc
fn = '/Users/michal/Downloads/2018-12-07-update-model/2018-12-07-update-model/tree-foo-bundle-pensive-swirles.2018-12-04T210259ZUTC.pkl'
with open(fn) as fd: bundle = cPickle.load(fd)
# label_encoder for the end neighborhood..
blc.label_decode(bundle['label_encoders']['end_neighborhood'], [9])
# this does label_encoder.inverse_transform(vec)
In [159]: blc.label_decode(bundle['label_encoders']['end_neighborhood'], [9])
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Out[159]: array(['Central Park'], dtype=object)
* What the heck, `Central Park`? Can this really be true? Pretty weird.
* That was `201603` through `201608` data. I can also do the evaluation on `201609` through `201712`. There should damn well be some kind of deterioration, at least because many new stations should be missing from the input data? This perhaps would mean I need to update the station csv? Not sure. The functionality needs to be present such that, when taking a dataset which has stations which are not in the stations file, the data is encoded as `-1`, right, as missing?
(citilearnsage) $ head ../data/citibike/201610-citibike-tripdata.csv
Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
328,2016-10-01 00:00:07,2016-10-01 00:05:35,471,Grand St & Havemeyer St,40.71286844,-73.95698119,3077,Stagg St & Union Ave,40.70877084,-73.95095259,25254,Subscriber,1992,1
398,2016-10-01 00:00:11,2016-10-01 00:06:49,3147,E 85 St & 3 Ave,40.77801203,-73.95407149,3140,1 Ave & E 78 St,40.77140426,-73.9535166,17810,Subscriber,1988,2
430,2016-10-01 00:00:14,2016-10-01 00:07:25,345,W 13 St & 6 Ave,40.73649403,-73.99704374,470,W 20 St & 8 Ave,40.74345335,-74.00004031,20940,Subscriber,1965,1
351,2016-10-01 00:00:21,2016-10-01 00:06:12,3307,West End Ave & W 94 St,40.7941654,-73.974124,3357,W 106 St & Amsterdam Ave,40.8008363,-73.9664492472,19086,Subscriber,1993,1
2693,2016-10-01 00:00:21,2016-10-01 00:45:15,3428,8 Ave & W 16 St,40.740983,-74.001702,3323,W 106 St & Central Park West,40.7981856,-73.9605909006,26502,Subscriber,1991,1
513,2016-10-01 00:00:28,2016-10-01 00:09:02,433,E 13 St & Avenue A,40.72955361,-73.98057249,151,Cleveland Pl & Spring St,40.722103786686034,-73.99724900722504,25800,Subscriber,1995,1
601,2016-10-01 00:00:51,2016-10-01 00:10:52,3314,W 95 St & Broadway,40.7937704,-73.971888,3374,Central Park North & Adam Clayton Powell Blvd,40.799484,-73.955613,15985,Subscriber,1972,2
563,2016-10-01 00:00:54,2016-10-01 00:10:18,453,W 22 St & 8 Ave,40.74475148,-73.99915362,485,W 37 St & 5 Ave,40.75038009,-73.98338988,26018,Subscriber,1984,1
439,2016-10-01 00:00:54,2016-10-01 00:08:13,534,Water - Whitehall Plaza,40.70255065,-74.0127234,360,William St & Pine St,40.70717936,-74.00887308,15374,Subscriber,1968,1
(citilearnsage) $
(citilearnsage) $ head -2 ../data/citibike/201605-citibike-tripdata.csv
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
"538","5/1/2016 00:00:03","5/1/2016 00:09:02","536","1 Ave & E 30 St","40.74144387","-73.97536082","497","E 17 St & Broadway","40.73704984","-73.99009296","23097","Subscriber","1986","2"
2018/12/09 21:38:02 [error] 9#9: *4 client intended to send too large body: 6291424 bytes, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", host: "169.254.255.131:8080"
* Updated `nginx.conf` from `client_max_body_size 5m;` to `client_max_body_size 8m;`, and trying `201603` again worked. `201609` failed, `pensive-swirles-job-201609`, because of the error `ClientError: S3 key: s3://my-sagemaker-blah/bikelearn/datasets/uncompressed/201609-citibike-tripdata.csv matched no files on s3`, so that one was a simple fix. `pensive-swirles-job-201610`, `201611` and beyond failed with the error,
169.254.255.130 - - [24/Dec/2018:04:20:48 +0000] "POST /invocations HTTP/1.1" 500 291 "-" "Go-http-client/1.1"
ValueError: time data '2016-10-01 00:00:07' does not match format '%m/%d/%Y %H:%M:%S'
* Because the file formats on `201610`, `201611`, `201612`, as far as I can tell, have changed the date formats.
* I corrected the Docker code reading dates, from `pensive-swirles-2-11` to `pensive-swirles-2-12`, and the problem went away; a sketch of the tolerant parse is below.
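A hedged sketch of handling both date formats (`parse_start_time` is my helper name, not the actual bikelearn code):

```python
from datetime import datetime

def parse_start_time(s):
    # the 201610+ files switched from '10/8/2015 11:01:24' style
    # to '2016-10-01 00:00:07' style, so try both formats
    for fmt in ('%m/%d/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S'):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized start time format: ' + s)
```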
#### But furthermore...
* I then automated the batch transform with the API, but the first time around job `Batch-Transform-2019-01-06-233207` failed because I forgot to include my special ENVIRONMENT variable, `DO_VALIDATION`, for indicating validation.
* I did add this for the job `Batch-Transform-2019-01-06-235825` and that one was successful.
#### but then the log ...
* for `Batch-Transform-2019-01-06-235833`, there is a mixed bag of successes and also timeouts,
2019/01/07 00:02:10 [error] 9#9: *10 upstream timed out (110: Connection timed out) while sending request to upstream, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/invocations", host: "169.254.255.131:8080"
* And I don't see the typical output file generated for this on s3, so I don't think this finished.
import upload_client as uc

inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz', '201701-citibike-tripdata.csv.gz']

for inputfile in inputs:
    print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile

for inputfile in inputs:
    path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
    print path
    print uc.start_batch_transform_job(input_location=path)
    import time
    time.sleep(2)
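For reference, a hedged sketch of what `uc.start_batch_transform_job` presumably wraps, using the boto3 `create_transform_job` API (the model name, instance type, and output path here are placeholders):

```python
import time
import boto3

def start_batch_transform_job(input_location):
    # sketch only; assumes the gzipped csv input and the DO_VALIDATION env var above
    job_name = 'Batch-Transform-' + time.strftime('%Y-%m-%d-%H%M%S')
    client = boto3.client('sagemaker', region_name='us-east-1')
    client.create_transform_job(
        TransformJobName=job_name,
        ModelName='citibike-learn-blah',
        Environment={'DO_VALIDATION': 'yes'},
        MaxPayloadInMB=8,  # lines up with the nginx client_max_body_size 8m bump
        TransformInput={
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix', 'S3Uri': input_location}},
            'ContentType': 'text/csv',
            'CompressionType': 'Gzip',
            'SplitType': 'Line',
        },
        TransformOutput={'S3OutputPath': 's3://my-sagemaker-blah/bikelearn/batch-output'},
        TransformResources={'InstanceType': 'ml.m4.xlarge', 'InstanceCount': 1},
    )
    print('created transform job with name: ' + job_name)
```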
In [17]: reload(uc)
Out[17]: <module 'update_client' from 'update_client.py'>
In [18]: inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz
...: ', '201701-citibike-tripdata.csv.gz']
In [19]: for inputfile in inputs:
...: print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
...:
...:
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz
In [20]: for inputfile in inputs:
...: path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfil
...: e
...: print path
...: print uc.start_batch_transform_job(input_location=path)
...: import time
...: time.sleep(2)
...:
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001813
created transform job with name: Batch-Transform-2019-01-07-001813
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001815
created transform job with name: Batch-Transform-2019-01-07-001815
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001818
created transform job with name: Batch-Transform-2019-01-07-001818
None
In [21]:
#### General plan
* `bikelearn.tar.gz` pip package which can be installed in a Docker container, to run a training job; make the python package ..
* Docker build.. update `dockerfiles/Dockerfile` to reflect the new package, `dist/bikelearn-0.1.2.tar.gz`, if needed
* update `local_test/test_dir/model/bundle_meta.json` if needed, with the new model name
* Run the serve endpoint
* Make datasets
* Test running the train job from inside the container,
docker run -v $(pwd)/test_dir:/opt/ml --rm ${image} train