namoopsoo / learn-citibike

SageMaker model retrain #16

Closed namoopsoo closed 5 years ago

namoopsoo commented 6 years ago

General plan

Make the Python package:

source activate citilearnsage
(citilearnsage) $ python setup.py sdist
(citilearnsage) $ cp dist/bikelearn-0.1.2.tar.gz  sagemaker/mypackages/

Docker build..


#### Train job
* from within the `sagemaker` dir:
```bash
(citilearnsage) $ export image=citibike-learn-blah
(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml --rm ${image} train
```

Run serve endpoint

Make datasets with `make_datasets.py`:

(citilearnsage) $ python make_datasets.py ${MY_LOCAL_DATA_DIR}/201510-citibike-tripdata.csv ${MY_BIKE_REPO_DIR}/sagemaker/local_test/test_dir/input/data/training/
# => 

Test running the train job from inside the container,

# go into learn-citibike/sagemaker

(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah 
root@d38ba643e4ce:/opt/program# 
root@d38ba643e4ce:/opt/program#  ipython

from bikelearn.models import treefoo
namoopsoo commented 6 years ago

small snafu with zip code being perceived as a float for some reason...

ipdb> pp feature_encoding
['start_postal_code', 'start_sublocality', 'start_neighborhood', 'end_neighborhood']
ipdb> pp feature in feature_encoding
True
ipdb> pp feature
'start_postal_code'
ipdb> pp dtype
<type 'float'>
ipdb> pp dtype == float
True
ipdb> pp df[feature].head().values
array([11249., 11249., 11249., 11249., 11249.])
ipdb>
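A likely cause, stated as an assumption: when a CSV column of ZIP codes is parsed numerically (pandas falls back to float as soon as one value is missing), `11249` turns into `11249.`. The `read_csv` call later in this thread passes `dtype={'postal_code': str}` to avoid that. For values that are already floats, a minimal sketch of normalizing them back to strings (Python 3; `normalize_postal_code` is a hypothetical helper, not part of `bikelearn`):

```python
def normalize_postal_code(value):
    """Return a 5-character ZIP string, or None when the value is missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if value != value:  # NaN is the only float not equal to itself
            return None
        value = int(value)  # 11249.0 -> 11249
    text = str(value).strip()
    if not text or text.lower() == 'nan':
        return None
    if text.endswith('.0'):  # a float that was already cast to str
        text = text[:-2]
    return text.zfill(5)  # keep leading zeros, e.g. New Jersey ZIPs


for raw in [11249.0, '11249', '11249.0', 7030.0, float('nan')]:
    print(repr(raw), '->', repr(normalize_postal_code(raw)))
```

Treating the ZIP as a string matters here because it is fed to a label encoder as a category, not used as a number.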


#### ok... changed up that 

#### also from quick run earlier...
```python
In [54]: clf = RandomForestClassifier(max_depth=2, random_state=0)

In [55]: clf.fit(X_train, y_train)
Out[55]: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [57]: print clf.feature_importances_
[0.20919847 0.29880775 0.37548998 0.02124216 0.03291293 0.03416068
 0.02818804]
```
namoopsoo commented 6 years ago

ok yay, got a model trained .

In [11]: model = cPickle.load(open(modelfn))
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

In [12]: model
Out[12]:
{'bundle_name': 'tree-foo-bundle.pkl',
 'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=2, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
             oob_score=False, random_state=0, verbose=0, warm_start=False),
 'model_id': 'tree-foo'}

In [13]: model['clf'].feature_importances_
Out[13]:
array([0.34746472, 0.34359613, 0.22347966, 0.00633857, 0.01033549,
       0.03439006, 0.03439537])


#### interpreting those importances again.. 
```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood', 'start_day', 'start_hour', 'age', 'gender']
```
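Pairing the names with the scores makes the vector easier to read. A small sketch in Python 3, assuming the seven importances printed from the loaded bundle earlier in this comment are in the same order as `cols` (that ordering is an assumption here, not something the thread confirms):

```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood',
        'start_day', 'start_hour', 'age', 'gender']

# Values copied from the feature_importances_ output of the pickled model.
importances = [0.34746472, 0.34359613, 0.22347966, 0.00633857,
               0.01033549, 0.03439006, 0.03439537]

# Sort features from most to least important.
ranked = sorted(zip(cols, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('%-20s %.4f' % (name, score))

# Share of total importance carried by the three start-location features.
location_share = sum(score for name, score in ranked
                     if name in ('start_postal_code', 'start_sublocality',
                                 'start_neighborhood'))
print('location features: %.3f' % location_share)
```

Under that assumption the three start-location features carry roughly 91% of the importance, which fits a shallow `max_depth=2` forest splitting almost only on where the trip starts.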
namoopsoo commented 6 years ago

ok do another train, now cleaner

training..

(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml --rm citibike-learn-blah train
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
!missing stations: set(['Center Blvd\xc2\xa0& Borden Ave'])
(citilearnsage) $ 
(citilearnsage) $ ls -alrt local_test/test_dir/model/
total 184
drwxr-xr-x@ 5 michal  staff    170 Jul 15 14:32 ..
-rw-r--r--@ 1 michal  staff   6881 Jul 15 15:52 decision-tree-model.pkl
-rw-r--r--@ 1 michal  staff  37030 Jul 29 15:15 tree-foo-bundle.2018-07-29T1915ZUTC.pkl
-rw-r--r--@ 1 michal  staff  44124 Aug 12 13:12 tree-foo-bundle.2018-08-12T171255ZUTC.pkl
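The `!missing stations` line in the training log above shows `\xc2\xa0` inside the station name. Those two bytes are the UTF-8 encoding of a non-breaking space, so the name may fail to match a stations table that uses a plain space. A minimal sketch of normalizing names before the lookup (Python 3; `normalize_station_name` and the `known_stations` set are hypothetical, and whether the stations file really holds the plain-space spelling is an assumption):

```python
def normalize_station_name(name):
    """Decode bytes, turn non-breaking spaces into spaces, collapse runs."""
    if isinstance(name, bytes):
        name = name.decode('utf-8')
    name = name.replace('\u00a0', ' ')
    return ' '.join(name.split())


# Hypothetical stations table containing the plain-space spelling.
known_stations = {'Center Blvd & Borden Ave', 'W 26 St & 10 Ave'}

# Name exactly as it appears in the training log.
raw_name = b'Center Blvd\xc2\xa0& Borden Ave'

clean_name = normalize_station_name(raw_name)
print(repr(clean_name), clean_name in known_stations)
```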

test..

In [1]: fn = 'sagemaker/local_test/test_dir/model/tree-foo-bundle.2018-08-12T171255ZUTC.pkl'

In [2]: import cPickle
   ...: with open(fn) as fd:
   ...:     bundle = cPickle.load(fd)
   ...: bundle
   ...: 
Out[2]: 
{'bundle_name': 'tree-foo-bundle',
 'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=2, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
             oob_score=False, random_state=0, verbose=0, warm_start=False),
 'clf_info': {'feature_importances': [('start_postal_code',
    0.28489345066586647),
   ('start_sublocality', 0.27812701364478254),
   ('start_neighborhood', 0.24554879740470287),
   ('start_day', 0.022793741554315638),
   ('start_hour', 0.004790427541855301),
   ('age', 0.002772245039587746),
   ('gender', 0.0739331660713863),
   ('usertype', 0.08714115807750318)]},
 'evaluation': {'validation_proportion_correct': 0.9759847738514625},
 'features': {'dtypes': {'age': float,
   'end_neighborhood': str,
   'start_neighborhood': str,
   'start_postal_code': str,
   'start_sublocality': str,
   'usertype': str},
  'input': ['start_postal_code',
   'start_sublocality',
   'start_neighborhood',
   'start_day',
   'start_hour',
   'age',
   'gender',
   'usertype'],
  'output_label': 'end_neighborhood'},
 'label_encoders': {'age': LabelEncoder(),
  'end_neighborhood': LabelEncoder(),
  'start_neighborhood': LabelEncoder(),
  'start_postal_code': LabelEncoder(),
  'start_sublocality': LabelEncoder(),
  'usertype': LabelEncoder()},
 'model_id': 'tree-foo',
 'timestamp': '2018-08-12T171255ZUTC',
 'train_metadata': {'stations_df_fn': '/opt/ml/input/config/start_stations_103115.csv',
  'trainset_fn': '/opt/ml/input/data/training/train.2018-07-28T210403.csv'}}

In [3]: import pandas as pd
   ...: holdout_df = pd.read_csv('/......./data/citibike/201601-citibike-tripdata.csv')
   ...: holdout_df.shape
   ...: import bikelearn.classify as blc
   ...: 
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

In [6]: import bikelearn.settings as s
   ...: 

In [8]: import os

In [9]: stations_fn = os.path.join(s.DATAS_DIR, 'start_stations_103115.fuller.csv')
   ...: stations_df = pd.read_csv(stations_fn, index_col=0, dtype={'postal_code': str})
   ...: 

In [10]: %time y_predictions, y_test = blc.run_model_predict(bundle, holdout_df, stations_df)
CPU times: user 1min 9s, sys: 1.4 s, total: 1min 10s
Wall time: 59.3 s

In [11]: y_predictions.shape, y_test.shape
Out[11]: ((501173,), (501173, 1))

In [12]: zipped = zip(y_predictions, [v[0] for v in y_test])

In [13]: zipped[:5]
Out[13]: [(8, 0), (8, 0), (8, 0), (8, 0), (8, 0)]

In [14]:     correct = len([[x,y] for x,y in zipped if x == y])
    ...: 

In [15]: proportion_correct = 1.0*correct/y_predictions.shape[0]

In [16]: proportion_correct
Out[16]: 0.011076015667244645
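Note the shapes in this session: `y_predictions` is `(501173,)` while `y_test` is `(501173, 1)`, which is why each `v[0]` is unwrapped before comparing. A minimal sketch of that comparison as a reusable helper (Python 3, plain lists; `proportion_correct` is a name made up for this note and not part of `bikelearn`; with NumPy arrays one would call `.ravel()` on `y_test` instead):

```python
def proportion_correct(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label.

    y_true may be flat or a column of one-element rows, mirroring the
    (n,) versus (n, 1) shapes seen in the session.
    """
    flat_true = [row[0] if isinstance(row, (list, tuple)) else row
                 for row in y_true]
    if len(flat_true) != len(y_pred):
        raise ValueError('length mismatch: %d vs %d'
                         % (len(flat_true), len(y_pred)))
    if not flat_true:
        raise ValueError('no rows to score')
    correct = sum(1 for t, p in zip(flat_true, y_pred) if t == p)
    return correct / len(flat_true)


# Toy data shaped like the session: every prediction is 8, truth is mostly 0.
y_test_toy = [[0], [0], [0], [8]]
y_predictions_toy = [8, 8, 8, 8]
print(proportion_correct(y_test_toy, y_predictions_toy))
```

A score near 1% together with a constant prediction, as in the session, points at the label mapping rather than at the classifier's skill.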
namoopsoo commented 6 years ago

next step: make the docker image also do model invocation

interactive

docker run -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah

serve ..

docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve


#### oops... kept getting a not found..  `OSError` when running the `serve` 
* but it was because i forgot to uncomment the line to install `gunicorn`

Traceback (most recent call last):
  File "./serve", line 71, in <module>
    start_server()
  File "./serve", line 50, in start_server
    '--timeout', str(model_server_timeout),
  File "/opt/conda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/opt/conda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory


#### try again...

(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
Starting the inference server with 4 workers.
[2018-09-09 15:17:28 +0000] [13] [INFO] Starting gunicorn 19.9.0
[2018-09-09 15:17:28 +0000] [13] [INFO] Listening at: unix:/tmp/gunicorn.sock (13)
[2018-09-09 15:17:28 +0000] [13] [INFO] Using worker: gevent
[2018-09-09 15:17:28 +0000] [17] [INFO] Booting worker with pid: 17
[2018-09-09 15:17:28 +0000] [18] [INFO] Booting worker with pid: 18
[2018-09-09 15:17:28 +0000] [20] [INFO] Booting worker with pid: 20
[2018-09-09 15:17:28 +0000] [22] [INFO] Booting worker with pid: 22


* ok nice workers spawned now.
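The bare `OSError: [Errno 2]` from `subprocess` does not say which program was missing. A minimal sketch of a pre-flight check that names it, written for Python 3 (`shutil.which` does not exist on the Python 2.7 in this container, where `distutils.spawn.find_executable` plays a similar role); `require_executable` is a hypothetical helper, not part of the `serve` script:

```python
import shutil


def require_executable(name):
    """Return the resolved path of `name`, or raise with a readable message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            '%r not found on PATH; is it installed in the image?' % name)
    return path


# Demonstrate the failure message with a program that should not exist.
try:
    require_executable('gunicorn-definitely-missing')
except RuntimeError as err:
    message = str(err)
    print(message)
```

Calling something like this for `gunicorn` and `nginx` before launching the server would turn the opaque traceback into a one-line hint.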

### Ref
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
namoopsoo commented 6 years ago

try to push to repository now...

$(aws --profile myblahprofile ecr get-login --no-include-email --region us-east-1)

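docker tag citibike-learn-blah:latest citibike-learn-blah:0.1.0

docker tag citibike-learn-blah:latest xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker tag citibike-learn-blah:latest xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:0.1.0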

docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:0.1.0

Quick notes

# serve locally..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah  serve

# run interactively for  local debugging..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah 
namoopsoo commented 6 years ago

kept getting a ValidationException when creating a new custom SageMaker model

..

```
.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:0.1.4
# using the model which the training job created,
s3://my-sagemaker-blah/bikelearn/artifacts/citibike-learn-first-job-6/output/model.tar.gz
```

#### quickest way to test a sagemaker endpoint , through aws i think
```
aws --profile myblahprofile sagemaker-runtime invoke-endpoint \
    --endpoint-name "bikelearn-jade-bacon" \
    --content-type 'text/csv' \
    --body "starttime,start station name,usertype,birth year,gender\n10/1/2015 00:00:02,W 26 St & 10 Ave,Subscriber,1973,1\n10/1/2015 00:00:02,E 39 St & 2 Ave,Subscriber,1990,1" \
    'blahoutfile.json'
```
* hmm.. got a 500

#### make some changes, try again
* this time also incorporating a headerless update and more debug
```
aws --profile myblahprofile sagemaker-runtime invoke-endpoint \
    --endpoint-name "bikelearn-astral-ankle" \
    --content-type 'text/csv' \
    --body "10/1/2015 00:00:02,W 26 St & 10 Ave,Subscriber,1973,1\n10/1/2015 00:00:02,E 39 St & 2 Ave,Subscriber,1990,1" \
    'blahoutfile.json'
```

#### hmm okay looking at the sagemaker logs, it looks like..
* the `\n` is escaped as `\\n`, so since i'm sending two rows maybe i'll just send one instead
```
aws --profile myblahprofile sagemaker-runtime invoke-endpoint \
    --endpoint-name "bikelearn-astral-ankle" \
    --content-type 'text/csv' \
    --body "10/1/2015 00:00:02,W 26 St & 10 Ave,Subscriber,1973,1" \
    'blahoutfile.json'
```
* Yay! worked.
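The escaped `\n` in the logs is consistent with how a double-quoted bash string works: the backslash and the `n` are passed through as two literal characters, so the endpoint received one long row. A minimal sketch of building the same two-row CSV body with real newlines in Python 3; `build_csv_body` is a hypothetical helper, not part of `bikelearn`, and the request itself is left out:

```python
import csv
import io


def build_csv_body(rows):
    """Serialize rows to CSV text with real newline separators."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue().rstrip('\n')


rows = [
    ['10/1/2015 00:00:02', 'W 26 St & 10 Ave', 'Subscriber', '1973', '1'],
    ['10/1/2015 00:00:02', 'E 39 St & 2 Ave', 'Subscriber', '1990', '1'],
]
body = build_csv_body(rows)

# What the double-quoted shell argument actually contained: backslash + n.
shell_style_body = (r'10/1/2015 00:00:02,W 26 St & 10 Ave,Subscriber,1973,1\n'
                    r'10/1/2015 00:00:02,E 39 St & 2 Ave,Subscriber,1990,1')

print(len(body.splitlines()), 'rows with real newlines')
print(len(shell_style_body.splitlines()), 'row with a literal backslash-n')
```

In bash itself, ANSI-C quoting (`$'row1\nrow2'`) is another way to get a real newline into the argument.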
namoopsoo commented 6 years ago

the basic POC model only returning 8?! Ruh ro.

Okay, looking at a hundred examples randomly, but still getting just 8.

In [68]: outputs = []

In [69]: traindf.shape
Out[69]: (969821, 16)

In [71]: for _ in range(100):
    ...:     i = random.randint(1111, 969820)
    ...:     data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    ...:     # print data, call_api(data, url, headers).json()
    ...:     out = call_api(data, url, headers)
    ...:     try:
    ...:         out_json = out.json()
    ...:     except ValueError:
    ...:         out_json = {'text': out.text}
    ...:     except Exception as e:
    ...:         out_json = {'text': out.text, 'e': str(e.message)}
    ...:     outputs.append({'data': data, 'out_json': out_json, 'i': i})
    ...:     
    ...:     

In [72]: len(outputs)
Out[72]: 100

In [73]: outputs[0]
Out[73]: 
{'data': {'blah': {'birth_year': '1993',
   'rider_gender': '1',
   'rider_type': 'Subscriber',
   'start_station': 'Washington Pl & Broadway',
   'start_time': '10/28/2015 13:09:23'}},
 'i': 875240,
 'out_json': {u'output': u'8\n'}}

In [75]: not8 = [x for x in outputs if x['out_json']['output'] != u'8\n']

In [76]: len(not8)
Out[76]: 0

In [79]: bundle['label_encoders']
Out[79]: 
{'age': LabelEncoder(),
 'end_neighborhood': LabelEncoder(),
 'start_neighborhood': LabelEncoder(),
 'start_postal_code': LabelEncoder(),
 'start_sublocality': LabelEncoder(),
 'usertype': LabelEncoder()}

In [80]: bundle['label_encoders']['end_neighborhood']
Out[80]: LabelEncoder()

In [81]: vars(bundle['label_encoders']['end_neighborhood'])
Out[81]: 
{'classes_': array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
        'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
        'Long Island City', 'Williamsburg', 'nan'], dtype=object)}

In [82]: len(vars(bundle['label_encoders']['end_neighborhood'])['classes_'])
Out[82]: 9

Need a deeper dive.
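One thing the `classes_` array above already shows: label `8` is the last class, and the last class is the string `'nan'`. scikit-learn's `LabelEncoder` stores its classes sorted, and lowercase `'nan'` sorts after the capitalized neighborhood names. A minimal sketch of how a missing value can become a real class, assuming labels are cast with `str()` before encoding (the bundle's `dtypes` entry `'end_neighborhood': str` suggests this, but it is an assumption); `fit_classes` is a stand-in for the encoder, not the `bikelearn` code:

```python
def fit_classes(values):
    """Mimic a label encoder fit on str-cast values: sorted unique strings."""
    return sorted({str(v) for v in values})


raw_labels = ['Williamsburg', float('nan'), 'Greenpoint', -1, 'Williamsburg']
classes = fit_classes(raw_labels)
print(classes)  # ['-1', 'Greenpoint', 'Williamsburg', 'nan']

# The nine classes reported by the bundle's end_neighborhood encoder.
thread_classes = ['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
                  'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
                  'Long Island City', 'Williamsburg', 'nan']
print(thread_classes[8])  # nan
```

So a model that always answers `8` is always answering "destination neighborhood unknown", which points at the station lookup rather than the classifier.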

* Also, 13 of the 465 rows in the stations df have a `nan` neighborhood. Not ideal.
```python
(Pdb) pp stations_df.head()
                  station_name postal_code sublocality     neighborhood state
0             W 26 St & 10 Ave       10001   Manhattan          Midtown    NY
1              E 39 St & 2 Ave       10016   Manhattan          Midtown    NY
2              8 Ave & W 52 St       10019   Manhattan          Midtown    NY
3  Sullivan St & Washington Sq       10012   Manhattan  Lower Manhattan    NY
4     Bedford Ave & Nassau Ave       11222    Brooklyn       Greenpoint    NY
(Pdb) pp stations_df[stations_df.neighborhood.isnull()].shape
(13, 5)
(Pdb) pp stations_df[stations_df.neighborhood.isnull()]
                     station_name postal_code sublocality neighborhood state
29   Clermont Ave & Lafayette Ave       11238    Brooklyn          NaN    NY
45    Montrose Ave & Bushwick Ave       11206    Brooklyn          NaN    NY
66            Nassau St & Navy St       11201    Brooklyn          NaN    NY
199           FDR Drive & E 35 St       10016   Manhattan          NaN    NY
203       Fulton St & Rockwell Pl       11201    Brooklyn          NaN    NY
230              York St & Jay St       11201    Brooklyn          NaN    NY
233    Carlton Ave & Flushing Ave       11205    Brooklyn          NaN    NY
328   Central Park West & W 68 St       10023   Manhattan          NaN    NY
336            Front St & Gold St       11201    Brooklyn          NaN    NY
392            Sands St & Navy St       11201    Brooklyn          NaN    NY
396       3 Ave & Schermerhorn St       11217    Brooklyn          NaN    NY
412    Clinton Ave & Flushing Ave       11205    Brooklyn          NaN    NY
440        Railroad Ave & Kay Ave       11249    Brooklyn          NaN    NY
```

wait a second, what the heck, it's even worse on the stations_df on the bundle

In [98]: stations_df = bundle['train_metadata']['stations_df']

In [99]: stations_df.shape
Out[99]: (462, 6)

In [100]: stations_df.head()
Out[100]: 
   Unnamed: 0                 station_name  postal_code sublocality neighborhood state
0           0             W 26 St & 10 Ave          NaN   Manhattan          NaN    NY
1           1              E 39 St & 2 Ave          NaN   Manhattan          NaN    NY
2           2              8 Ave & W 52 St          NaN   Manhattan          NaN    NY
3           3  Sullivan St & Washington Sq          NaN   Manhattan          NaN    NY
4           4     Bedford Ave & Nassau Ave          NaN    Brooklyn          NaN    NY

In [101]: stations_df[stations_df.neighborhood.isnull()].shape
Out[101]: (421, 6)
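For scale: 13 of 465 missing neighborhoods is about 2.8%, while 421 of 462 in the bundled copy is about 91%. A guard run before training could catch the second case. A minimal sketch over plain dict rows (Python 3; `missing_fraction`, `check_stations` and the 5% threshold are hypothetical choices, not part of `bikelearn`):

```python
MISSING_MARKERS = {'', 'nan', 'none'}


def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is empty, None or NaN."""
    if not rows:
        raise ValueError('no station rows to check')
    missing = sum(1 for row in rows
                  if str(row.get(column, '')).strip().lower() in MISSING_MARKERS)
    return missing / len(rows)


def check_stations(rows, columns=('postal_code', 'neighborhood'),
                   max_missing=0.05):
    """Raise ValueError if any column exceeds the allowed missing fraction."""
    for column in columns:
        frac = missing_fraction(rows, column)
        if frac > max_missing:
            raise ValueError('%s: %.1f%% missing' % (column, 100 * frac))


# Rows shaped like the healthy table and like the copy found on the bundle.
good_rows = [
    {'station_name': 'W 26 St & 10 Ave', 'postal_code': '10001',
     'neighborhood': 'Midtown'},
    {'station_name': 'Bedford Ave & Nassau Ave', 'postal_code': '11222',
     'neighborhood': 'Greenpoint'},
]
bad_rows = [
    {'station_name': 'W 26 St & 10 Ave', 'postal_code': float('nan'),
     'neighborhood': float('nan')},
    {'station_name': 'E 39 St & 2 Ave', 'postal_code': float('nan'),
     'neighborhood': float('nan')},
]

check_stations(good_rows)  # passes silently
try:
    check_stations(bad_rows)
except ValueError as err:
    print('stations table rejected:', err)
```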
namoopsoo commented 6 years ago

okay, after fixing the stations_df , that particular nan problem is fixed!

url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests
def call_api(data, url, headers):
    out = requests.post(url, json=data, headers=headers)
    return out

outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    # print data, call_api(data, url, headers).json()
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})

In [114]: Counter([x['out_json']['output'].strip() for x in outputs])
Out[114]: Counter({u'17': 100})

In [116]: with open('/Users/michal/Downloads/2018-10-21-comparingmodels/job-7/build/tree-foo-bundle.2018-10-21T205144ZUTC.pkl') as fd:
     ...:     winterbundle = cPickle.load(fd)
     ...:

In [124]: winterbundle['label_encoders']['end_neighborhood'].classes_
Out[124]:
array(['-1', 'Alphabet City', 'Bedford-Stuyvesant', 'Boerum Hill',
       'Brooklyn Heights', 'Central Park', 'Clinton Hill',
       'Downtown Brooklyn', 'Dumbo', 'East Village', 'Fort Greene',
       'Greenpoint', "Hell's Kitchen", 'Lincoln Square',
       'Long Island City', 'Lower East Side', 'Lower Manhattan',
       'Midtown', 'Midtown East', 'Midtown West', 'Murray Hill',
       'Navy Yard', 'Upper East Side', 'Upper West Side', 'Vinegar Hill',
       'Williamsburg', 'Yorkville'], dtype=object)

In [125]: len(winterbundle['label_encoders']['end_neighborhood'].classes_)
Out[125]: 27

In [126]: winterbundle['evaluation']
Out[126]: {'validation_proportion_correct': 0.46358773861778346}

In [127]: simpledf.shape
Out[127]: (969384, 31)

In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}

In [156]: foo = call_api(data, url, headers)

* So yea looks like that works. and ran on full dataset too..

In [138]: %time y_predictionsmany, y_testmany = blc.run_model_predict(winterbundle, atraindf, fullerstations_df, True)
CPU times: user 2min 7s, sys: 4.41 s, total: 2min 11s
Wall time: 2min 9s

In [139]: len(y_predictionsmany), len(y_testmany)
Out[139]: (969384, 969384)

In [140]: %paste
def get_basic_proportion_correct(y_test, y_predictions):
    zipped = zip(y_test, y_predictions)
    correct = len([[x,y] for x,y in zipped if x == y])
    proportion_correct = 1.0*correct/y_test.shape[0]
    return proportion_correct

-- End pasted text --

In [141]: get_basic_proportion_correct(y_predictionsmany, y_testmany)
Out[141]: 0.4634850585526479

In [143]: Counter(y_predictionsmany)
Out[143]: Counter({16: 123450, 17: 811252, 25: 34682})

namoopsoo commented 6 years ago

ok cool. after updating stations_df, now got rid of the nans problem

In [139]: len(y_predictionsmany), len(y_testmany)
Out[139]: (969384, 969384)

In [140]: %paste
def get_basic_proportion_correct(y_test, y_predictions):
    zipped = zip(y_test, y_predictions)
    correct = len([[x,y] for x,y in zipped if x == y])
    proportion_correct = 1.0*correct/y_test.shape[0]
    return proportion_correct

-- End pasted text --

In [141]: get_basic_proportion_correct(y_predictionsmany, y_testmany)
Out[141]: 0.4634850585526479

In [142]: Counter(y_predictionsmany[:10])
Out[142]: Counter({16: 1, 17: 9})

In [143]: Counter(y_predictionsmany)
Out[143]: Counter({16: 123450, 17: 811252, 25: 34682})

* ,
```python
In [152]: print dict(Counter([x[0] for x in y_testmany]))
{1: 1231, 2: 9011, 3: 2798, 4: 7428, 5: 6826, 6: 6835, 7: 12503, 8: 4165, 9: 6277, 10: 9415, 11: 9750, 12: 7191, 13: 5080, 14: 5628, 15: 12146, 16: 322132, 17: 409018, 18: 12325, 19: 9641, 20: 2444, 21: 1467, 22: 39364, 23: 34895, 24: 629, 25: 29848, 26: 1337}

url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests
def call_api(data, url, headers):

    out = requests.post(url, json=data, headers=headers)
    return out

In [154]: %history 109 
outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    # print data, call_api(data, url, headers).json()
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})

In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}

In [156]: foo = call_api(data, url, headers)

In [157]: foo
Out[157]: <Response [200]>

In [158]: print foo.json()
{u'output': u'17\n'}

In [160]: call_api(data, url, {})
Out[160]: <Response [200]>

In [161]: call_api(data, url, {}).json()
Out[161]: {u'output': u'17\n'}
```
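The two `Counter` outputs in this comment allow a quick sanity check on the 0.4635 score. Against the 27-class list shown earlier in the thread, labels 16, 17 and 25 decode to Lower Manhattan, Midtown and Williamsburg, so the model only ever predicts three neighborhoods. A minimal sketch comparing the reported accuracy with an "always predict the most common class" baseline (Python 3; all counts are copied from the outputs in this thread, the variable names are made up):

```python
# True-label counts, from Counter([x[0] for x in y_testmany]).
true_counts = {1: 1231, 2: 9011, 3: 2798, 4: 7428, 5: 6826, 6: 6835,
               7: 12503, 8: 4165, 9: 6277, 10: 9415, 11: 9750, 12: 7191,
               13: 5080, 14: 5628, 15: 12146, 16: 322132, 17: 409018,
               18: 12325, 19: 9641, 20: 2444, 21: 1467, 22: 39364,
               23: 34895, 24: 629, 25: 29848, 26: 1337}

# Predicted-label counts, from Counter(y_predictionsmany).
predicted_counts = {16: 123450, 17: 811252, 25: 34682}

reported_accuracy = 0.4634850585526479

total = sum(true_counts.values())
majority_label, majority_count = max(true_counts.items(),
                                     key=lambda item: item[1])
baseline = majority_count / total
lift = reported_accuracy - baseline
never_predicted = sorted(set(true_counts) - set(predicted_counts))

print('rows: %d' % total)
print('majority class %d baseline: %.4f' % (majority_label, baseline))
print('lift over baseline: %.4f' % lift)
print('classes never predicted: %d of %d'
      % (len(never_predicted), len(true_counts)))
```

The baseline lands at about 0.42, so the model is roughly four points better than always answering Midtown.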
namoopsoo commented 6 years ago

more metrics...

namoopsoo commented 6 years ago

still seem to be seeing 'nan' calculated in the proportion correct data..

> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(14)do_validation()
     13     classes = clf.classes_
---> 14     y_predict_proba = clf.predict_proba(X_validation)
     15 

ipdb> pp classes
array([1, 2, 3, 4, 5, 6, 7, 8])
ipdb> n
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(16)do_validation()
     15 
---> 16     metrics = gather_metrics(y_validation, y_predictions, y_predict_proba, classes)
     17     return metrics

ipdb> from collections import Counter
ipdb> Counter(y_predictions)
Counter({8: 100235})
ipdb> Counter(y_validation)
Counter({8: 97977, 7: 862, 5: 364, 4: 357, 6: 255, 1: 174, 3: 130, 2: 116})
ipdb> 

looking at confusion matrix, basically makes this nearly obvious,

ipdb> pp skm.confusion_matrix(y_validation, y_predictions, classes)
array([[    0,     0,     0,     0,     0,     0,     0,   174],
       [    0,     0,     0,     0,     0,     0,     0,   116],
       [    0,     0,     0,     0,     0,     0,     0,   130],
       [    0,     0,     0,     0,     0,     0,     0,   357],
       [    0,     0,     0,     0,     0,     0,     0,   364],
       [    0,     0,     0,     0,     0,     0,     0,   255],
       [    0,     0,     0,     0,     0,     0,     0,   862],
       [    0,     0,     0,     0,     0,     0,     0, 97977]])
ipdb> 

ipdb> pp zipped[:3]
[(8, [8]), (8, [8]), (8, [8])]
ipdb>
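The confusion matrix above has a single non-zero column, which is why plain accuracy looks excellent here while saying almost nothing. A minimal sketch that puts a number on it with per-class recall, using the validation counts from the `Counter` output in this session (Python 3; variable names are made up, and "balanced accuracy" here simply means the mean of per-class recall):

```python
# Counter(y_validation) from the ipdb session; every prediction was 8.
validation_counts = {8: 97977, 7: 862, 5: 364, 4: 357,
                     6: 255, 1: 174, 3: 130, 2: 116}
predicted_label = 8

total = sum(validation_counts.values())

# Plain accuracy: only rows whose true label is 8 are counted as correct.
accuracy = validation_counts[predicted_label] / total

# Recall per class: 1.0 for the class always predicted, 0.0 for the rest.
recall = {label: (1.0 if label == predicted_label else 0.0)
          for label in validation_counts}
balanced_accuracy = sum(recall.values()) / len(recall)

print('rows: %d' % total)
print('accuracy: %.4f' % accuracy)
print('balanced accuracy: %.3f' % balanced_accuracy)
```

Accuracy comes out near 0.977 while balanced accuracy is 0.125, exactly what a constant predictor over eight classes gives.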


####  here..
```python

ipdb> pp label_encoders
{'age': LabelEncoder(),
 'end_neighborhood': LabelEncoder(),
 'start_neighborhood': LabelEncoder(),
 'start_postal_code': LabelEncoder(),
 'start_sublocality': LabelEncoder(),
 'usertype': LabelEncoder()}
ipdb> pp label_encoders['end_neighborhood']
LabelEncoder()
ipdb> pp label_encoders['end_neighborhood'].classes_
array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
       'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
       'Long Island City', 'Williamsburg', 'nan'], dtype=object)
ipdb>
```

adding a new assert just in case..

    assert not any(['nan' in le.classes_ for le in label_encoders.values()])
namoopsoo commented 6 years ago

ok try to get an update for stations..

import bikelearn.settings as s
s.GOOGLE_GEO_API_KEY

address = "W 26 St & 10 Ave"
import bikelearn.get_station_geolocation_data as getgeo

address = "W 26 St & 10 Ave"
data = getgeo.get_geocoding_results(address)
help(getgeo.get_geocoding_results)
getgeo.get_geocoding_results??
import ipdb
getgeo.get_geocoding_results??
data = ipdb.runcall(getgeo.get_geocoding_results, address, request_type='geo')

## stations file
stations_json_filename = \
        'datas/start_stations_103115.fuller.csv'
stationsdf = pd.read_csv(stations_json_filename, index_col=0)

foo = ipdb.runcall(getgeo.get_station_geoloc_data, stationsdf=stationsdf.iloc[:5])

Looking at my station data, it's got some not-ideal stuff in there:

In [84]: validsdf.levels.value_counts()
Out[84]: 
4    428
2    301
3     95
1     56
0      3
Name: levels, dtype: int64

In [85]: validsdf['type'] = validsdf['address'].map(lambda x: 'geo' if x.endswith('NY') else 'latlng')

In [86]: validsdf[validsdf['type'] == 'geo'].levels.value_counts()
Out[86]: 
2    300
3     82
1     56
4     21
0      3
Name: levels, dtype: int64

In [87]: validsdf[validsdf['type'] == 'latlng'].levels.value_counts()
Out[87]: 
4    407
3     13
2      1
Name: levels, dtype: int64

Deleted the incomplete ones. Will try that again.

In [99]: getgeo.redis_client.hdel(s.GEO_RAW_RESULTS, *addresses_todelete)
Out[99]: 441
namoopsoo commented 5 years ago

whoops

ipdb> pp geocoding_result
{u'error_message': u'You have exceeded your daily request quota for this API. If you did not set a custom daily request quota, verify your project has an active billing account: http://g.co/dev/maps-no-account',
 u'results': [],
 u'status': u'OVER_QUERY_LIMIT'}
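A quota error like this carries `results: []`, so if it is parsed or cached like a normal response it just looks like a station with no data. A minimal sketch of checking `status` before using the payload (Python 3; `QuotaExceeded` and `results_or_raise` are hypothetical names, and whether `getgeo` stores error responses in its cache is not shown in this thread):

```python
class QuotaExceeded(Exception):
    """Raised when the geocoding API reports a quota or rate limit."""


QUOTA_STATUSES = ('OVER_QUERY_LIMIT', 'OVER_DAILY_LIMIT')


def results_or_raise(response):
    """Return the results list, raising instead of hiding API errors."""
    status = response.get('status')
    if status in QUOTA_STATUSES:
        raise QuotaExceeded(response.get('error_message', status))
    if status == 'ZERO_RESULTS':
        return []
    if status != 'OK':
        raise ValueError('geocoding failed with status %r' % status)
    return response.get('results', [])


# Same shape as the response pasted above (message shortened).
quota_response = {
    'error_message': 'You have exceeded your daily request quota for this API.',
    'results': [],
    'status': 'OVER_QUERY_LIMIT',
}

try:
    results_or_raise(quota_response)
except QuotaExceeded as err:
    print('stop and retry later:', err)
```

Raising here keeps a quota failure from being written to the cache as if it were an empty but valid answer.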

having this result, where intersection i put in is recognized as a street

namoopsoo commented 5 years ago

after deleting and retrying, results were crappy again, but i might know why..

stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])

tracked down an edit in git log ,

commit 438e425482db1c105ae6a22b8248696d0f91dfef
Date:   Mon Oct 2 17:56:15 2017 -0400

...

Update code and try again...

namoopsoo commented 5 years ago

Hmm, but still problematic.

stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])

In [167]: getgeo.extract_lat_lng_from_response(out.json()['results'])
Out[167]: {u'lat': 40.7957399, u'lng': -73.93892129999999}
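The response dumped later in this comment hints at one reason results can look thin: its first entry is an `establishment` whose only address component is the country, while the postal code and sublocality only appear in later entries. A minimal sketch of scanning all results for a component type instead of trusting the first one (Python 3; `first_component` is a hypothetical helper, not the `getgeo` implementation, and the sample data is abridged from that dump):

```python
def first_component(results, wanted_type):
    """Return long_name of the first address component tagged wanted_type.

    Every result is scanned in order, so a sparse leading result does not
    hide the components carried by later ones.
    """
    for result in results:
        for component in result.get('address_components', []):
            if wanted_type in component.get('types', []):
                return component.get('long_name')
    return None


# Two results abridged from the dumped response.
results = [
    {'address_components': [
        {'long_name': 'United States', 'short_name': 'US',
         'types': ['country', 'political']}],
     'types': ['establishment', 'point_of_interest']},
    {'address_components': [
        {'long_name': 'Brooklyn', 'short_name': 'Brooklyn',
         'types': ['political', 'sublocality', 'sublocality_level_1']},
        {'long_name': '11205', 'short_name': '11205',
         'types': ['postal_code']}],
     'types': ['street_address']},
]

print(first_component(results, 'postal_code'))
print(first_component(results, 'sublocality'))
print(first_component(results, 'neighborhood'))
```

Neither abridged result carries a `neighborhood` component, so that lookup comes back empty and would still need another source.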

The station gathering code ... , also,

stations = getgeo.extract_stations_from_files(filenames=[
    "201510-citibike-tripdata.csv", "201601-citibike-tripdata.csv",
    ...
    ])

{u'plus_code': {u'compound_code': u'M2XJ+44 New York, NY, USA',
                u'global_code': u'87G8M2XJ+44'},
 u'results': [{u'address_components': [{u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'Clinton Ave & Flushing Ave, United States',
               u'geometry': {u'location': {u'lat': 40.69794,
                                           u'lng': -73.96986849999999},
                             u'location_type': u'GEOMETRIC_CENTER',
                             u'viewport': {u'northeast': {u'lat': 40.6992889802915,
                                                          u'lng': -73.9685195197085},
                                           u'southwest': {u'lat': 40.6965910197085,
                                                          u'lng': -73.9712174802915}}},
               u'place_id': u'ChIJKwagFMZbwokRtjuK5VRvb-E',
               u'plus_code': {u'compound_code': u'M2XJ+53 New York, United States',
                              u'global_code': u'87G8M2XJ+53'},
               u'types': [u'establishment', u'point_of_interest']},
              {u'address_components': [{u'long_name': u'164',
                                        u'short_name': u'164',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Flushing Avenue',
                                        u'short_name': u'Flushing Ave',
                                        u'types': [u'route']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'11205',
                                        u'short_name': u'11205',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'164 Flushing Ave, Brooklyn, NY 11205, USA',
               u'geometry': {u'location': {u'lat': 40.6976577,
                                           u'lng': -73.9699304},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 40.6990066802915,
                                                          u'lng': -73.9685814197085},
                                           u'southwest': {u'lat': 40.6963087197085,
                                                          u'lng': -73.9712793802915}}},
               u'place_id': u'ChIJ3TPWP8ZbwokR8dlF4H6Vr8Q',
               u'plus_code': {u'compound_code': u'M2XJ+32 New York, United States',
                              u'global_code': u'87G8M2XJ+32'},
               u'types': [u'street_address']},
              {u'address_components': [{u'long_name': u'168',
                                        u'short_name': u'168',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Flushing Avenue',
                                        u'short_name': u'Flushing Ave',
                                        u'types': [u'route']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'11205',
                                        u'short_name': u'11205',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'168 Flushing Ave, Brooklyn, NY 11205, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6976934,
                                                        u'lng': -73.96936649999999},
                                         u'southwest': {u'lat': 40.697549,
                                                        u'lng': -73.969557}},
                             u'location': {u'lat': 40.6976345,
                                           u'lng': -73.969481},
                             u'location_type': u'ROOFTOP',
                             u'viewport': {u'northeast': {u'lat': 40.6989701802915,
                                                          u'lng': -73.96811276970848},
                                           u'southwest': {u'lat': 40.6962722197085,
                                                          u'lng': -73.9708107302915}}},
               u'place_id': u'ChIJ43kQasZbwokRMEsGfYDYL6Y',
               u'types': [u'premise']},
              {u'address_components': [{u'long_name': u'99',
                                        u'short_name': u'99',
                                        u'types': [u'street_number']},
                                       {u'long_name': u'Flushing Avenue',
                                        u'short_name': u'Flushing Ave',
                                        u'types': [u'route']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'11205',
                                        u'short_name': u'11205',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'99 Flushing Ave, Brooklyn, NY 11205, USA',
               u'geometry': {u'location': {u'lat': 40.6978344,
                                           u'lng': -73.96973249999999},
                             u'location_type': u'RANGE_INTERPOLATED',
                             u'viewport': {u'northeast': {u'lat': 40.69918338029149,
                                                          u'lng': -73.96838351970848},
                                           u'southwest': {u'lat': 40.69648541970849,
                                                          u'lng': -73.9710814802915}}},
               u'place_id': u'Eig5OSBGbHVzaGluZyBBdmUsIEJyb29rbHluLCBOWSAxMTIwNSwgVVNBIhoSGAoUChIJtcpdFsZbwokRiFlwXEBWntUQYw',
               u'types': [u'street_address']},
              {u'address_components': [{u'long_name': u'Clinton Avenue',
                                        u'short_name': u'Clinton Ave',
                                        u'types': [u'route']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']},
                                       {u'long_name': u'11205',
                                        u'short_name': u'11205',
                                        u'types': [u'postal_code']}],
               u'formatted_address': u'Clinton Ave, Brooklyn, NY 11205, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6978344,
                                                        u'lng': -73.96973249999999},
                                         u'southwest': {u'lat': 40.6977702,
                                                        u'lng': -73.9697406}},
                             u'location': {u'lat': 40.6978023,
                                           u'lng': -73.96973659999999},
                             u'location_type': u'GEOMETRIC_CENTER',
                             u'viewport': {u'northeast': {u'lat': 40.6991512802915,
                                                          u'lng': -73.96838756970848},
                                           u'southwest': {u'lat': 40.6964533197085,
                                                          u'lng': -73.9710855302915}}},
               u'place_id': u'ChIJu0sva8ZbwokRrMKu_ii2I2Y',
               u'types': [u'route']},
              {u'address_components': [{u'long_name': u'11205',
                                        u'short_name': u'11205',
                                        u'types': [u'postal_code']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'Brooklyn, NY 11205, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.7066249,
                                                        u'lng': -73.948331},
                                         u'southwest': {u'lat': 40.68741490000001,
                                                        u'lng': -73.980632}},
                             u'location': {u'lat': 40.6945036,
                                           u'lng': -73.9565551},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 40.7066249,
                                                          u'lng': -73.948331},
                                           u'southwest': {u'lat': 40.68741490000001,
                                                          u'lng': -73.980632}}},
               u'place_id': u'ChIJLywE8sZbwokRiLapmmo79YU',
               u'types': [u'postal_code']},
              {u'address_components': [{u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'Kings County, Brooklyn, NY, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
                                                        u'lng': -73.8333651},
                                         u'southwest': {u'lat': 40.551042,
                                                        u'lng': -74.05663}},
                             u'location': {u'lat': 40.6528762,
                                           u'lng': -73.95949399999999},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 40.739446,
                                                          u'lng': -73.8333651},
                                           u'southwest': {u'lat': 40.551042,
                                                          u'lng': -74.05663}}},
               u'place_id': u'ChIJOwE7_GTtwokRs75rhW4_I6M',
               u'types': [u'administrative_area_level_2', u'political']},
              {u'address_components': [{u'long_name': u'Brooklyn',
                                        u'short_name': u'Brooklyn',
                                        u'types': [u'political',
                                                   u'sublocality',
                                                   u'sublocality_level_1']},
                                       {u'long_name': u'Kings County',
                                        u'short_name': u'Kings County',
                                        u'types': [u'administrative_area_level_2',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'Brooklyn, NY, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
                                                        u'lng': -73.8333651},
                                         u'southwest': {u'lat': 40.551042,
                                                        u'lng': -74.05663}},
                             u'location': {u'lat': 40.6781784,
                                           u'lng': -73.9441579},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 40.739446,
                                                          u'lng': -73.8333651},
                                           u'southwest': {u'lat': 40.551042,
                                                          u'lng': -74.05663}}},
               u'place_id': u'ChIJCSF8lBZEwokRhngABHRcdoI',
               u'types': [u'political',
                          u'sublocality',
                          u'sublocality_level_1']},
              {u'address_components': [{u'long_name': u'New York',
                                        u'short_name': u'New York',
                                        u'types': [u'locality',
                                                   u'political']},
                                       {u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'New York, NY, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 40.9175771,
                                                        u'lng': -73.70027209999999},
                                         u'southwest': {u'lat': 40.4773991,
                                                        u'lng': -74.25908989999999}},
                             u'location': {u'lat': 40.7127753,
                                           u'lng': -74.0059728},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 40.9175771,
                                                          u'lng': -73.70027209999999},
                                           u'southwest': {u'lat': 40.4773991,
                                                          u'lng': -74.25908989999999}}},
               u'place_id': u'ChIJOwg_06VPwokRYv534QaPC8g',
               u'types': [u'locality', u'political']},
              {u'address_components': [{u'long_name': u'New York',
                                        u'short_name': u'NY',
                                        u'types': [u'administrative_area_level_1',
                                                   u'political']},
                                       {u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'New York, USA',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 45.015865,
                                                        u'lng': -71.777491},
                                         u'southwest': {u'lat': 40.4773991,
                                                        u'lng': -79.7625901}},
                             u'location': {u'lat': 43.2994285,
                                           u'lng': -74.21793260000001},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 45.015865,
                                                          u'lng': -71.777491},
                                           u'southwest': {u'lat': 40.4773991,
                                                          u'lng': -79.7625901}}},
               u'place_id': u'ChIJqaUj8fBLzEwRZ5UY3sHGz90',
               u'types': [u'administrative_area_level_1', u'political']},
              {u'address_components': [{u'long_name': u'United States',
                                        u'short_name': u'US',
                                        u'types': [u'country',
                                                   u'political']}],
               u'formatted_address': u'United States',
               u'geometry': {u'bounds': {u'northeast': {u'lat': 71.5388001,
                                                        u'lng': -66.885417},
                                         u'southwest': {u'lat': 18.7763,
                                                        u'lng': 170.5957}},
                             u'location': {u'lat': 37.09024,
                                           u'lng': -95.712891},
                             u'location_type': u'APPROXIMATE',
                             u'viewport': {u'northeast': {u'lat': 71.5388001,
                                                          u'lng': -66.885417},
                                           u'southwest': {u'lat': 18.7763,
                                                          u'lng': 170.5957}}},
               u'place_id': u'ChIJCzYy5IS16lQRQrfeQ5K5Oxw',
               u'types': [u'country', u'political']}],
 u'status': u'OK'}
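
The fields used later in this thread (`neighborhood`, `postal_code`, `state`, `sublocality`) can be pulled out of a response shaped like the one above. This is a minimal sketch, assuming the response is already parsed into a dict; `extract_fields` and `WANTED` are hypothetical names, not part of `getgeo`. In the portion of the response shown here, no component carries a `neighborhood` type, so that field would come back as `None`.

```python
# Map our field names to (geocoder component type, which name to keep).
# These names are illustrative; the real getgeo code may differ.
WANTED = {
    'neighborhood': ('neighborhood', 'long_name'),
    'sublocality': ('sublocality', 'long_name'),
    'postal_code': ('postal_code', 'long_name'),
    'state': ('administrative_area_level_1', 'short_name'),
}


def extract_fields(response):
    """Return the first match for each wanted field, or None if absent."""
    found = dict.fromkeys(WANTED)
    for result in response.get('results', []):
        for component in result.get('address_components', []):
            for field, (type_name, name_key) in WANTED.items():
                if found[field] is None and type_name in component['types']:
                    found[field] = component[name_key]
    return found


# Trimmed copy of one result from the output above.
sample = {
    'results': [{
        'address_components': [
            {'long_name': 'Clinton Avenue', 'short_name': 'Clinton Ave',
             'types': ['route']},
            {'long_name': 'Brooklyn', 'short_name': 'Brooklyn',
             'types': ['political', 'sublocality', 'sublocality_level_1']},
            {'long_name': 'New York', 'short_name': 'NY',
             'types': ['administrative_area_level_1', 'political']},
            {'long_name': '11205', 'short_name': '11205',
             'types': ['postal_code']},
        ],
    }],
    'status': 'OK',
}
print(extract_fields(sample))
```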
namoopsoo commented 5 years ago

The station lat/long also changed over time. Darn, so the dedupe still has other concerns to handle.


In [421]: sorted(dedupeddf['start station name'].value_counts().to_dict().items(), key
     ...: =lambda x: x[1])[-1]
Out[421]: ('Kent Ave & N 7 St', 3)

In [422]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St']
Out[422]: 
    start station name              ...                                     latlng
360  Kent Ave & N 7 St              ...               40.7207255341,-73.9612591267
433  Kent Ave & N 7 St              ...                   40.72057658,-73.96150225
441  Kent Ave & N 7 St              ...                40.720367753,-73.9616507292

[3 rows x 4 columns]

In [423]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St'].values
Out[423]: 
array([['Kent Ave & N 7 St', 40.720725534125954, -73.9612591266632,
        '40.7207255341,-73.9612591267'],
       ['Kent Ave & N 7 St', 40.72057658, -73.96150225,
        '40.72057658,-73.96150225'],
       ['Kent Ave & N 7 St', 40.72036775298455, -73.96165072917937,
        '40.720367753,-73.9616507292']], dtype=object)
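
One way to handle this is to collapse rows by station name and average the coordinates. The sketch below is not what `getgeo.some_stationdf_dedupe` does (its body is not shown in this thread); `dedupe_by_name` is a hypothetical helper. It assumes rows sharing a name are the same physical station whose reported position drifted slightly, which looks true for the three rows above but would blur a station that genuinely moved far away.

```python
from collections import defaultdict


def dedupe_by_name(rows):
    """Collapse (name, lat, lng) rows to one entry per name, averaging coordinates."""
    grouped = defaultdict(list)
    for name, lat, lng in rows:
        grouped[name].append((lat, lng))
    deduped = {}
    for name, coords in grouped.items():
        lats, lngs = zip(*coords)
        deduped[name] = (sum(lats) / len(lats), sum(lngs) / len(lngs))
    return deduped


# The three rows from the output above.
rows = [
    ('Kent Ave & N 7 St', 40.720725534125954, -73.9612591266632),
    ('Kent Ave & N 7 St', 40.72057658, -73.96150225),
    ('Kent Ave & N 7 St', 40.72036775298455, -73.96165072917937),
]
print(dedupe_by_name(rows))
```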
namoopsoo commented 5 years ago

dedupe differently

 %time stationsdf = getgeo.extract_stations_latlng_df_from_files(filenames=[
...
])
# Wall time: 1min 25s
# In [414]: stationsdf.shape
# Out[414]: (7127, 4)

dedupeddf = getgeo.some_stationdf_dedupe(stationsdf)
# In [431]: dedupeddf.shape
# Out[431]: (682, 4)

annotated_df = getgeo.annotate_station_df(dedupeddf)

newdf = pd.concat([dedupeddf, foodf], axis=1)
In [563]: handmade
Out[563]: 
[{'neighborhood': 'Fort Greene', 'station': 'DeKalb Ave & Hudson Ave'},
 {'neighborhood': u'NoMad',
  'postal_code': u'10001',
  'state': u'NY',
  'station': 'Broadway & W 29 St',
  'sublocality': u'Manhattan'},
 {'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Gold St'},
 {'neighborhood': 'Dumbo', 'station': 'York St & Jay St'},
 {'neighborhood': 'Brooklyn Navy Yard',
  'station': 'Flushing Ave & Carlton Ave'},
 {'neighborhood': 'Columbia Street Waterfront District',
  'station': 'Atlantic Ave & Furman St'},
 {'neighborhood': 'Brooklyn Navy Yard', 'station': 'Railroad Ave & Kay Ave'},
 {'neighborhood': 'Brooklyn Navy Yard',
  'station': 'Clinton Ave & Flushing Ave'},
 {'neighborhood': 'Park Slope', 'station': 'Dean St & 4 Ave'},
 {'neighborhood': 'Brooklyn Navy Yard', 'station': 'Nassau St & Navy St'},
 {'neighborhood': 'Vinegar Hill', 'station': 'Front St & Gold St'},
 {'neighborhood': 'Brooklyn Navy Yard', 'station': '7 Ave & Farragut St'},
 {'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Navy St'},
 {'neighborhood': 'Brooklyn Navy Yard',
  'station': 'Carlton Ave & Flushing Ave'},
 {'neighborhood': 'Financial District', 'station': 'Peck Slip & Front St'},
 {'neighborhood': 'Prospect Heights',
  'station': 'Bike The Branches - Central Branch'},
 {'neighborhood': 'Cobble Hill', 'station': 'Henry St & Degraw St'},
 {'neighborhood': 'Park Slope', 'station': 'Union St & 4 Ave'},
 {'neighborhood': 'Park Slope', 'station': 'Douglass St & 4 Ave'},
 {'neighborhood': 'Prospect Park',
  'station': 'Bike in Movie Night | Prospect Park Bandshell'},
 {'neighborhood': 'Park Slope', 'station': 'West Drive & Prospect Park West'},
 {'neighborhood': 'Park Slope', 'station': '4 Ave & 2 St'}]
handmadedf = handmadedf.rename(columns={'neighborhood': 'hand_neighborhood'})

moredf = newdf.merge(handmadedf, left_on='start station name', right_on='station', how='left')

# Fill in the 22 hand-made neighborhood labels where the geocoder gave none.
# (.loc with a boolean mask replaces the deprecated .ix indexer)
missing = moredf.neighborhood.isnull()
moredf.loc[missing, 'neighborhood'] = moredf.loc[missing, 'hand_neighborhood']

moredf.drop(labels=['sublocality_y', 'state_y', 'postal_code_y'],axis=1, inplace=True)

moredf.rename(columns={'sublocality_x':'sublocality','state_x':'state','postal_code_x': 'postal_code'},inplace=True)

moredf.drop(labels=['hand_neighborhood', 'station'],axis=1, inplace=True)

In [602]: fn = '/........./learn-citibike/datas/station
     ...: s/stations-2018-12-04-b.pkl'

In [603]: moredf.to_pickle(fn)
namoopsoo commented 5 years ago

Stations that were previously missing when training a few weeks ago:


In [604]: previouslymissing = ['MacDougal St & Prince St', 'Washington Square E', 'E 8
     ...: 1 St & York Ave', 'Schermerhorn St & Court St', 'E 77 St & Park Ave', 'Cente
     ...: r Blvd & Borden Ave', 'E 77 St & 3 Ave', 'E 71 St & 2 Ave', 'University Pl &
     ...:  E 8 St', 'Leonard St & Meeker Ave', 'PABT Valet', 'Broadway & Roebling St',
     ...:  'E 80 St & 2 Ave', 'Union Ave & N 12 St', 'Monroe St & Tompkins Ave', 'E 40
     ...:  St & 5 Ave', 'E 58 St & 1 Ave']

In [605]: moredf['start station name'].value_counts().shape
Out[605]: (682,)

In [606]: moredf.shape
Out[606]: (682, 9)

In [607]: moredf[moredf['start station name'].isin(previouslymissing)].shape
Out[607]: (17, 9)

In [608]: len(previouslymissing)
Out[608]: 17

All 17 are present now, so time for another re-train.

namoopsoo commented 5 years ago

print one of the trees..

from sklearn import tree
from sklearn.externals.six import StringIO  
import pydot 

clffnpdf = 'myfile.pdf'

dot_data = StringIO()

# First one 
clf = bundle['clf'].estimators_[0] 
tree.export_graphviz(clf, out_file=dot_data) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf(clffnpdf) 

clf.1.dot.pdf

namoopsoo commented 5 years ago

Ah darn, batch transform payload size limitations?

Hmm, reading the docs, 6 MiB looks like the default maximum payload size.

OK, that worked.
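
The thread does not record which setting was changed. In the `CreateTransformJob` API, the request size limit is `MaxPayloadInMB` (default 6), while `SplitType` and `BatchStrategy` control how an input file is chunked into requests. Below is a sketch of the size-related parts of such a request, assuming line-delimited CSV input; `payload_settings` is a hypothetical helper and the chosen values are illustrative, not the author's.

```python
def payload_settings(s3_uri, max_payload_mb=6):
    """Build the size-related pieces of a CreateTransformJob request."""
    compression = 'Gzip' if s3_uri.endswith('.gz') else 'None'
    return {
        'MaxPayloadInMB': max_payload_mb,
        # Pack several lines into each request, up to MaxPayloadInMB.
        'BatchStrategy': 'MultiRecord',
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': s3_uri},
            },
            'ContentType': 'text/csv',
            'CompressionType': compression,
            # Split on newlines so a large file is not sent as one payload.
            'SplitType': 'Line',
        },
    }


settings = payload_settings(
    's3://my-sagemaker-blah/bikelearn/datasets/compressed/'
    '201611-citibike-tripdata.csv.gz')
print(settings['TransformInput']['CompressionType'])
```

These keys would be merged with the job name, model name, output location, and resources before calling `create_transform_job`.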

namoopsoo commented 5 years ago

look at a few of the months

import json

import numpy as np

def unpack_oneliners(concatenated):
    elements = concatenated[1:-1].split('}{')
    normalized = ['{' + x + '}' for x in elements]
    return [json.loads(x) for x in normalized]

def unpack_multiliners(concatenated):
    elements = concatenated.split('\n')
    elements_no_empties = [x for x in elements if x != '']
    return [json.loads(x) for x in elements_no_empties]

def unpack_concatenated(concatenated):
    if '}{' in concatenated:
        return unpack_oneliners(concatenated)
    elif '\n' in concatenated:
        return unpack_multiliners(concatenated)
    raise Exception('Shouldnt be here')

def read_file(fn):
    with open(fn) as fd:
        return fd.read()

local_filenames = [
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201603-citibike-tripdata.csv.out',
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201604-citibike-tripdata.csv.out',
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201605-citibike-tripdata.csv.out',
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201606-citibike-tripdata.csv.out',
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201607-citibike-tripdata.csv.out',
        '/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201608-citibike-tripdata.csv.out',]

batch_outputs = [read_file(x) for x in local_filenames]
unpacked = [unpack_concatenated(x) for x in batch_outputs]
rank_proba_scores = [
        [x.get("rank_k_proba_scores").get('10') for x in vec]
        for vec in unpacked]

monthly_means = [np.mean(vec) for vec in rank_proba_scores]

rank_k_proba_scores = [x.get("rank_k_proba_scores") for x in unpack_concatenated(concatenated_data)]

len(rank_k_proba_scores)
[x.get('10') for x in rank_k_proba_scores]
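
The split on `'}{'` above works for these outputs but would break on a file with a single record or on a string value that contains `}{`. A more general sketch uses `json.JSONDecoder.raw_decode`, which parses one object and reports where it ended. `unpack_any` is a hypothetical name, and the sample values below are made up purely to exercise the function.

```python
import json


def unpack_any(concatenated):
    """Decode back-to-back JSON objects, with or without newlines between them."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(concatenated):
        # raw_decode does not skip leading whitespace, so skip it here.
        if concatenated[pos].isspace():
            pos += 1
            continue
        record, pos = decoder.raw_decode(concatenated, pos)
        records.append(record)
    return records


# Made-up sample in the same shape as the batch outputs.
sample = ('{"rank_k_proba_scores": {"10": 0.5}}'
          '{"rank_k_proba_scores": {"10": 0.25}}\n')
print([x['rank_k_proba_scores']['10'] for x in unpack_any(sample)])
```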
namoopsoo commented 5 years ago

It looks like the same neighborhood class acts as a default of sorts, at least for each of the mini batches in `unpacked[-1]`, i.e. the last batch output, 201608.

Which class is 9?

import cPickle
import bikelearn.classify as blc

fn = '/Users/michal/Downloads/2018-12-07-update-model/2018-12-07-update-model/tree-foo-bundle-pensive-swirles.2018-12-04T210259ZUTC.pkl'

with open(fn, 'rb') as fd:
    bundle = cPickle.load(fd)

# label_encoder for the end neighborhood..

blc.label_decode(bundle['label_encoders']['end_neighborhood'], [9])

# under the hood this calls label_encoder.inverse_transform(vec)

* What the heck, it's `Central Park`, heh. Can this really be true? Pretty weird.
namoopsoo commented 5 years ago

Thoughts on next steps

namoopsoo commented 5 years ago

Now using the 201610 and later data.

(citilearnsage) $ head  ../data/citibike/201610-citibike-tripdata.csv
Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
328,2016-10-01 00:00:07,2016-10-01 00:05:35,471,Grand St & Havemeyer St,40.71286844,-73.95698119,3077,Stagg St & Union Ave,40.70877084,-73.95095259,25254,Subscriber,1992,1
398,2016-10-01 00:00:11,2016-10-01 00:06:49,3147,E 85 St & 3 Ave,40.77801203,-73.95407149,3140,1 Ave & E 78 St,40.77140426,-73.9535166,17810,Subscriber,1988,2
430,2016-10-01 00:00:14,2016-10-01 00:07:25,345,W 13 St & 6 Ave,40.73649403,-73.99704374,470,W 20 St & 8 Ave,40.74345335,-74.00004031,20940,Subscriber,1965,1
351,2016-10-01 00:00:21,2016-10-01 00:06:12,3307,West End Ave & W 94 St,40.7941654,-73.974124,3357,W 106 St & Amsterdam Ave,40.8008363,-73.9664492472,19086,Subscriber,1993,1
2693,2016-10-01 00:00:21,2016-10-01 00:45:15,3428,8 Ave & W 16 St,40.740983,-74.001702,3323,W 106 St & Central Park West,40.7981856,-73.9605909006,26502,Subscriber,1991,1
513,2016-10-01 00:00:28,2016-10-01 00:09:02,433,E 13 St & Avenue A,40.72955361,-73.98057249,151,Cleveland Pl & Spring St,40.722103786686034,-73.99724900722504,25800,Subscriber,1995,1
601,2016-10-01 00:00:51,2016-10-01 00:10:52,3314,W 95 St & Broadway,40.7937704,-73.971888,3374,Central Park North & Adam Clayton Powell Blvd,40.799484,-73.955613,15985,Subscriber,1972,2
563,2016-10-01 00:00:54,2016-10-01 00:10:18,453,W 22 St & 8 Ave,40.74475148,-73.99915362,485,W 37 St & 5 Ave,40.75038009,-73.98338988,26018,Subscriber,1984,1
439,2016-10-01 00:00:54,2016-10-01 00:08:13,534,Water - Whitehall Plaza,40.70255065,-74.0127234,360,William St & Pine St,40.70717936,-74.00887308,15374,Subscriber,1968,1
(citilearnsage) $ 
(citilearnsage) $ head  -2 ../data/citibike/201605-citibike-tripdata.csv
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
"538","5/1/2016 00:00:03","5/1/2016 00:09:02","536","1 Ave & E 30 St","40.74144387","-73.97536082","497","E 17 St & Broadway","40.73704984","-73.99009296","23097","Subscriber","1986","2"
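
Besides the timestamps, the two files above also differ in their header names (`Trip Duration` vs `tripduration`, `Bike ID` vs `bikeid`, and so on). Whether the existing code needed a fix for that is not stated in this thread; the following is only a sketch of one way to map either style onto the older names. `CANONICAL`, `LOOKUP`, and `normalize_header` are hypothetical names.

```python
# Older-style column names, as seen in the 201605 file.
CANONICAL = [
    'tripduration', 'starttime', 'stoptime',
    'start station id', 'start station name',
    'start station latitude', 'start station longitude',
    'end station id', 'end station name',
    'end station latitude', 'end station longitude',
    'bikeid', 'usertype', 'birth year', 'gender',
]


def _key(name):
    """Compare names ignoring case and spaces."""
    return name.replace(' ', '').lower()


LOOKUP = {_key(name): name for name in CANONICAL}


def normalize_header(columns):
    """Map each column to its older-style name; unknown columns pass through."""
    return [LOOKUP.get(_key(col), col) for col in columns]


new_style = ('Trip Duration,Start Time,Stop Time,Start Station ID,'
             'Start Station Name,Start Station Latitude,'
             'Start Station Longitude,End Station ID,End Station Name,'
             'End Station Latitude,End Station Longitude,Bike ID,User Type,'
             'Birth Year,Gender').split(',')
print(normalize_header(new_style))
```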
namoopsoo commented 5 years ago

Quick recap

* The files for `201610`, `201611`, `201612` have, as far as I can tell, changed their date format.
* I corrected the Docker code that reads dates (going from `pensive-swirles-2-11` to `pensive-swirles-2-12`) and the problem went away.
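
The actual fix is not shown in this thread. As a rough sketch, a parser that accepts both timestamp styles seen in the files above (`5/1/2016 00:00:03` and `2016-10-01 00:00:07`) could try each format in turn. `parse_trip_time` and `FORMATS` are hypothetical names, and only the two formats visible here are covered; other months may use yet another variant.

```python
from datetime import datetime

# The two timestamp styles visible in the 201605 and 201610 files.
FORMATS = ('%m/%d/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S')


def parse_trip_time(text):
    """Parse a trip timestamp in either known format, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized trip timestamp: %r' % (text,))


print(parse_trip_time('5/1/2016 00:00:03'))
print(parse_trip_time('2016-10-01 00:00:07'))
```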

#### But furthermore...
* I then automated the batch transform with the API, but the first job, `Batch-Transform-2019-01-06-233207`, failed because I forgot to include my special environment variable, `DO_VALIDATION`, which indicates validation.
* I added it for the job `Batch-Transform-2019-01-06-235825` and that one was successful.
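
To avoid forgetting the variable again, the request could be built by a helper that refuses to proceed without it. This is a sketch, not the code in `update_client`: `build_transform_request` and `REQUIRED_ENV` are hypothetical, the model name, output path, instance type, and the `DO_VALIDATION` value are placeholders, and the job-name pattern is only inferred from the job names in this thread. `Environment` itself is a real `CreateTransformJob` field whose values must be strings.

```python
from datetime import datetime, timezone

REQUIRED_ENV = ('DO_VALIDATION',)


def batch_job_name(now):
    """Timestamped name in the style of the job names seen in this thread."""
    return now.strftime('Batch-Transform-%Y-%m-%d-%H%M%S')


def build_transform_request(model_name, input_uri, output_uri,
                            environment, now=None):
    """Build CreateTransformJob kwargs, failing early if env vars are missing."""
    missing = [key for key in REQUIRED_ENV if key not in environment]
    if missing:
        raise ValueError('missing environment variables: ' + ', '.join(missing))
    now = now or datetime.now(timezone.utc)  # UTC is an assumption
    return {
        'TransformJobName': batch_job_name(now),
        'ModelName': model_name,
        'Environment': dict(environment),
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': input_uri},
            },
        },
        'TransformOutput': {'S3OutputPath': output_uri},
        # Placeholder resources; pick whatever the job actually needs.
        'TransformResources': {'InstanceType': 'ml.m4.xlarge',
                               'InstanceCount': 1},
    }


request = build_transform_request(
    model_name='my-model',  # placeholder
    input_uri='s3://my-sagemaker-blah/bikelearn/datasets/compressed/'
              '201611-citibike-tripdata.csv.gz',
    output_uri='s3://my-sagemaker-blah/bikelearn/output/',  # placeholder
    environment={'DO_VALIDATION': 'yes'},  # placeholder value
    now=datetime(2019, 1, 6, 23, 58, 25),
)
print(request['TransformJobName'])
```

The resulting dict would be passed as `boto3.client('sagemaker').create_transform_job(**request)`.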

#### but then the log ...
* The log for `Batch-Transform-2019-01-06-235833` has a mixed bag of successes and timeouts:

2019/01/07 00:02:10 [error] 9#9: *10 upstream timed out (110: Connection timed out) while sending request to upstream, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/invocations", host: "169.254.255.131:8080"


* And I don't see the usual output file for this on S3, so I don't think it finished.
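
Rather than inferring completion from the S3 output, the job status can be polled directly. The sketch below assumes a boto3 SageMaker client is passed in; `describe_transform_job` and its `TransformJobStatus` / `FailureReason` fields are part of that API, while `wait_for_jobs` is a hypothetical helper.

```python
import time

# Terminal states of a transform job.
FINISHED = ('Completed', 'Failed', 'Stopped')


def wait_for_jobs(client, job_names, poll_seconds=30, sleep=time.sleep):
    """Poll until every job reaches a terminal state.

    Returns {job_name: (status, failure_reason_or_None)}.
    """
    pending = list(job_names)
    final = {}
    while pending:
        still_running = []
        for name in pending:
            info = client.describe_transform_job(TransformJobName=name)
            status = info['TransformJobStatus']
            if status in FINISHED:
                final[name] = (status, info.get('FailureReason'))
            else:
                still_running.append(name)
        pending = still_running
        if pending:
            sleep(poll_seconds)
    return final
```

Usage would look like `wait_for_jobs(boto3.client('sagemaker'), ['Batch-Transform-2019-01-06-235833'])`, after which any job reported as `Failed` carries its reason alongside.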
namoopsoo commented 5 years ago

Batch transform, automated since there are so many batch transforms I wanted to run.

import update_client as uc

inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz', '201701-citibike-tripdata.csv.gz']
for inputfile in inputs:
    print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
for inputfile in inputs:
    path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
    print path
    print uc.start_batch_transform_job(input_location=path)
    import time
    time.sleep(2)

In [17]: reload(uc)
Out[17]: <module 'update_client' from 'update_client.py'>

In [18]: inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz
    ...: ', '201701-citibike-tripdata.csv.gz']

In [19]: for inputfile in inputs:
    ...:     print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
    ...: 
    ...:     
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz

In [20]: for inputfile in inputs:
    ...:     path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfil
    ...: e
    ...:     print path
    ...:     print uc.start_batch_transform_job(input_location=path)
    ...:     import time
    ...:     time.sleep(2)
    ...:     
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001813
created transform job with name:  Batch-Transform-2019-01-07-001813
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001815
created transform job with name:  Batch-Transform-2019-01-07-001815
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001818
created transform job with name:  Batch-Transform-2019-01-07-001818
None

In [21]: