root@9984a296c093:/opt/program# python train
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
Exception("unknown dtype<type 'float'>",)
> /opt/conda/lib/python2.7/site-packages/bikelearn/classify.py(78)build_label_encoders_from_df()
77 else:
---> 78 raise Exception, 'unknown dtype' + str(dtype)
79
ipdb> pp feature_encoding
['start_postal_code', 'start_sublocality', 'start_neighborhood', 'end_neighborhood']
ipdb> pp feature in feature_encoding
True
ipdb> pp feature
'start_postal_code'
ipdb> pp dtype
<type 'float'>
ipdb> pp dtype == float
True
ipdb> pp df[feature].head().values
array([11249., 11249., 11249., 11249., 11249.])
ipdb>
#### ok... changed that up
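Roughly the kind of change, sketched (hedged: this is not the actual `bikelearn.classify` source, and `build_label_encoder` is an illustrative name; the float case just normalizes values like `11249.0` to strings before fitting):

```python
from sklearn.preprocessing import LabelEncoder

def build_label_encoder(df, feature, dtype):
    # illustrative sketch, not the real build_label_encoders_from_df
    if dtype == str:
        values = df[feature].fillna('nan').astype(str)
    elif dtype == float:
        # e.g. postal codes parsed as 11249.0; normalize before encoding
        values = df[feature].fillna(-1).astype(int).astype(str)
    else:
        raise Exception('unknown dtype ' + str(dtype))
    return LabelEncoder().fit(values)
```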
#### also, from a quick run earlier...
```python
In [54]: clf = RandomForestClassifier(max_depth=2, random_state=0)
In [55]: clf.fit(X_train, y_train)
Out[55]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
In [57]: print clf.feature_importances_
[0.20919847 0.29880775 0.37548998 0.02124216 0.03291293 0.03416068
 0.02818804]
```
* ran `python train` in the container manually.
root@d19ff4f4aa5e:/opt/program# python train
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
root@d19ff4f4aa5e:/opt/program# ipython
...
...
In [1]: import cPickle
...
In [10]: modelfn = '/opt/ml/model/tree-foo-bundle.pkl'
In [11]: model = cPickle.load(open(modelfn))
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
In [12]: model
Out[12]:
{'bundle_name': 'tree-foo-bundle.pkl',
 'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
 'model_id': 'tree-foo'}

In [13]: model['clf'].feature_importances_
Out[13]:
array([0.34746472, 0.34359613, 0.22347966, 0.00633857, 0.01033549,
       0.03439006, 0.03439537])
#### interpreting those importances again..
```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood', 'start_day', 'start_hour', 'age', 'gender']
```
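Pairing the importances printed above with these column names, just to read them off (a quick sketch):

```python
cols = ['start_postal_code', 'start_sublocality', 'start_neighborhood',
        'start_day', 'start_hour', 'age', 'gender']
importances = [0.20919847, 0.29880775, 0.37548998, 0.02124216,
               0.03291293, 0.03416068, 0.02818804]
# sort features by importance, descending
for name, imp in sorted(zip(cols, importances), key=lambda p: -p[1]):
    print('{}: {:.3f}'.format(name, imp))
```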
* `start_day`, `start_hour` are not impacting the decision whatsoever.
* Trained on a 2015/10 set and did a test using a 2016/01 set,
(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml --rm citibike-learn-blah train
/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
!missing stations: set(['Center Blvd\xc2\xa0& Borden Ave'])
(citilearnsage) $
(citilearnsage) $ ls -alrt local_test/test_dir/model/
total 184
drwxr-xr-x@ 5 michal staff 170 Jul 15 14:32 ..
-rw-r--r--@ 1 michal staff 6881 Jul 15 15:52 decision-tree-model.pkl
-rw-r--r--@ 1 michal staff 37030 Jul 29 15:15 tree-foo-bundle.2018-07-29T1915ZUTC.pkl
-rw-r--r--@ 1 michal staff 44124 Aug 12 13:12 tree-foo-bundle.2018-08-12T171255ZUTC.pkl
In [1]: fn = 'sagemaker/local_test/test_dir/model/tree-foo-bundle.2018-08-12T171255ZUT
...: C.pkl'
...:
In [2]: import cPickle
...: with open(fn) as fd:
...: bundle = cPickle.load(fd)
...: bundle
...:
Out[2]:
{'bundle_name': 'tree-foo-bundle',
'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False),
'clf_info': {'feature_importances': [('start_postal_code',
0.28489345066586647),
('start_sublocality', 0.27812701364478254),
('start_neighborhood', 0.24554879740470287),
('start_day', 0.022793741554315638),
('start_hour', 0.004790427541855301),
('age', 0.002772245039587746),
('gender', 0.0739331660713863),
('usertype', 0.08714115807750318)]},
'evaluation': {'validation_proportion_correct': 0.9759847738514625},
'features': {'dtypes': {'age': float,
'end_neighborhood': str,
'start_neighborhood': str,
'start_postal_code': str,
'start_sublocality': str,
'usertype': str},
'input': ['start_postal_code',
'start_sublocality',
'start_neighborhood',
'start_day',
'start_hour',
'age',
'gender',
'usertype'],
'output_label': 'end_neighborhood'},
'label_encoders': {'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()},
'model_id': 'tree-foo',
'timestamp': '2018-08-12T171255ZUTC',
'train_metadata': {'stations_df_fn': '/opt/ml/input/config/start_stations_103115.csv',
'trainset_fn': '/opt/ml/input/data/training/train.2018-07-28T210403.csv'}}
In [3]: import pandas as pd
...: holdout_df = pd.read_csv('/......./data/citibike/201601-citibike-tripdata.csv')
...: holdout_df.shape
...: import bikelearn.classify as blc
...:
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
In [6]: import bikelearn.settings as s
...:
In [8]: import os
In [9]: stations_fn = os.path.join(s.DATAS_DIR, 'start_stations_103115.fuller.
...: csv')
...: stations_df = pd.read_csv(stations_fn, index_col=0, dtype={'postal_cod
...: e': str})
...:
In [10]: %time y_predictions, y_test = blc.run_model_predict(bundle, holdout_df, stati
...: ons_df)
CPU times: user 1min 9s, sys: 1.4 s, total: 1min 10s
Wall time: 59.3 s
In [11]: y_predictions.shape, y_test.shape
Out[11]: ((501173,), (501173, 1))
In [12]: zipped = zip(y_predictions, [v[0] for v in y_test])
In [13]: zipped[:5]
Out[13]: [(8, 0), (8, 0), (8, 0), (8, 0), (8, 0)]
In [14]: correct = len([[x,y] for x,y in zipped if x == y])
...:
In [15]: proportion_correct = 1.0*correct/y_predictions.shape[0]
In [16]: proportion_correct
Out[16]: 0.011076015667244645
# docker run <image> serve
# docker run -v $(pwd)/test_dir:/opt/ml --rm ${image} train
docker run -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah
docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
#### oops... kept getting a "not found" `OSError` when running `serve`
* but it was because I forgot to uncomment the line to install `gunicorn`
Traceback (most recent call last):
File "./serve", line 71, in
#### try again...
(citilearnsage) $ docker run -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
Starting the inference server with 4 workers.
[2018-09-09 15:17:28 +0000] [13] [INFO] Starting gunicorn 19.9.0
[2018-09-09 15:17:28 +0000] [13] [INFO] Listening at: unix:/tmp/gunicorn.sock (13)
[2018-09-09 15:17:28 +0000] [13] [INFO] Using worker: gevent
[2018-09-09 15:17:28 +0000] [17] [INFO] Booting worker with pid: 17
[2018-09-09 15:17:28 +0000] [18] [INFO] Booting worker with pid: 18
[2018-09-09 15:17:28 +0000] [20] [INFO] Booting worker with pid: 20
[2018-09-09 15:17:28 +0000] [22] [INFO] Booting worker with pid: 22
* ok nice, workers spawned now.
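Quick smoke test against the locally served container (`/ping` is the health check SageMaker itself uses, per Ref [1] below):

```python
import requests

# expect a 200 once the gunicorn workers are up
print(requests.get('http://localhost:8080/ping').status_code)
```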
### Ref
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
$(aws --profile myblahprofile ecr get-login --no-include-email --region us-east-1)
docker tag citibike-learn-blah:latest citibike-learn-blah:0.1.0
docker tag citibike-learn-blah:latest xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest
docker push xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:0.1.0
# serve locally..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml citibike-learn-blah serve
# run interactively for local debugging..
docker run -p 8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml -t -i citibike-learn-blah
* Got a `ValidationException` when creating a new custom SageMaker model,
# custom role,
arn:aws:iam::<account id>:role/mySageMakerFullRole
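For reference, a minimal boto3 sketch of that model-creation step (hedged: the image URI and role ARN are the placeholders from above, not real values):

```python
import boto3

# sketch: create the custom SageMaker model pointing at the ECR image;
# the ExecutionRoleArn is the custom role, presumably the fix for the ValidationException
sm = boto3.client('sagemaker', region_name='us-east-1')
sm.create_model(
    ModelName='citibike-learn-blah',
    PrimaryContainer={
        'Image': 'xxxx.dkr.ecr.us-east-1.amazonaws.com/citibike-learn-blah:latest',
    },
    ExecutionRoleArn='arn:aws:iam::<account id>:role/mySageMakerFullRole',
)
```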
* then hitting the API, the output is always `8`,
In [67]: for i in range(10):
...: data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
...: print data, call_api(data, url, headers).json()
...:
{'blah': {'start_station': 'Broadway & W 55 St', 'start_time': '10/8/2015 11:01:24', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Barclay St & Church St', 'start_time': '10/23/2015 11:55:30', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Broadway & W 58 St', 'start_time': '10/15/2015 19:28:09', 'rider_gender': '0', 'rider_type': 'Customer', 'birth_year': ''}} {u'output': u'8\n'}
{'blah': {'start_station': 'Berry St & N 8 St', 'start_time': '10/5/2015 11:34:35', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1973'}} {u'output': u'8\n'}
{'blah': {'start_station': 'W 13 St & 7 Ave', 'start_time': '10/17/2015 20:57:15', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1970'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Broadway & W 60 St', 'start_time': '10/14/2015 17:03:13', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1985'}} {u'output': u'8\n'}
{'blah': {'start_station': 'State St & Smith St', 'start_time': '10/17/2015 09:58:27', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1985'}} {u'output': u'8\n'}
{'blah': {'start_station': 'W 37 St & 10 Ave', 'start_time': '10/6/2015 18:02:11', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1973'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Stanton St & Chrystie St', 'start_time': '10/22/2015 15:23:09', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1989'}} {u'output': u'8\n'}
{'blah': {'start_station': 'Norfolk St & Broome St', 'start_time': '10/7/2015 22:32:43', 'rider_gender': '1', 'rider_type': 'Subscriber', 'birth_year': '1954'}} {u'output': u'8\n'}
* still `8` every time.
In [68]: outputs = []
In [69]: traindf.shape
Out[69]: (969821, 16)
In [71]: for _ in range(100):
...: i = random.randint(1111, 969820)
...: data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
...: # print data, call_api(data, url, headers).json()
...: out = call_api(data, url, headers)
...: try:
...: out_json = out.json()
...: except ValueError:
...: out_json = {'text': out.text}
...: except Exception as e:
...: out_json = {'text': out.text, 'e': str(e.message)}
...: outputs.append({'data': data, 'out_json': out_json, 'i': i})
...:
...:
In [72]: len(outputs)
Out[72]: 100
In [73]: outputs[0]
Out[73]:
{'data': {'blah': {'birth_year': '1993',
'rider_gender': '1',
'rider_type': 'Subscriber',
'start_station': 'Washington Pl & Broadway',
'start_time': '10/28/2015 13:09:23'}},
'i': 875240,
'out_json': {u'output': u'8\n'}}
In [75]: not8 = [x for x in outputs if x['out_json']['output'] != u'8\n']
In [76]: len(not8)
Out[76]: 0
In [79]: bundle['label_encoders']
Out[79]:
{'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()}
In [80]: bundle['label_encoders']['end_neighborhood']
Out[80]: LabelEncoder()
In [81]: vars(bundle['label_encoders']['end_neighborhood'])
Out[81]:
{'classes_': array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
'Long Island City', 'Williamsburg', 'nan'], dtype=object)}
In [82]: len(vars(bundle['label_encoders']['end_neighborhood'])['classes_'])
Out[82]: 9
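Only 9 classes (including the `'-1'` and `'nan'` sentinels) were fit, so any value the encoder never saw would make `transform` raise; a hedged sketch of a guard for that (`safe_label_transform` is my name, not a `bikelearn` function):

```python
def safe_label_transform(le, values):
    # map values the fitted LabelEncoder never saw to the '-1' missing class
    known = set(le.classes_)
    return le.transform([v if v in known else '-1' for v in values])
```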
(Pdb) len(stations_df.neighborhood.value_counts())
23
(Pdb) stations_df.neighborhood.value_counts()
Lower Manhattan 116
Midtown 113
Williamsburg 42
Bedford-Stuyvesant 34
Upper East Side 20
Upper West Side 17
Greenpoint 15
Downtown Brooklyn 14
Fort Greene 13
Long Island City 13
Clinton Hill 10
Brooklyn Heights 10
Lower East Side 6
Midtown East 6
Midtown West 3
Hell's Kitchen 3
Lincoln Square 3
Central Park 3
Boerum Hill 2
East Village 2
Dumbo 2
Alphabet City 1
Yorkville 1
Name: neighborhood, dtype: int64
(Pdb)
* also.. the stations df has `13/465` which are `nan`. Not ideal.
```python
(Pdb) pp stations_df.head()
station_name postal_code sublocality neighborhood state
0 W 26 St & 10 Ave 10001 Manhattan Midtown NY
1 E 39 St & 2 Ave 10016 Manhattan Midtown NY
2 8 Ave & W 52 St 10019 Manhattan Midtown NY
3 Sullivan St & Washington Sq 10012 Manhattan Lower Manhattan NY
4 Bedford Ave & Nassau Ave 11222 Brooklyn Greenpoint NY
(Pdb) pp stations_df[stations_df.neighborhood.isnull()].shape
(13, 5)
(Pdb) pp stations_df[stations_df.neighborhood.isnull()]
station_name postal_code sublocality neighborhood state
29 Clermont Ave & Lafayette Ave 11238 Brooklyn NaN NY
45 Montrose Ave & Bushwick Ave 11206 Brooklyn NaN NY
66 Nassau St & Navy St 11201 Brooklyn NaN NY
199 FDR Drive & E 35 St 10016 Manhattan NaN NY
203 Fulton St & Rockwell Pl 11201 Brooklyn NaN NY
230 York St & Jay St 11201 Brooklyn NaN NY
233 Carlton Ave & Flushing Ave 11205 Brooklyn NaN NY
328 Central Park West & W 68 St 10023 Manhattan NaN NY
336 Front St & Gold St 11201 Brooklyn NaN NY
392 Sands St & Navy St 11201 Brooklyn NaN NY
396 3 Ave & Schermerhorn St 11217 Brooklyn NaN NY
412 Clinton Ave & Flushing Ave 11205 Brooklyn NaN NY
440 Railroad Ave & Kay Ave 11249 Brooklyn NaN NY
```
In [98]: stations_df = bundle['train_metadata']['stations_df']
In [99]: stations_df.shape
Out[99]: (462, 6)
In [100]: stations_df.head()
Out[100]:
Unnamed: 0 station_name postal_code sublocality neighborhood state
0 0 W 26 St & 10 Ave NaN Manhattan NaN NY
1 1 E 39 St & 2 Ave NaN Manhattan NaN NY
2 2 8 Ave & W 52 St NaN Manhattan NaN NY
3 3 Sullivan St & Washington Sq NaN Manhattan NaN NY
4 4 Bedford Ave & Nassau Ave NaN Brooklyn NaN NY
In [101]: stations_df[stations_df.neighborhood.isnull()].shape
Out[101]: (421, 6)
* so it's not like the `nan` problem is fixed; lots of `nan`s in there though.
url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests

def call_api(data, url, headers):
    out = requests.post(url, json=data, headers=headers)
    return out

outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})
In [114]: Counter([x['out_json']['output'].strip() for x in outputs])
Out[114]: Counter({u'17': 100})

In [116]: with open('/Users/michal/Downloads/2018-10-21-comparingmodels/job-7/build/tree-foo-bundle.2018-10-21T205144ZUTC.pkl') as fd:
     ...:     winterbundle = cPickle.load(fd)
     ...:

In [124]: winterbundle['label_encoders']['end_neighborhood'].classes_
Out[124]:
array(['-1', 'Alphabet City', 'Bedford-Stuyvesant', 'Boerum Hill',
       'Brooklyn Heights', 'Central Park', 'Clinton Hill',
       'Downtown Brooklyn', 'Dumbo', 'East Village', 'Fort Greene',
       'Greenpoint', "Hell's Kitchen", 'Lincoln Square', 'Long Island City',
       'Lower East Side', 'Lower Manhattan', 'Midtown', 'Midtown East',
       'Midtown West', 'Murray Hill', 'Navy Yard', 'Upper East Side',
       'Upper West Side', 'Vinegar Hill', 'Williamsburg', 'Yorkville'],
      dtype=object)

In [125]: len(winterbundle['label_encoders']['end_neighborhood'].classes_)
Out[125]: 27

In [126]: winterbundle['evaluation']
Out[126]: {'validation_proportion_correct': 0.46358773861778346}

In [127]: simpledf.shape
Out[127]: (969384, 31)

In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}
In [156]: foo = call_api(data, url, headers)
* So yea, looks like that works. And ran it on the full dataset too, though still with the `nans` problem,
In [138]: %time y_predictionsmany, y_testmany = blc.run_model_predict(winterbundle, atraindf, fullerstations_df, True)
CPU times: user 2min 7s, sys: 4.41 s, total: 2min 11s
Wall time: 2min 9s

In [139]: len(y_predictionsmany), len(y_testmany)
Out[139]: (969384, 969384)

In [140]: %paste
def get_basic_proportion_correct(y_test, y_predictions):
    zipped = zip(y_test, y_predictions)
    correct = len([[x,y] for x,y in zipped if x == y])
    proportion_correct = 1.0*correct/y_test.shape[0]
    return proportion_correct

In [141]: get_basic_proportion_correct(y_predictionsmany, y_testmany)
Out[141]: 0.4634850585526479

In [142]: Counter(y_predictionsmany[:10])
Out[142]: Counter({16: 1, 17: 9})

In [143]: Counter(y_predictionsmany)
Out[143]: Counter({16: 123450, 17: 811252, 25: 34682})
* also, the actual label distribution in `y_testmany`,
```python
In [152]: print dict(Counter([x[0] for x in y_testmany]))
{1: 1231, 2: 9011, 3: 2798, 4: 7428, 5: 6826, 6: 6835, 7: 12503, 8: 4165, 9: 6277, 10: 9415, 11: 9750, 12: 7191, 13: 5080, 14: 5628, 15: 12146, 16: 322132, 17: 409018, 18: 12325, 19: 9641, 20: 2444, 21: 1467, 22: 39364, 23: 34895, 24: 629, 25: 29848, 26: 1337}
```
url = 'https://rmuxqpksz2.execute-api.us-east-1.amazonaws.com/default/myBikelearnSageLambda'
import requests
def call_api(data, url, headers):
    out = requests.post(url, json=data, headers=headers)
    return out
In [154]: %history 109
outputs = []
for _ in range(100):
    i = random.randint(1111, 969820)
    data = make_data_dict_from_row_data(traindf.iloc[i].to_dict())
    # print data, call_api(data, url, headers).json()
    out = call_api(data, url, headers)
    try:
        out_json = out.json()
    except ValueError:
        out_json = {'text': out.text}
    except Exception as e:
        out_json = {'text': out.text, 'e': str(e.message)}
    outputs.append({'data': data, 'out_json': out_json, 'i': i})
In [155]: print data
{'blah': {'start_station': 'Forsyth St & Broome St', 'start_time': '10/8/2015 18:04:57', 'rider_gender': '2', 'rider_type': 'Subscriber', 'birth_year': '1973'}}
In [156]: foo = call_api(data, url, headers)
In [157]: foo
Out[157]: <Response [200]>
In [158]: print foo.json()
{u'output': u'17\n'}
In [160]: call_api(data, url, {})
Out[160]: <Response [200]>
In [161]: call_api(data, url, {}).json()
Out[161]: {u'output': u'17\n'}
> bikelearn/tests/test_metrics.py(22)test_basic()
-> pass
(Pdb) pp zip(y_test, sorted_outputs)
[(25, [17, 16, 25, 22, 23]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 22, 23, 18]),
(16, [17, 16, 25, 7, 22]),
(16, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 25]),
(2, [17, 16, 25, 7, 22]),
(25, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 25]),
(16, [17, 16, 22, 23, 25]),
(25, [17, 16, 25, 7, 22]),
(25, [17, 16, 22, 23, 25]),
(22, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 25]),
(17, [17, 16, 22, 23, 18])]
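What that pdb output is eyeballing is a top-k hit rate; a minimal sketch of that metric (my naming, not the `bikelearn.metrics_utils` API):

```python
def rank_k_accuracy(pairs, k=5):
    # pairs is the zip(y_test, sorted_outputs) structure printed above;
    # count rows whose true label appears among the top-k predictions
    hits = [1 for y_true, topk in pairs if y_true in topk[:k]]
    return 1.0 * len(hits) / len(pairs)
```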
* don't want `'nan'` calculated in the proportion correct data..
!missing stations: set(['MacDougal St & Prince St', 'Washington Square E', 'E 81 St & York Ave', 'Schermerhorn St & Court St', 'E 77 St & Park Ave', 'Center Blvd & Borden Ave', 'E 77 St & 3 Ave', 'E 71 St & 2 Ave', 'University Pl & E 8 St', 'Leonard St & Meeker Ave', 'PABT Valet', 'Broadway & Roebling St', 'E 80 St & 2 Ave', 'Union Ave & N 12 St', 'Monroe St & Tompkins Ave', 'E 40 St & 5 Ave', 'E 58 St & 1 Ave'])
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(14)do_validation()
13 classes = clf.classes_
---> 14 y_predict_proba = clf.predict_proba(X_validation)
15
ipdb> pp classes
array([1, 2, 3, 4, 5, 6, 7, 8])
ipdb> n
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(16)do_validation()
15
---> 16 metrics = gather_metrics(y_validation, y_predictions, y_predict_proba, classes)
17 return metrics
ipdb> from collections import Counter
ipdb> Counter(y_predictions)
Counter({8: 100235})
ipdb> Counter(y_validation)
Counter({8: 97977, 7: 862, 5: 364, 4: 357, 6: 255, 1: 174, 3: 130, 2: 116})
ipdb>
ipdb> pp skm.confusion_matrix(y_validation, y_predictions, classes)
array([[ 0, 0, 0, 0, 0, 0, 0, 174],
[ 0, 0, 0, 0, 0, 0, 0, 116],
[ 0, 0, 0, 0, 0, 0, 0, 130],
[ 0, 0, 0, 0, 0, 0, 0, 357],
[ 0, 0, 0, 0, 0, 0, 0, 364],
[ 0, 0, 0, 0, 0, 0, 0, 255],
[ 0, 0, 0, 0, 0, 0, 0, 862],
[ 0, 0, 0, 0, 0, 0, 0, 97977]])
ipdb>
* tried checking for `nan`, but doh, that doesn't work because that is pre-label encoding!
> /opt/conda/lib/python2.7/site-packages/bikelearn/metrics_utils.py(48)get_proportion_correct()
47 zipped = zip(y_validation, y_predictions_validation)
---> 48 correct = len([[x,y] for x,y in zipped if x in y and y != 'nan'])
49 proportion_correct = 1.0*correct/y_validation.shape[0]
ipdb> pp zipped[:3]
[(8, [8]), (8, [8]), (8, [8])]
ipdb>
#### here..
```python
ipdb> pp label_encoders
{'age': LabelEncoder(),
'end_neighborhood': LabelEncoder(),
'start_neighborhood': LabelEncoder(),
'start_postal_code': LabelEncoder(),
'start_sublocality': LabelEncoder(),
'usertype': LabelEncoder()}
ipdb> pp label_encoders['end_neighborhood']
LabelEncoder()
ipdb> pp label_encoders['end_neighborhood'].classes_
array(['-1', 'Bedford-Stuyvesant', 'Brooklyn Heights',
'Downtown Brooklyn', 'Fort Greene', 'Greenpoint',
'Long Island City', 'Williamsburg', 'nan'], dtype=object)
ipdb>
assert not any(['nan' in le.classes_ for le in label_encoders.values()])
```
import bikelearn.settings as s
s.GOOGLE_GEO_API_KEY
address = "W 26 St & 10 Ave"
import bikelearn.get_station_geolocation_data as getgeo
address = "W 26 St & 10 Ave"
data = getgeo.get_geocoding_results(address)
help(getgeo.get_geocoding_results)
getgeo.get_geocoding_results??
import ipdb
getgeo.get_geocoding_results??
data = ipdb.runcall(getgeo.get_geocoding_results, address, request_type='geo')
## stations file
stations_json_filename = \
'datas/start_stations_103115.fuller.csv'
stationsdf = pd.read_csv(stations_json_filename, index_col=0)
foo = ipdb.runcall(getgeo.get_station_geoloc_data, stationsdf=stationsdf.iloc[:5])
In [84]: validsdf.levels.value_counts()
Out[84]:
4 428
2 301
3 95
1 56
0 3
Name: levels, dtype: int64
In [85]: validsdf['type'] = validsdf['address'].map(lambda x: 'geo' if x.endswith('NY'
...: ) else 'latlng')
In [86]: validsdf[validsdf['type'] == 'geo'].levels.value_counts()
Out[86]:
2 300
3 82
1 56
4 21
0 3
Name: levels, dtype: int64
In [87]: validsdf[validsdf['type'] == 'latlng'].levels.value_counts()
Out[87]:
4 407
3 13
2 1
Name: levels, dtype: int64
In [99]: getgeo.redis_client.hdel(s.GEO_RAW_RESULTS, *addresses_todelete)
Out[99]: 441
ipdb> pp geocoding_result
{u'error_message': u'You have exceeded your daily request quota for this API. If you did not set a custom daily request quota, verify your project has an active billing account: http://g.co/dev/maps-no-account',
u'results': [],
u'status': u'OVER_QUERY_LIMIT'}
`E 3 St & 1 Ave, NY`, for instance ...
ipdb> pp raw_results_list
[{u'address_components': [{u'long_name': u'East 3rd Street',
u'short_name': u'E 3rd St',
u'types': [u'route']},
{u'long_name': u'Manhattan',
u'short_name': u'Manhattan',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'New York',
u'short_name': u'New York',
u'types': [u'locality', u'political']},
{u'long_name': u'New York County',
u'short_name': u'New York County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country', u'political']}],
u'formatted_address': u'E 3rd St, New York, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.7267011,
u'lng': -73.9778606},
u'southwest': {u'lat': 40.7203954,
u'lng': -73.99186399999999}},
u'location': {u'lat': 40.7236548, u'lng': -73.9852944},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.7267011,
u'lng': -73.9778606},
u'southwest': {u'lat': 40.7203954,
u'lng': -73.99186399999999}}},
u'place_id': u'ChIJuWov94JZwokRy-aQnT-VklI',
u'types': [u'route']}]
stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])
commit 438e425482db1c105ae6a22b8248696d0f91dfef
Date: Mon Oct 2 17:56:15 2017 -0400
...
* `&` characters are, I think, treated as query string parameters. Grr...
stationsdf = ipdb.runcall(getgeo.get_station_geoloc_data, stations[:50])
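One way around that is to let `requests` percent-encode the `&` by passing the address through `params` (a hedged sketch; the endpoint is the standard Maps geocoding URL, and the API key is a placeholder):

```python
import requests

def geocode(address, api_key):
    # passing the address via params percent-encodes '&' as %26,
    # so it is not parsed as a query-string separator
    resp = requests.get(
        'https://maps.googleapis.com/maps/api/geocode/json',
        params={'address': address, 'key': api_key})
    return resp.json()

out = geocode('Clinton Ave & Flushing Ave, NY', 'MY_PLACEHOLDER_KEY')
```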
In [167]: getgeo.extract_lat_lng_from_response(out.json()['results'])
Out[167]: {u'lat': 40.7957399, u'lng': -73.93892129999999}
* `'Clinton Ave & Flushing Ave'`, for instance. Should determine how many are like this and then perhaps move a mile somewhere and try to get a closest-match neighborhood if possible.
stations = getgeo.extract_stations_from_files(filenames=[
    "201510-citibike-tripdata.csv", "201601-citibike-tripdata.csv",
    ...
    ])
{u'plus_code': {u'compound_code': u'M2XJ+44 New York, NY, USA',
u'global_code': u'87G8M2XJ+44'},
u'results': [{u'address_components': [{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Clinton Ave & Flushing Ave, United States',
u'geometry': {u'location': {u'lat': 40.69794,
u'lng': -73.96986849999999},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.6992889802915,
u'lng': -73.9685195197085},
u'southwest': {u'lat': 40.6965910197085,
u'lng': -73.9712174802915}}},
u'place_id': u'ChIJKwagFMZbwokRtjuK5VRvb-E',
u'plus_code': {u'compound_code': u'M2XJ+53 New York, United States',
u'global_code': u'87G8M2XJ+53'},
u'types': [u'establishment', u'point_of_interest']},
{u'address_components': [{u'long_name': u'164',
u'short_name': u'164',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'164 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'location': {u'lat': 40.6976577,
u'lng': -73.9699304},
u'location_type': u'ROOFTOP',
u'viewport': {u'northeast': {u'lat': 40.6990066802915,
u'lng': -73.9685814197085},
u'southwest': {u'lat': 40.6963087197085,
u'lng': -73.9712793802915}}},
u'place_id': u'ChIJ3TPWP8ZbwokR8dlF4H6Vr8Q',
u'plus_code': {u'compound_code': u'M2XJ+32 New York, United States',
u'global_code': u'87G8M2XJ+32'},
u'types': [u'street_address']},
{u'address_components': [{u'long_name': u'168',
u'short_name': u'168',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'168 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6976934,
u'lng': -73.96936649999999},
u'southwest': {u'lat': 40.697549,
u'lng': -73.969557}},
u'location': {u'lat': 40.6976345,
u'lng': -73.969481},
u'location_type': u'ROOFTOP',
u'viewport': {u'northeast': {u'lat': 40.6989701802915,
u'lng': -73.96811276970848},
u'southwest': {u'lat': 40.6962722197085,
u'lng': -73.9708107302915}}},
u'place_id': u'ChIJ43kQasZbwokRMEsGfYDYL6Y',
u'types': [u'premise']},
{u'address_components': [{u'long_name': u'99',
u'short_name': u'99',
u'types': [u'street_number']},
{u'long_name': u'Flushing Avenue',
u'short_name': u'Flushing Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'99 Flushing Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'location': {u'lat': 40.6978344,
u'lng': -73.96973249999999},
u'location_type': u'RANGE_INTERPOLATED',
u'viewport': {u'northeast': {u'lat': 40.69918338029149,
u'lng': -73.96838351970848},
u'southwest': {u'lat': 40.69648541970849,
u'lng': -73.9710814802915}}},
u'place_id': u'Eig5OSBGbHVzaGluZyBBdmUsIEJyb29rbHluLCBOWSAxMTIwNSwgVVNBIhoSGAoUChIJtcpdFsZbwokRiFlwXEBWntUQYw',
u'types': [u'street_address']},
{u'address_components': [{u'long_name': u'Clinton Avenue',
u'short_name': u'Clinton Ave',
u'types': [u'route']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']},
{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']}],
u'formatted_address': u'Clinton Ave, Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.6978344,
u'lng': -73.96973249999999},
u'southwest': {u'lat': 40.6977702,
u'lng': -73.9697406}},
u'location': {u'lat': 40.6978023,
u'lng': -73.96973659999999},
u'location_type': u'GEOMETRIC_CENTER',
u'viewport': {u'northeast': {u'lat': 40.6991512802915,
u'lng': -73.96838756970848},
u'southwest': {u'lat': 40.6964533197085,
u'lng': -73.9710855302915}}},
u'place_id': u'ChIJu0sva8ZbwokRrMKu_ii2I2Y',
u'types': [u'route']},
{u'address_components': [{u'long_name': u'11205',
u'short_name': u'11205',
u'types': [u'postal_code']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Brooklyn, NY 11205, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.7066249,
u'lng': -73.948331},
u'southwest': {u'lat': 40.68741490000001,
u'lng': -73.980632}},
u'location': {u'lat': 40.6945036,
u'lng': -73.9565551},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.7066249,
u'lng': -73.948331},
u'southwest': {u'lat': 40.68741490000001,
u'lng': -73.980632}}},
u'place_id': u'ChIJLywE8sZbwokRiLapmmo79YU',
u'types': [u'postal_code']},
{u'address_components': [{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Kings County, Brooklyn, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}},
u'location': {u'lat': 40.6528762,
u'lng': -73.95949399999999},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}}},
u'place_id': u'ChIJOwE7_GTtwokRs75rhW4_I6M',
u'types': [u'administrative_area_level_2', u'political']},
{u'address_components': [{u'long_name': u'Brooklyn',
u'short_name': u'Brooklyn',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'long_name': u'Kings County',
u'short_name': u'Kings County',
u'types': [u'administrative_area_level_2',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'Brooklyn, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}},
u'location': {u'lat': 40.6781784,
u'lng': -73.9441579},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.739446,
u'lng': -73.8333651},
u'southwest': {u'lat': 40.551042,
u'lng': -74.05663}}},
u'place_id': u'ChIJCSF8lBZEwokRhngABHRcdoI',
u'types': [u'political',
u'sublocality',
u'sublocality_level_1']},
{u'address_components': [{u'long_name': u'New York',
u'short_name': u'New York',
u'types': [u'locality',
u'political']},
{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'New York, NY, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 40.9175771,
u'lng': -73.70027209999999},
u'southwest': {u'lat': 40.4773991,
u'lng': -74.25908989999999}},
u'location': {u'lat': 40.7127753,
u'lng': -74.0059728},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 40.9175771,
u'lng': -73.70027209999999},
u'southwest': {u'lat': 40.4773991,
u'lng': -74.25908989999999}}},
u'place_id': u'ChIJOwg_06VPwokRYv534QaPC8g',
u'types': [u'locality', u'political']},
{u'address_components': [{u'long_name': u'New York',
u'short_name': u'NY',
u'types': [u'administrative_area_level_1',
u'political']},
{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'New York, USA',
u'geometry': {u'bounds': {u'northeast': {u'lat': 45.015865,
u'lng': -71.777491},
u'southwest': {u'lat': 40.4773991,
u'lng': -79.7625901}},
u'location': {u'lat': 43.2994285,
u'lng': -74.21793260000001},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 45.015865,
u'lng': -71.777491},
u'southwest': {u'lat': 40.4773991,
u'lng': -79.7625901}}},
u'place_id': u'ChIJqaUj8fBLzEwRZ5UY3sHGz90',
u'types': [u'administrative_area_level_1', u'political']},
{u'address_components': [{u'long_name': u'United States',
u'short_name': u'US',
u'types': [u'country',
u'political']}],
u'formatted_address': u'United States',
u'geometry': {u'bounds': {u'northeast': {u'lat': 71.5388001,
u'lng': -66.885417},
u'southwest': {u'lat': 18.7763,
u'lng': 170.5957}},
u'location': {u'lat': 37.09024,
u'lng': -95.712891},
u'location_type': u'APPROXIMATE',
u'viewport': {u'northeast': {u'lat': 71.5388001,
u'lng': -66.885417},
u'southwest': {u'lat': 18.7763,
u'lng': 170.5957}}},
u'place_id': u'ChIJCzYy5IS16lQRQrfeQ5K5Oxw',
u'types': [u'country', u'political']}],
u'status': u'OK'}
In [421]: sorted(dedupeddf['start station name'].value_counts().to_dict().items(), key
...: =lambda x: x[1])[-1]
Out[421]: ('Kent Ave & N 7 St', 3)
In [422]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St']
Out[422]:
start station name ... latlng
360 Kent Ave & N 7 St ... 40.7207255341,-73.9612591267
433 Kent Ave & N 7 St ... 40.72057658,-73.96150225
441 Kent Ave & N 7 St ... 40.720367753,-73.9616507292
[3 rows x 4 columns]
In [423]: dedupeddf[dedupeddf['start station name'] == 'Kent Ave & N 7 St'].values
Out[423]:
array([['Kent Ave & N 7 St', 40.720725534125954, -73.9612591266632,
'40.7207255341,-73.9612591267'],
['Kent Ave & N 7 St', 40.72057658, -73.96150225,
'40.72057658,-73.96150225'],
['Kent Ave & N 7 St', 40.72036775298455, -73.96165072917937,
'40.720367753,-73.9616507292']], dtype=object)
%time stationsdf = getgeo.extract_stations_latlng_df_from_files(filenames=[
...
])
# Wall time: 1min 25s
# In [414]: stationsdf.shape
# Out[414]: (7127, 4)
dedupeddf = getgeo.some_stationdf_dedupe(stationsdf)
# In [431]: dedupeddf.shape
# Out[431]: (682, 4)
annotated_df = getgeo.annotate_station_df(dedupeddf)
newdf = pd.concat([dedupeddf, foodf], axis=1)
In [563]: handmade
Out[563]:
[{'neighborhood': 'Fort Greene', 'station': 'DeKalb Ave & Hudson Ave'},
{'neighborhood': u'NoMad',
'postal_code': u'10001',
'state': u'NY',
'station': 'Broadway & W 29 St',
'sublocality': u'Manhattan'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Gold St'},
{'neighborhood': 'Dumbo', 'station': 'York St & Jay St'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Flushing Ave & Carlton Ave'},
{'neighborhood': 'Columbia Street Waterfront District',
'station': 'Atlantic Ave & Furman St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Railroad Ave & Kay Ave'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Clinton Ave & Flushing Ave'},
{'neighborhood': 'Park Slope', 'station': 'Dean St & 4 Ave'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Nassau St & Navy St'},
{'neighborhood': 'Vinegar Hill', 'station': 'Front St & Gold St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': '7 Ave & Farragut St'},
{'neighborhood': 'Brooklyn Navy Yard', 'station': 'Sands St & Navy St'},
{'neighborhood': 'Brooklyn Navy Yard',
'station': 'Carlton Ave & Flushing Ave'},
{'neighborhood': 'Financial District', 'station': 'Peck Slip & Front St'},
{'neighborhood': 'Prospect Heights',
'station': 'Bike The Branches - Central Branch'},
{'neighborhood': 'Cobble Hill', 'station': 'Henry St & Degraw St'},
{'neighborhood': 'Park Slope', 'station': 'Union St & 4 Ave'},
{'neighborhood': 'Park Slope', 'station': 'Douglass St & 4 Ave'},
{'neighborhood': 'Prospect Park',
'station': 'Bike in Movie Night | Prospect Park Bandshell'},
{'neighborhood': 'Park Slope', 'station': 'West Drive & Prospect Park West'},
{'neighborhood': 'Park Slope', 'station': '4 Ave & 2 St'}]
handmadedf = handmadedf.rename(columns={'neighborhood': 'hand_neighborhood'})
moredf = newdf.merge(handmadedf, left_on='start station name', right_on='station', how='left')
# update those 22 hand made neighborhood labels.
moredf.ix[moredf[moredf.neighborhood.isnull()].index.tolist(),'neighborhood'] = moredf.ix[moredf[moredf.neighborhood.isnull()].index.tolist(), 'hand_neighborhood']
moredf.drop(labels=['sublocality_y', 'state_y', 'postal_code_y'],axis=1, inplace=True)
moredf.rename(columns={'sublocality_x':'sublocality','state_x':'state','postal_code_x': 'postal_code'},inplace=True)
moredf.drop(labels=['hand_neighborhood', 'station'],axis=1, inplace=True)
In [602]: fn = '/........./learn-citibike/datas/station
...: s/stations-2018-12-04-b.pkl'
In [603]: moredf.to_pickle(fn)
In [604]: previouslymissing = ['MacDougal St & Prince St', 'Washington Square E', 'E 8
...: 1 St & York Ave', 'Schermerhorn St & Court St', 'E 77 St & Park Ave', 'Cente
...: r Blvd & Borden Ave', 'E 77 St & 3 Ave', 'E 71 St & 2 Ave', 'University Pl &
...: E 8 St', 'Leonard St & Meeker Ave', 'PABT Valet', 'Broadway & Roebling St',
...: 'E 80 St & 2 Ave', 'Union Ave & N 12 St', 'Monroe St & Tompkins Ave', 'E 40
...: St & 5 Ave', 'E 58 St & 1 Ave']
In [605]: moredf['start station name'].value_counts().shape
Out[605]: (682,)
In [606]: moredf.shape
Out[606]: (682, 9)
In [607]: moredf[moredf['start station name'].isin(previouslymissing)].shape
Out[607]: (17, 9)
In [608]: len(previouslymissing)
Out[608]: 17
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot
clffnpdf = 'myfile.pdf'
dot_data = StringIO()
# First one
clf = bundle['clf'].estimators_[0]
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
# (note: newer pydot versions return a list here, i.e. graph_from_dot_data(...)[0])
graph.write_pdf(clffnpdf)
* Updated the `/invocations` endpoint to support labeled data in order to do batch model validation on new datasets. Put the 170MiB `201603-citibike-tripdata.csv` file into S3 and specified this as the input, and I got,
2018/12/09 21:38:02 [error] 9#9: *4 client intended to send too large body: 6291424 bytes, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", host: "169.254.255.131:8080"
* 25.2MiB, but how would that help if the above error seems to complain about 6291424 bytes, ~6MiB? 6MiB is some kind of a default; the `nginx.conf` uses `client_max_body_size 5m;`. Changed that to `8m` and I will retry. But that means I need to update my csv df hydrate function to be headerless.
* The 170MiB `201603-citibike-tripdata.csv` input file was split into batches of 8MiB, and the outputs are json of the form `{"confusion_matrix": [[0, 0,...],[],...]}`, so the output which Batch has saved is a concatenation of these,
{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}{"confusion_matrix": [[0, 0,...],[],...]}
$ ag --count 'confusion' ~/Downloads/201603-citibike-tripdata.csv.out
29
import json

def unpack_oneliners(concatenated):
    # '{"a": 1}{"b": 2}' style: split on the '}{' seams and re-wrap
    elements = concatenated[1:-1].split('}{')
    normalized = ['{' + x + '}' for x in elements]
    return [json.loads(x) for x in normalized]

def unpack_multiliners(concatenated):
    # newline-delimited json style
    elements = concatenated.split('\n')
    elements_no_empties = [x for x in elements if x != '']
    return [json.loads(x) for x in elements_no_empties]

def unpack_concatenated(concatenated):
    if '}{' in concatenated:
        return unpack_oneliners(concatenated)
    elif '\n' in concatenated:
        return unpack_multiliners(concatenated)
    raise Exception('Shouldnt be here')
def read_file(fn):
    with open(fn) as fd:
        return fd.read()
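An alternative sketch that handles both concatenation styles in one pass with `json.JSONDecoder.raw_decode`, in case a `'}{'` ever shows up inside a string value:

```python
import json

def unpack_concatenated_raw(concatenated):
    # repeatedly decode one JSON object, then skip any whitespace between objects
    decoder = json.JSONDecoder()
    idx, out = 0, []
    while idx < len(concatenated):
        obj, end = decoder.raw_decode(concatenated, idx)
        out.append(obj)
        idx = end
        while idx < len(concatenated) and concatenated[idx].isspace():
            idx += 1
    return out
```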
local_filenames = [
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201603-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201604-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201605-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201606-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201607-citibike-tripdata.csv.out',
'/Users/michal/Downloads/2018-10-21-comparingmodels/2018-12-24--batch-results/201608-citibike-tripdata.csv.out',]
batch_outputs = [read_file(x) for x in local_filenames]
unpacked = [unpack_concatenated(x) for x in batch_outputs]
rank_proba_scores = [
[x.get("rank_k_proba_scores").get('10') for x in vec]
for vec in unpacked]
monthly_means = [np.mean(vec) for vec in rank_proba_scores]
rank_k_proba_scores = [x.get("rank_k_proba_scores") for x in unpack_concatenated(concatenated_data)]
len(rank_k_proba_scores)
[x.get('10') for x in rank_k_proba_scores]
Interestingly, this doesn't seem to be degrading. I should look deeper for possible scoring problems.
In [127]: zip([x.split('/')[-1] for x in local_filenames], monthly_means)
Out[127]:
[('201603-citibike-tripdata.csv.out', 0.6159115590305551),
('201604-citibike-tripdata.csv.out', 0.6086896594073963),
('201605-citibike-tripdata.csv.out', 0.5982699463742136),
('201606-citibike-tripdata.csv.out', 0.5963199010718998),
('201607-citibike-tripdata.csv.out', 0.6042312393820078),
('201608-citibike-tripdata.csv.out', 0.5854122761160239)]
I can also look at the spread factor of the confusion matrices in these outputs.
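A rough sketch of one such spread measure (my own definition, nothing standard): the fraction of all predictions landing in the single most-predicted class, per confusion matrix, where 1.0 means total collapse onto one label:

```python
import numpy as np

def prediction_concentration(conf_matrix):
    # column sums count how many predictions went to each class
    col_sums = np.asarray(conf_matrix).sum(axis=0)
    return float(col_sums.max()) / col_sums.sum()

# e.g. [prediction_concentration(y['confusion_matrix']) for y in unpacked[-1]]
```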
* Looking at `unpacked[-1]`, the last Batch output, `201608`,
In [149]: print [sorted(zip(range(64), xconfdf.sum()), key=lambda x:x[1])[-1] for xcon
...: fdf in [pd.DataFrame.from_records(y.get('confusion_matrix')) for y in unpack
...: ed[-1]]]
[(9, 28625), (9, 27594), (9, 27853), (9, 27147), (9, 27642), (9, 27510), (9, 27963), (9, 27361), (9, 26306), (9, 24850), (9, 23930), (9, 27256), (9, 27718), (9, 27702), (9, 27121), (9, 27274), (9, 27542), (9, 27119), (9, 26991), (9, 25190), (9, 24916), (9, 26978), (9, 27208), (9, 27420), (9, 26936), (9, 27435), (9, 27285), (9, 26947), (9, 26673), (9, 25079), (9, 24164), (9, 25477), (9, 27021), (9, 26625), (9, 27119), (9, 26689), (9, 26947), (9, 26757), (9, 26867), (9, 26578), (9, 24042), (9, 23567), (9, 23287), (9, 26644), (9, 26561), (9, 26938), (9, 26166), (9, 26878), (9, 6395)]
* What is that class `9`?
import cPickle
import bikelearn.classify as blc
fn = '/Users/michal/Downloads/2018-12-07-update-model/2018-12-07-update-model/tree-foo-bundle-pensive-swirles.2018-12-04T210259ZUTC.pkl'
with open(fn) as fd: bundle = cPickle.load(fd)
# label_encoder for the end neighborhood..
blc.label_decode(bundle['label_encoders']['end_neighborhood'], [9])
# this does label_encoder.inverse_transform(vec)
In [159]: blc.label_decode(bundle['label_encoders']['end_neighborhood'], [9])
/usr/local/miniconda3/envs/citilearnsage/lib/python2.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Out[159]: array(['Central Park'], dtype=object)
* What the heck, `Central Park`? Can this really be true? Pretty weird.
* That was `201603` through `201608` data. I can also do the evaluation on `201609` through `201712`. There should damn well be some kind of deterioration, at least because many new stations should be missing from the input data? This perhaps would mean I need to update the station csv? Not sure. The functionality needs to be present such that, when taking a dataset which has stations which are not in the stations file, the data is encoded as `-1`, right, as missing?
(citilearnsage) $ head ../data/citibike/201610-citibike-tripdata.csv
Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
328,2016-10-01 00:00:07,2016-10-01 00:05:35,471,Grand St & Havemeyer St,40.71286844,-73.95698119,3077,Stagg St & Union Ave,40.70877084,-73.95095259,25254,Subscriber,1992,1
398,2016-10-01 00:00:11,2016-10-01 00:06:49,3147,E 85 St & 3 Ave,40.77801203,-73.95407149,3140,1 Ave & E 78 St,40.77140426,-73.9535166,17810,Subscriber,1988,2
430,2016-10-01 00:00:14,2016-10-01 00:07:25,345,W 13 St & 6 Ave,40.73649403,-73.99704374,470,W 20 St & 8 Ave,40.74345335,-74.00004031,20940,Subscriber,1965,1
351,2016-10-01 00:00:21,2016-10-01 00:06:12,3307,West End Ave & W 94 St,40.7941654,-73.974124,3357,W 106 St & Amsterdam Ave,40.8008363,-73.9664492472,19086,Subscriber,1993,1
2693,2016-10-01 00:00:21,2016-10-01 00:45:15,3428,8 Ave & W 16 St,40.740983,-74.001702,3323,W 106 St & Central Park West,40.7981856,-73.9605909006,26502,Subscriber,1991,1
513,2016-10-01 00:00:28,2016-10-01 00:09:02,433,E 13 St & Avenue A,40.72955361,-73.98057249,151,Cleveland Pl & Spring St,40.722103786686034,-73.99724900722504,25800,Subscriber,1995,1
601,2016-10-01 00:00:51,2016-10-01 00:10:52,3314,W 95 St & Broadway,40.7937704,-73.971888,3374,Central Park North & Adam Clayton Powell Blvd,40.799484,-73.955613,15985,Subscriber,1972,2
563,2016-10-01 00:00:54,2016-10-01 00:10:18,453,W 22 St & 8 Ave,40.74475148,-73.99915362,485,W 37 St & 5 Ave,40.75038009,-73.98338988,26018,Subscriber,1984,1
439,2016-10-01 00:00:54,2016-10-01 00:08:13,534,Water - Whitehall Plaza,40.70255065,-74.0127234,360,William St & Pine St,40.70717936,-74.00887308,15374,Subscriber,1968,1
(citilearnsage) $
(citilearnsage) $ head -2 ../data/citibike/201605-citibike-tripdata.csv
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
"538","5/1/2016 00:00:03","5/1/2016 00:09:02","536","1 Ave & E 30 St","40.74144387","-73.97536082","497","E 17 St & Broadway","40.73704984","-73.99009296","23097","Subscriber","1986","2"
2018/12/09 21:38:02 [error] 9#9: *4 client intended to send too large body: 6291424 bytes, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", host: "169.254.255.131:8080"
* Updated `nginx.conf` from `client_max_body_size 5m;` to `client_max_body_size 8m;`, and trying `201603` again worked. `201609` failed, `pensive-swirles-job-201609`, because of the error `ClientError: S3 key: s3://my-sagemaker-blah/bikelearn/datasets/uncompressed/201609-citibike-tripdata.csv matched no files on s3`, so that one was a simple fix. `pensive-swirles-job-201610`, `201611` and beyond failed with the error,
169.254.255.130 - - [24/Dec/2018:04:20:48 +0000] "POST /invocations HTTP/1.1" 500 291 "-" "Go-http-client/1.1"
ValueError: time data '2016-10-01 00:00:07' does not match format '%m/%d/%Y %H:%M:%S'
* Because the file formats on `201610`, `201611`, `201612`, as far as I can tell, have changed the date formats.
* I corrected the Docker code reading dates, from `pensive-swirles-2-11` to `pensive-swirles-2-12`, and the problem went away; a sketch of the tolerant parse is below.
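A hedged sketch of handling both date formats (`parse_start_time` is my helper name, not the actual bikelearn code):

```python
from datetime import datetime

def parse_start_time(s):
    # the 201610+ files switched from '10/8/2015 11:01:24' style
    # to '2016-10-01 00:00:07' style, so try both formats
    for fmt in ('%m/%d/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S'):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized start time format: ' + s)
```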
#### But furthermore...
* I then automated the batch transform with the API, but the first time around job `Batch-Transform-2019-01-06-233207` failed because I forgot to include my special ENVIRONMENT variable, `DO_VALIDATION`, for indicating validation.
* I did add this for the job `Batch-Transform-2019-01-06-235825` and that one was successful.
#### but then the log ...
* for `Batch-Transform-2019-01-06-235833`, there is a mixed bag of successes and also timeouts,
2019/01/07 00:02:10 [error] 9#9: *10 upstream timed out (110: Connection timed out) while sending request to upstream, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/invocations", host: "169.254.255.131:8080"
* And I don't see the typical output file generated for this on s3, so I don't think this finished.
import upload_client as uc

inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz', '201701-citibike-tripdata.csv.gz']

for inputfile in inputs:
    print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile

for inputfile in inputs:
    path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
    print path
    print uc.start_batch_transform_job(input_location=path)
    import time
    time.sleep(2)
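For reference, a hedged sketch of what `uc.start_batch_transform_job` presumably wraps, using the boto3 `create_transform_job` API (the model name, instance type, and output path here are placeholders):

```python
import time
import boto3

def start_batch_transform_job(input_location):
    # sketch only; assumes the gzipped csv input and the DO_VALIDATION env var above
    job_name = 'Batch-Transform-' + time.strftime('%Y-%m-%d-%H%M%S')
    client = boto3.client('sagemaker', region_name='us-east-1')
    client.create_transform_job(
        TransformJobName=job_name,
        ModelName='citibike-learn-blah',
        Environment={'DO_VALIDATION': 'yes'},
        MaxPayloadInMB=8,  # lines up with the nginx client_max_body_size 8m bump
        TransformInput={
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix', 'S3Uri': input_location}},
            'ContentType': 'text/csv',
            'CompressionType': 'Gzip',
            'SplitType': 'Line',
        },
        TransformOutput={'S3OutputPath': 's3://my-sagemaker-blah/bikelearn/batch-output'},
        TransformResources={'InstanceType': 'ml.m4.xlarge', 'InstanceCount': 1},
    )
    print('created transform job with name: ' + job_name)
```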
In [17]: reload(uc)
Out[17]: <module 'update_client' from 'update_client.py'>
In [18]: inputs = ['201611-citibike-tripdata.csv.gz', '201612-citibike-tripdata.csv.gz
...: ', '201701-citibike-tripdata.csv.gz']
In [19]: for inputfile in inputs:
...: print 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfile
...:
...:
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz
In [20]: for inputfile in inputs:
...: path = 's3://my-sagemaker-blah/bikelearn/datasets/compressed/' + inputfil
...: e
...: print path
...: print uc.start_batch_transform_job(input_location=path)
...: import time
...: time.sleep(2)
...:
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201611-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001813
created transform job with name: Batch-Transform-2019-01-07-001813
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201612-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001815
created transform job with name: Batch-Transform-2019-01-07-001815
None
s3://my-sagemaker-blah/bikelearn/datasets/compressed/201701-citibike-tripdata.csv.gz
batch_job_name , Batch-Transform-2019-01-07-001818
created transform job with name: Batch-Transform-2019-01-07-001818
None
In [21]:
#### General plan
* `bikelearn.tar.gz` pip package which can be installed in a Docker container, to run a training job; make the python package ..
* Docker build.. update `dockerfiles/Dockerfile` to reflect the new package, `dist/bikelearn-0.1.2.tar.gz`, if needed
* update `local_test/test_dir/model/bundle_meta.json` if needed, with the new model name
* Run the serve endpoint
* Make datasets
* Test running the train job from inside the container,
docker run -v $(pwd)/test_dir:/opt/ml --rm ${image} train