sfbrigade / datasci-sba

Solving problems with the Small Business Administration

updated geocoder #25

Closed jpspeng closed 6 years ago

jpspeng commented 7 years ago

1. Brief Summary of what this PR accomplishes (140 characters or less. If you find trouble describing what you are doing in this length, consider breaking the PR into multiple ones.)

Adding Geocoder

2. Link to Trello Ticket

https://trello.com/c/wbM0HXeG/6-geocode-addresses-in-dataset

3. More detailed description and other questions to address in code review

Adding a geocoder for the dataset. This will add latitude and longitude for each loan where we have a valid address. It may also depend on an address normalizer.

4. Remember to tag reviewers! @avdonovan @makfan64

VincentLa14 commented 7 years ago

Thanks @jpspeng for putting this together! I'll take a look

VincentLa14 commented 7 years ago

@jpspeng as you're working with Git and GitHub, can you add to the onboarding doc? https://github.com/sfbrigade/datasci-sba/blob/master/onboarding/03_tips_and_tricks.md

jpspeng commented 7 years ago

Finished geocoding code using geopy and updated the notebook
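For readers following along, here is a minimal sketch of what geocoding a list of addresses with geopy might look like. It is not the notebook code from this PR. The names `GeocodeResult`, `geocode_addresses`, and `make_google_geocoder` are hypothetical, and the choice of geopy's `GoogleV3` geocoder (with an API key) is an assumption. The geocoder is passed in as a callable so the loop can be tried without network access.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class GeocodeResult(NamedTuple):
    address: str                       # original input address
    latitude: Optional[float]          # None when the address could not be resolved
    longitude: Optional[float]
    formatted_address: Optional[str]   # address string returned by the geocoder


def geocode_addresses(addresses: Iterable[str], geocode: Callable) -> List[GeocodeResult]:
    """Geocode each address; blank or unresolved addresses get None values."""
    results = []
    for address in addresses:
        # Skip the lookup entirely for missing or blank addresses.
        location = geocode(address) if address and address.strip() else None
        if location is None:
            results.append(GeocodeResult(address, None, None, None))
        else:
            results.append(
                GeocodeResult(address, location.latitude, location.longitude, location.address)
            )
    return results


def make_google_geocoder(api_key: str) -> Callable:
    """Return the geocode method of geopy's GoogleV3 (needs geopy and network access)."""
    # Imported lazily so the rest of the sketch loads without geopy installed.
    from geopy.geocoders import GoogleV3
    return GoogleV3(api_key=api_key).geocode
```

Under these assumptions, `geocode_addresses(addresses, make_google_geocoder(api_key))` would return one result per input address, with `None` coordinates wherever the geocoder found no match.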

VincentLa commented 7 years ago

Any chance you ran it on the whole database table?

jpspeng commented 7 years ago

I could... How do I store the new dataframe?

Also, we could standardize most of the addresses by passing them to the geocoder and returning a new column, "Standardized Address", that holds the Google-standardized address.

On Tue, Jul 18, 2017 at 8:51 AM, Vincent La wrote:

Any chance you ran it on the whole database table?


VincentLa commented 7 years ago

You have to write to the DB. Check out some of the Python code in the pipeline_runner directory. I think we will eventually want to put it in there. I can show you how that works; it might take some explaining because it's nontrivial.

The short answer is to check out the pandas DataFrame.to_sql method. In the pipeline code, though, I actually define my own class, DBManager, to help manage connections to the DB.

Re: standardized addresses -- that sounds like a great strategy.
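As a rough illustration of the `to_sql` suggestion, the sketch below writes a small dataframe to a database table and reads the row count back. It uses an in-memory SQLite database only to stay self-contained, which is an assumption. It does not use the project's DBManager class, and the table name, column names, and values are made-up placeholders.

```python
import sqlite3

import pandas as pd

# Placeholder geocoding results; the values are invented for illustration.
geocoded = pd.DataFrame(
    {
        "borrower_address": ["123 Example St", "unknown"],
        "latitude": [1.5, None],
        "longitude": [-2.5, None],
    }
)


def write_geocoded(df: pd.DataFrame, conn, table: str = "geocoded_addresses") -> int:
    """Write df to the given table, replacing it if it exists, and return the row count."""
    df.to_sql(table, conn, if_exists="replace", index=False)
    count = pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {table}", conn)
    return int(count["n"].iloc[0])


conn = sqlite3.connect(":memory:")
rows_written = write_geocoded(geocoded, conn)
```

With a SQLAlchemy engine in place of the SQLite connection, `to_sql` also accepts a `schema` argument, which may be useful if the results are meant to live in their own schema.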

jpspeng commented 7 years ago

ok will look through it

On Tue, Jul 18, 2017 at 9:01 AM, Vincent La wrote:

But in the pipeline code, I actually define my own class DBManager to help manage connections to DB


VincentLa14 commented 7 years ago

@jpspeng just checking in to see if you had other questions?

VincentLa14 commented 7 years ago

The next step will be to write the results to the DB. See https://github.com/sfbrigade/datasci-sba/blob/master/pipeline/pipeline_tasks/parse/00_01_01_load_sba_datasets.py as an example. Maybe the schema name could be api_calls?

VincentLa14 commented 7 years ago

I added a doc for running the pipeline_runner here: https://github.com/sfbrigade/datasci-sba/blob/master/pipeline/README.md

makfan64 commented 7 years ago

@avdonovan see Pull Request 59 for a proposal to deal with the limits for Yelp and Google Maps/Google Civic. It's just a proposal at this point.

avdonovan commented 7 years ago

@makfan64 awesome, thanks for pointing me to that other PR. I'll add some comments there and let's discuss on Wednesday.

VincentLa14 commented 6 years ago

@avdonovan oh shoot didn't see these comments until now!

VincentLa14 commented 6 years ago

I'm going to merge this in now actually since it's mostly in a good state and there's a lot of conversation here already. We can deal with the scheduling of tasks in a different PR.