Closed · jpspeng closed this 6 years ago
Thanks @jpspeng for putting this together! I'll take a look
@jpspeng as you're working with Git and Github can you add to the onboarding doc? https://github.com/sfbrigade/datasci-sba/blob/master/onboarding/03_tips_and_tricks.md
Finished geocoding code using geopy and updated the notebook
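For reference, a minimal sketch of the per-address geocoding step. The function names and failure handling here are illustrative, not the exact notebook code; `geocode_fn` is left pluggable so the sketch stays provider-agnostic, though in practice it would presumably be geopy's `Nominatim(...).geocode` (or `GoogleV3`), ideally wrapped in `geopy.extra.rate_limiter.RateLimiter` to respect rate limits:

```python
# Hedged sketch: geocode one address, returning (lat, lon) or (None, None).
# With geopy, geocode_fn would typically be built like:
#   from geopy.geocoders import Nominatim
#   from geopy.extra.rate_limiter import RateLimiter
#   geocode_fn = RateLimiter(
#       Nominatim(user_agent="datasci-sba").geocode, min_delay_seconds=1
#   )

def geocode_address(address, geocode_fn):
    """Return (latitude, longitude) for an address, or (None, None)
    when the address is empty, the lookup fails, or nothing is found."""
    if not address:
        return (None, None)
    try:
        location = geocode_fn(address)
    except Exception:
        # Network hiccups / timeouts should not abort a whole-table run.
        return (None, None)
    if location is None:
        return (None, None)
    # geopy Location objects expose .latitude and .longitude attributes.
    return (location.latitude, location.longitude)
```

Swallowing exceptions per row is a deliberate choice for a whole-table batch run: one bad address or transient timeout should produce a null result, not kill the job.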
Any chance you ran it on the whole database table?
I could... How do I store the new dataframe?
Also, we could standardize most of the addresses by feeding them into the geocoder and returning a new column, "Standardized Address", containing the Google-standardized address.
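The standardized-address idea above could look roughly like this. The column names and the helper are assumptions for illustration; geopy `Location` objects do expose the geocoder's canonical formatting as `.address`, which is what a "Standardized Address" column would capture:

```python
import pandas as pd

def add_standardized_address(df, geocode_fn, address_col="Address"):
    """Return a copy of df with a 'Standardized Address' column holding
    the geocoder's canonical formatting of each address (geopy exposes
    this on Location objects as .address). Rows with missing or
    unresolvable addresses get None."""
    def standardize(raw):
        if not isinstance(raw, str) or not raw.strip():
            return None
        try:
            location = geocode_fn(raw)
        except Exception:
            return None
        return location.address if location is not None else None

    out = df.copy()
    out["Standardized Address"] = out[address_col].apply(standardize)
    return out
```

Running the messy addresses through the geocoder once and keeping its formatted output gives a consistent key for later joins and deduplication, without writing a normalizer from scratch.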
You have to write to the DB. Check out some of the Python code in the pipeline_runner directory. I think we will eventually want to put it in there. I can show you how that works; it might take some explaining since it's nontrivial.
But the short answer is to check out the pandas.DataFrame.to_sql function. In the pipeline code, I actually define my own DBManager class to help manage connections to the DB.
Re: standardized addresses -- that sounds like a great strategy.
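A rough sketch of the `DataFrame.to_sql` approach mentioned above. The table name and the tiny connection-manager class are illustrative stand-ins; the real DBManager in the pipeline code presumably targets the project's Postgres database rather than SQLite (on Postgres, `to_sql` also accepts a `schema=` argument, e.g. the `api_calls` schema suggested later in this thread):

```python
import sqlite3
import pandas as pd

class DBManager:
    """Toy stand-in for the pipeline's connection manager: opens a
    connection on entry and guarantees it is closed on exit, even if
    the write fails."""
    def __init__(self, db_url=":memory:"):
        self.db_url = db_url

    def __enter__(self):
        self.conn = sqlite3.connect(self.db_url)
        return self.conn

    def __exit__(self, exc_type, exc, tb):
        self.conn.close()

def write_geocoded(df, conn, table_name="geocoded_addresses"):
    # if_exists="replace" drops and recreates the table on reruns;
    # index=False keeps the pandas index out of the table.
    df.to_sql(table_name, conn, if_exists="replace", index=False)
```

Usage would be something like `with DBManager("geocoded.db") as conn: write_geocoded(df, conn)`; `if_exists="replace"` makes reruns of the geocoding job idempotent rather than erroring on an existing table.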
ok will look through it
@jpspeng just checking in to see if you had other questions?
Next step will be to write the results to the DB. See https://github.com/sfbrigade/datasci-sba/blob/master/pipeline/pipeline_tasks/parse/00_01_01_load_sba_datasets.py as an example. Maybe the schema name should be api_calls?
I added a doc for running the pipeline_runner here: https://github.com/sfbrigade/datasci-sba/blob/master/pipeline/README.md
@avdonovan see Pull Request 59 for a proposal to deal with the limits for Yelp and Google Maps/Google Civic. It's just a proposal at this point.
@makfan64 awesome, thanks for pointing me to that other PR. I'll add some comments there and let's discuss on Wednesday.
@avdonovan oh shoot didn't see these comments until now!
I'm going to merge this in now actually since it's mostly in a good state and there's a lot of conversation here already. We can deal with the scheduling of tasks in a different PR.
1. Brief Summary of what this PR accomplishes (140 characters or less. If you find trouble describing what you are doing in this length, consider breaking the PR into multiple ones.)
Adding Geocoder
2. Link to Trello Ticket
https://trello.com/c/wbM0HXeG/6-geocode-addresses-in-dataset
3. More detailed description and other questions to address in code review
Adding in geocoder for the data set. This will add latitude and longitude for each loan where we have a valid address. This may also be dependent on an address normalizer.
4. Remember to tag reviewers! @avdonovan @makfan64