sfbrigade / datasci-sba

Solving problems with the Small Business Administration

update yelp script #48

Status: Closed (zlatankr closed 7 years ago)

zlatankr commented 7 years ago

1. Brief Summary of what this PR accomplishes (140 characters or less. If you find trouble describing what you are doing in this length, consider breaking the PR into multiple ones.)

Cleaned up the Yelp script, which scrapes Yelp data and pushes it into a new table in our PostgreSQL database.

2. Link to Trello Ticket

https://trello.com/c/1JiYtVmg

3. More detailed description and other questions to address in code review

I ran the code outside of the function, so we need to make sure that this script can run successfully in the pipeline. Additionally, I combine the Yelp data with the sfdo data and push the result into a new table, but maybe there's a better way to do it?

We need to add the Yelp credentials (see the Slack chat) to the environment variables.
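A minimal sketch of reading the Yelp credentials from environment variables, so they stay out of the repository. The variable names `YELP_CLIENT_ID` and `YELP_CLIENT_SECRET` and the helper `load_yelp_credentials` are hypothetical placeholders, not names taken from this PR; use whatever names are agreed on in Slack.

```python
import os


def load_yelp_credentials(environ=None):
    """Return Yelp credentials read from environment variables.

    The variable names below are hypothetical placeholders.
    """
    environ = os.environ if environ is None else environ
    required = ("YELP_CLIENT_ID", "YELP_CLIENT_SECRET")

    # Fail early with a clear message instead of failing mid-pipeline.
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return {name: environ[name] for name in required}
```

Passing a plain dict as `environ` makes the function easy to exercise in tests without touching the real environment.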

4. Remember to tag reviewers! @VincentLa

VincentLa14 commented 7 years ago

Ok, made some "seemingly" big changes, but the core of it is still the same. Here's what I did:

  1. The api_calls directory now contains helper modules. Each file holds the functions for accessing its respective API. For example, yelp_ratings.py hits the Yelp API to get Yelp ratings, and congressional_districts.py hits the Google Civic API to get Congressional Districts.
  2. These helpers are all called from 00_01_03_sba_sfdo_api_calls.py, and the results are new fields that get written to stg_analytics.sba_sfdo_api_calls.
  3. I created a surrogate primary key, sba_sfdo_id, which we can then use to join back to the original sba_sfdo data.
  4. The final table in the sequence is stg_analytics.sba_sfdo_all.
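
The flow in the list above can be sketched in plain Python. This is an illustration under assumptions, not the code in this PR: the rows are in-memory dicts rather than PostgreSQL tables, the `name` and `zip` columns and all values are made up, and `fake_yelp_rating` and `fake_congressional_district` are hypothetical stand-ins for the real helpers in api_calls, which would call the external APIs.

```python
def add_surrogate_key(rows, key="sba_sfdo_id"):
    """Give each source row a sequential surrogate primary key."""
    return [dict(row, **{key: i}) for i, row in enumerate(rows, start=1)]


def run_api_calls(rows, api_functions, key="sba_sfdo_id"):
    """Call every helper on every row; return one result row per source row."""
    results = []
    for row in rows:
        result = {key: row[key]}
        for field, func in api_functions.items():
            result[field] = func(row)
        results.append(result)
    return results


def join_on_key(rows, api_rows, key="sba_sfdo_id"):
    """Join the API results back onto the source rows by the surrogate key."""
    by_key = {r[key]: r for r in api_rows}
    return [dict(row, **by_key.get(row[key], {})) for row in rows]


# Hypothetical stand-ins for the API helpers; None means "no match found".
def fake_yelp_rating(row):
    return {"Cafe A": 4.5}.get(row["name"])


def fake_congressional_district(row):
    return {"00001": "district-1"}.get(row["zip"])


source = add_surrogate_key([
    {"name": "Cafe A", "zip": "00001"},
    {"name": "Shop B", "zip": "00002"},
])
api_rows = run_api_calls(source, {
    "yelp_rating": fake_yelp_rating,
    "congressional_district": fake_congressional_district,
})
combined = join_on_key(source, api_rows)
```

Here `api_rows` plays the role of stg_analytics.sba_sfdo_api_calls and `combined` the role of stg_analytics.sba_sfdo_all. Keeping the API results in their own table keyed by sba_sfdo_id means a new API helper only adds a field, without touching the source data.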

I tested against a local database truncated to just 100 rows. It runs successfully, but the success rate for finding Yelp reviews seems pretty low on those first 100 rows (maybe 5 businesses had Yelp reviews?).

Running on production data now; we'll see how many hits we get. As you mention in the code, I think a big part of the low hit rate is due to bad address normalization.
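
To illustrate the kind of address normalization that might raise the match rate, here is a rough sketch. It is not the normalization used in this repo: the `normalize_address` helper and the abbreviation table are assumptions for illustration, and real addresses would need many more rules (unit numbers, directionals, and so on).

```python
import re

# Hypothetical, deliberately small abbreviation table.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "suite": "ste",
}


def normalize_address(address):
    """Lowercase, drop punctuation, and abbreviate common street words."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)
```

With this, `normalize_address("123 Main Street, Suite 4")` and `normalize_address("123 MAIN ST. STE 4")` both produce `123 main st ste 4`, so two spellings of the same address compare equal before they are sent to the API or matched against its results.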