intake / intake-datasets

Catalogs, data packages and resources for Intake

Yelp dataset. #3

Open gregorylivschitz opened 5 years ago

gregorylivschitz commented 5 years ago

I think adding the Yelp dataset would be very helpful. In my experience, many professors use it in data analytics classes, programming-for-data classes, machine learning classes, etc.

Here is the link: https://www.yelp.com/dataset/challenge

I think being able to easily intake this dataset would go a long way toward making Intake a more popular library. There are a few problems with the dataset: you need to agree to the terms of service and provide a name and email address before downloading it. We could have users supply those values via environment variables, much like how auth works, and then have a driver make the POST request.
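To make that concrete, here is a minimal sketch of reading those fields from environment variables. The variable names (`YELP_NAME`, etc.) are hypothetical; an Intake driver could standardize on any prefix:

```python
import os

def get_user_details():
    """Read the Yelp form fields from environment variables.
    The variable names here are an assumption, not an existing convention."""
    try:
        return {
            'name': os.environ['YELP_NAME'],
            'email': os.environ['YELP_EMAIL'],
            # initials confirming agreement to Yelp's dataset license
            'signature': os.environ['YELP_SIGNATURE'],
        }
    except KeyError as e:
        raise RuntimeError(
            "Set the %s environment variable to accept Yelp's dataset terms"
            % e.args[0]) from e
```

Failing loudly when a variable is missing matters here, because the user has to knowingly accept the license terms rather than have them silently defaulted.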

martindurant commented 5 years ago

Do you know the details of how to list datasets and how to provide those auth fields in a HTTP header (is it a cookie?)?

gregorylivschitz commented 5 years ago

@martindurant

I had a little bit of time today, so I wrote a script to download it; obviously this is just a quick and dirty example. It does take a while, mostly because the only way to get the data is a tar file of about 3 GB.

Unfortunately it looks like you first need to get the CSRF token, because it has to be passed in the request.

We could also get it from Kaggle: https://www.kaggle.com/yelp-dataset/yelp-dataset

But I would rather get it from the source, and if we went with Kaggle it looks like we would still need the user to pass an API key; I haven't looked into it too much, though.

Anyway, all this needs from the user is their name, email, and signature (initials) as variables. They have to be passed in the form body during the POST, not in headers, since this isn't really auth, and the user needs to be aware that they are agreeing to the license agreement Yelp provides.

There isn't a way to list the datasets, from the looks of it, unless we want to parse the docs page: https://www.yelp.com/dataset/documentation/main
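If we did want a listing, a sketch of scraping file names out of that documentation page might look like the following. The choice of heading tags is an assumption (the page has no stable ids or classes to hook into), so this would be fragile:

```python
from bs4 import BeautifulSoup

def list_yelp_files(html):
    """Collect names like 'business.json' from the documentation HTML.
    Keys off headings whose text ends in '.json' -- an assumption about
    the page layout, since there are no stable ids/classes to target."""
    soup = BeautifulSoup(html, 'html.parser')
    return sorted({tag.get_text(strip=True)
                   for tag in soup.find_all(['h2', 'h3'])
                   if tag.get_text(strip=True).endswith('.json')})
```

Any change to the page layout would break this, which is why a proper listing endpoint from Yelp would be preferable.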

import requests
from bs4 import BeautifulSoup

def get_csrf_token(session):
    response_get = session.get('https://www.yelp.com/dataset/download')
    soup = BeautifulSoup(response_get.content, 'html.parser')
    form = soup.find(id="dataset_form")
    csrf_token = form.find(class_='csrftok')['value']
    return csrf_token

def get_yelp_download_page(session, name, email, signature, csrf_token):
    payload = {'name': name, 'email': email, 'signature': signature,
               'csrftok': csrf_token,
               'terms_accepted': 'y'}
    response_post = session.post('https://www.yelp.com/dataset/download', data=payload)
    soup = BeautifulSoup(response_post.content, 'html.parser')
    # Unfortunately Yelp doesn't use ids or meaningful classes here, so match on the link text.
    yelp_data_href = soup.find('a', string='Download JSON')['href']
    return yelp_data_href

def download_yelp_data_to_file(session, yelp_data_href):
    print("starting to download")
    response_yelp_data = session.get(yelp_data_href, stream=True)
    # It's pretty big 3gigs so let's just read it into a file for this example.
    with open('yelpdata.tar', 'wb') as f:
        for chunk in response_yelp_data.iter_content(chunk_size=25000000):
            if chunk:
                print("writing chunk")
                f.write(chunk)
    print("finished downloading")

def download_yelp():
    # need to get these from user
    name = 'fake_name'
    email = 'fake_email9369283709@gmail.com'
    # this is just initials to agree to license
    signature = 'fn'
    session = requests.Session()
    csrf_token = get_csrf_token(session)
    yelp_download_page = get_yelp_download_page(session, name, email, signature, csrf_token)
    download_yelp_data_to_file(session, yelp_download_page)

if __name__ == '__main__':
    download_yelp()
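Once the tar is on disk, unpacking it is straightforward with the stdlib. A small sketch (the file and directory names just mirror the script above; whether the archive members are the individual JSON files is an assumption from the docs):

```python
import tarfile

def extract_yelp_data(tar_path='yelpdata.tar', dest='yelp_data'):
    """Unpack the downloaded archive and return the member names.
    tarfile streams members one at a time, so the 3 GB archive is
    never loaded into memory at once."""
    with tarfile.open(tar_path) as tf:
        tf.extractall(dest)
        return tf.getnames()
```

A driver would presumably do this step too, so the catalog can point at the extracted JSON files rather than the raw tar.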