titipata / yelp_dataset_challenge

Play around with Yelp dataset in Python (in progress and very messy repo)
http://www.yelp.com/dataset_challenge
19 stars 6 forks source link
pandas yelp-dataset

Yelp Dataset Challenge for Python

Repository for reading and downloading Yelp Dataset Challenge round 6 in Pandas pickle format. This repository makes it easy for anyone who want to mess around with Yelp data using Python. I provide yelp_util Python package that has read and download function.

Datasets repository

The following is structure of S3,

science-of-science-bucket
└─yelp_academic_dataset
  ├───yelp_academic_dataset_business.pickle (61k rows)
  ├───yelp_academic_dataset_review.pickle (1.5M rows)
  ├───yelp_academic_dataset_user.pickle (366k rows)
  ├───yelp_academic_dataset_checkin.pickle (45k rows)
  └───yelp_academic_dataset_tip.pickle (495k rows)

You can download data directly from AWS S3 repository as follows,

import yelp_util
yelp_util.download(file_list=["yelp_academic_dataset_business.pickle",
                              "yelp_academic_dataset_review.pickle",
                              "yelp_academic_dataset_user.pickle",
                              "yelp_academic_dataset_checkin.pickle",
                              "yelp_academic_dataset_tip.pickle"])

The file will be downloaded to data folder. After finishing download, you can simply read pickle as follows

import pandas as pd
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
review.head()

Structure of Datasets

User table of user's information (366k rows)

average_stars compliments elite fans friends name review_count type user_id votes yelping_since

Business table of business with its location and city that it locates (61k rows)

attributes business_id categories city full_address hours latitude longitude name neighborhoods open review_count stars state type

Review reviews made by users (1.5M rows)

business_id date review_id stars text type user_id type votes_cool votes_funny votes_useful

Checkin check-in table (45k rows)

business_id checkin_info type

Tip tip table (495k rows)

business_id date likes text type user_id

Cluster businesses according to how they are tagged

Read the business data

from sklearn.cluster import KMeans

business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')
tags = business.categories.tolist()

then transform tags to matrix count

tag_countmatrix = yelp_util.taglist_to_matrix(tags)

This can be used to cluster businesses

from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business['cluster'] = km.predict(tag_countmatrix)

Train word2vec model

review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
yelp_review_sample = list(review.text.iloc[10000:20000])
model = yelp_util.create_word2vec_model(yelp_review_sample) # word2vec model

Django runserver

All django project is in random_reviews folder. Get started by running python manage.py migrate. Then for local computer (main aim is to custom css files) run Django project by using python manage.py runserver

Dependencies

Members