
Austin BCycle Analysis

This GitHub repo contains all the code and data used in the 3-part BCycle analysis posted on Medium. To read the posts, click on the links below.

Directory hierarchy

Environment Setup

I'm using the Anaconda Python distribution. It's a great all-in-one distribution that sets up all the Python packages, Jupyter notebooks, and command-line tools you need.

If you're using Anaconda, I also saved out an environment file, which lists all the packages you need to run the notebooks. To create the environment, run the command below.

$ conda env create -f bcycle_env.yml
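
Once the environment is created, activate it before running anything. This assumes the environment in bcycle_env.yml is named bcycle (check the name field in the file if not); older conda versions use source activate instead:

$ conda activate bcycle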

Quickstart Guide

The CSV files in the input directory have been checked into git, so once you clone the repo you can extract them, run the notebooks, and start your own analysis.

To uncompress the CSV files, use the following commands:

$ cd input
$ unzip '*.zip'
Archive:  bikes.csv.zip
  inflating: bikes.csv               

Archive:  stations.csv.zip
  inflating: stations.csv            

2 archives were successfully processed.
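
If you don't have the unzip utility available (for example on Windows), the same extraction can be done with Python's standard-library zipfile module. A minimal sketch, run from the repo root:

import zipfile
from pathlib import Path

# Extract each .zip archive in input/ next to itself
for archive in Path('input').glob('*.zip'):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)
    print('Extracted', archive.name)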

The contents of the input directory should now be:

$ ls -l
total 49896
-rw-r--r--  1 tim  staff    22M Oct 20 20:32 bikes.csv
-rw-r--r--  1 tim  staff   2.5M Nov 12 17:39 bikes.csv.zip
-rw-r--r--  1 tim  staff   5.2K Oct 20 20:32 stations.csv
-rw-r--r--  1 tim  staff   1.8K Nov 12 17:39 stations.csv.zip

Now that the CSV files are ready, open up any of the notebooks in the notebooks subdirectory and you should be good to go!
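
To launch Jupyter, run jupyter notebook from the repo root and open a notebook from the notebooks subdirectory. Inside a notebook, loading the data comes down to one pandas read_csv per file. A minimal sketch (the relative paths assume the notebook lives in notebooks, as the originals do):

import pandas as pd

# Paths are relative to the notebooks subdirectory
bikes = pd.read_csv('../input/bikes.csv')
stations = pd.read_csv('../input/stations.csv')

print(bikes.shape, stations.shape)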

Full Guide

If you'd like to run all the steps of the data pipeline, follow the steps below.

Downloading raw HTML

To download the raw HTML, run the shell script below. It downloads the zipped tarball from my Dropbox area and unzips it into an html subdirectory.

$ cd data
$ ./get_data.sh

You should see the file download, and then the HTML files unzip into the html subdirectory.
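
For reference, the script boils down to a download-and-extract step. A minimal sketch of the idea; the URL below is a placeholder, not the real link, which lives in get_data.sh:

$ curl -L -o html_data.tar.gz '<dropbox-url-from-get_data.sh>'
$ mkdir -p html
$ tar -xzf html_data.tar.gz -C html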

Converting HTML to CSV files

This step uses the clean_data.py script in the scripts subdirectory to process the HTML files. It reads the HTML files from data and writes the CSV files to input.

To do the conversion, follow the instructions below. The script uses the tqdm package to show a progress bar as the HTML files are converted.

$ cd scripts
$ python clean_data.py 
 20%|███████▌                             | 3562/17504 [00:14<00:53, 260.78it/s]

Once this completes, all the CSV files will be ready to go in the input directory.
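
For reference, the progress bar comes from the standard tqdm pattern: wrap the iterable of files in tqdm() and the bar updates itself on each iteration. A minimal sketch of the shape, not the actual clean_data.py logic:

from pathlib import Path
from tqdm import tqdm

html_files = sorted(Path('../data/html').glob('*.html'))

# tqdm wraps any iterable and renders a progress bar like the one above
for html_file in tqdm(html_files):
    text = html_file.read_text()
    # ... parse `text` and accumulate station/bike rows here (stand-in) ...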

Notebooks

The notebooks are split into those that use the two months of data I scraped, and those that use the full 3-year dataset from Austin BCycle. The ones using the full dataset have bcycle_all_data in their title. The others run with the CSV files supplied in zip format in the repo.

April/May 2016 notebooks

Please make sure you unzip the zip files in the input directory before running these notebooks.

Full data notebooks

The data needed for these notebooks has to be requested from Austin BCycle. You can still browse them and see the saved results, though.