jamesward / koober


Generate Demo Data Set for Rides #3

Closed jamesward closed 7 years ago

jamesward commented 7 years ago

A ride consists of:

In order to do #1, we need a demo data set we can feed into PredictionIO. The data set can't be totally random; otherwise our predictions might appear random. So maybe there is public taxi data we can use.

anniexcheng commented 7 years ago

This NYC Cab data set looks promising: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

jamesward commented 7 years ago

Cool. And I realized we don't really need the driver data for predicting demand, just the "request" or pickup data.
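A rough sketch of what keeping only the pickup ("request") data could look like. The column names `tpep_pickup_datetime`, `pickup_longitude`, and `pickup_latitude` are an assumption based on the 2015 yellow cab CSV headers, and the `pickup_requests` helper and its output keys are made up for illustration; nothing here is the project's actual import code.

```python
import csv
import io


def pickup_requests(src):
    """Yield one dict per trip, keeping only the pickup time and location.

    `src` is any text file object containing CSV with a header row.
    Column names are assumed, not confirmed by this thread.
    """
    for row in csv.DictReader(src):
        yield {
            "time": row["tpep_pickup_datetime"],
            "lng": float(row["pickup_longitude"]),
            "lat": float(row["pickup_latitude"]),
        }


# Tiny in-memory example with made-up rows (not real trip data).
demo_csv = io.StringIO(
    "tpep_pickup_datetime,tpep_dropoff_datetime,pickup_longitude,pickup_latitude\n"
    "2015-01-01 00:05:00,2015-01-01 00:20:00,-73.98,40.75\n"
    "2015-01-01 00:07:30,2015-01-01 00:15:10,-73.99,40.73\n"
)
requests = list(pickup_requests(demo_csv))
print(requests[0])
```

Dropoff and driver fields are simply ignored, so the same function should work on any file that has at least those three columns.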

adamwhat commented 7 years ago

I'm making a sample of the 2015 NYC Cab data set, as the original data set is way too large for development. We can use the full data set to train the model once the whole pipeline is complete.

adamwhat commented 7 years ago

Sample Data Download

The sample is a 0.1% random sample, with 1388068 entries, from the 2015 yellow cab dataset.
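For anyone who wants to draw a different sample, here is a minimal sketch of one way to do it: stream the CSV and keep each row with a fixed probability, so the full file never has to fit in memory. This is not necessarily how the sample above was produced; the `sample_csv` helper, the seed, and the demo rows are all hypothetical.

```python
import csv
import io
import random


def sample_csv(src, dst, fraction, seed=0):
    """Copy the header plus roughly `fraction` of the data rows from src to dst.

    Each row is kept independently with probability `fraction`, so the
    sample size is approximate rather than exact. Returns the number of
    data rows written.
    """
    rng = random.Random(seed)
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # always keep the header row
    kept = 0
    for row in reader:
        if rng.random() < fraction:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory demo with made-up rows; real use would pass open file handles.
lines = ["tpep_pickup_datetime,pickup_longitude,pickup_latitude"]
lines += [f"2015-01-01 00:00:{i % 60:02d},-73.98,40.75" for i in range(10000)]
src = io.StringIO("\n".join(lines) + "\n")
dst = io.StringIO()
kept = sample_csv(src, dst, fraction=0.01, seed=42)
print(kept, "rows kept out of 10000")
```

With real files, `src` and `dst` would be opened with `open(path, newline="")`, and `fraction=0.001` would give a 0.1% sample.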

Here is the HDF5 version of the same data. You can use deepdish to load the data as a pandas DataFrame.

jamesward commented 7 years ago

Fixed by #4

adamwhat commented 7 years ago

https://s3-us-west-2.amazonaws.com/4740/yellow_tripdata_2015_further_sample.csv.zip

A sample of 10000 lines of the 2015 taxi data.