Citibike Project: Can Your Destination Be Predicted?


I think sometimes the most interesting projects live behind ideas that sound impractical or even crazy. That's why I thought it would be fun to use the Citibike bike share trip data to try and predict a person's destination based on what we know.

Roughly speaking, trip data looks like this

The data is fairly clean and regular, so I thought this was a fun data set to cut my teeth on.

Quick Bird's-Eye View of My Journey

More on this data

df = load_data('data/201509_10-citibike-tripdata.csv.annotated.100000.06112016T1814.csv')

Speed and Age

Turns out that to turn coordinates into distances, you need to know the miles per degree of longitude at a particular latitude on our planet. So for our particular location, at a latitude of around 40.723 and using the earth's radius of about 3958 miles, we have about 52.3 miles per degree of longitude here in NYC.
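The arithmetic behind that figure can be sketched as follows. This is a minimal sketch, not the project's own code: the function name is made up for illustration, and it assumes a spherical earth, where one degree of longitude spans (pi / 180) * R * cos(latitude) miles.

```python
import math

EARTH_RADIUS_MILES = 3958.0  # approximate radius quoted in the text

def miles_per_longitude_degree(latitude_deg: float) -> float:
    """Length of one degree of longitude, in miles, at the given latitude.

    Hypothetical helper for illustration; assumes a spherical earth.
    """
    # One degree of arc on a great circle is (pi / 180) * R miles.
    # Circles of constant latitude are smaller by a factor of cos(latitude).
    miles_per_degree_at_equator = math.radians(1.0) * EARTH_RADIUS_MILES
    return miles_per_degree_at_equator * math.cos(math.radians(latitude_deg))

# Latitude used in the text for NYC.
nyc_miles_per_lon_degree = miles_per_longitude_degree(40.723)
print(f"{nyc_miles_per_lon_degree:.2f} miles per degree of longitude")
```

The result lands within a few hundredths of a mile of the roughly 52.3 figure above; the exact last digit depends on the radius and latitude you plug in.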

So from there, looking at some of the speed data just involved taking the trip time from the trip data and calculating the Cartesian distance.

<img src="" width="435" height="307" >
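A minimal sketch of that speed calculation, under a few assumptions that are not spelled out above: the distance is taken between the start and end station coordinates, the trip time is in seconds, and the coordinates in the example are made up. The function name is hypothetical.

```python
import math

# One degree of latitude is (pi / 180) * R miles everywhere on a sphere.
MILES_PER_LAT_DEGREE = math.radians(1.0) * 3958.0
# One degree of longitude near NYC, the value quoted in the text.
MILES_PER_LON_DEGREE = 52.3

def trip_speed_mph(start_lat, start_lon, end_lat, end_lon, duration_seconds):
    """Straight-line (Cartesian) speed in miles per hour for one trip."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Convert coordinate differences to miles, then apply Pythagoras.
    dx_miles = (end_lon - start_lon) * MILES_PER_LON_DEGREE
    dy_miles = (end_lat - start_lat) * MILES_PER_LAT_DEGREE
    distance_miles = math.hypot(dx_miles, dy_miles)
    return distance_miles / (duration_seconds / 3600.0)

# Made-up trip: 0.01 degrees north and 0.01 degrees east in 10 minutes.
example_speed = trip_speed_mph(40.7230, -73.9900, 40.7330, -73.9800, 600)
print(f"{example_speed:.1f} mph")
```

Because riders do not travel in straight lines, a speed computed this way is a lower bound on how fast someone actually rode.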

(More on the code here) (Also more detail on this analysis in the main Jupyter notebook)

Need additional location data

The meat of the output can look like this, for the docking station "1st Avenue & E 15th St"

<img src="" width="457" height="328">

Time bucketing

In order to get better information from the start time, I bucketed it into 24 one-hour buckets per day. That is because a ride starting at 1:04:23pm shouldn't be treated as being too different from a ride departing at 1:05:24pm. There is more value in intuitively clustering the rides.
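A minimal sketch of hour bucketing, using only the standard library. The timestamp format and the function name are assumptions for illustration; the real trip data may format its start times differently.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to match the actual trip data.
ASSUMED_FORMAT = "%m/%d/%Y %H:%M:%S"

def hour_bucket(start_time: str, fmt: str = ASSUMED_FORMAT) -> int:
    """Map a start timestamp to one of 24 buckets (0 through 23)."""
    return datetime.strptime(start_time, fmt).hour

# The two rides from the text, a minute apart, land in the same bucket.
first_bucket = hour_bucket("09/01/2015 13:04:23")
second_bucket = hour_bucket("09/01/2015 13:05:24")
print(first_bucket, second_bucket)
```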

Comparing Geolocation Granularities

There are about 463 stations in the dataset, spread across 28 neighborhoods, 49 postal codes and 3 out of 5 boroughs.

So using (start time bucket, start station id, age, gender) as the inputs and RandomizedLogisticRegression as a classifier, for about a month's worth of trip data, I saw roughly the following comparison.

Deeper into the weeds

I compared the SGDClassifier with the LogisticRegression classifier. As I understand it, LogisticRegression fits its model with a batch solver over the whole training set, while SGDClassifier fits a linear model with Stochastic Gradient Descent and only behaves as a Logistic Regression classifier when its loss is set to the logistic loss.

I also tried applying Standard Scaling to my input data after reading that scikit-learn's SGDClassifier implementation is sensitive to feature scaling unless the input data has mean=0 and variance=1. Indeed, per the results below, this helped a little.
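To make "mean=0 and variance=1" concrete, here is a minimal sketch of standard scaling for a single feature in plain Python. It is not the scikit-learn call itself, just the same arithmetic (subtract the mean, divide by the population standard deviation); the ages are made-up values.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Rescale one feature column to mean 0 and variance 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Made-up rider ages, for illustration only.
ages = [22, 35, 41, 58, 64]
scaled_ages = standard_scale(ages)
print([round(v, 2) for v in scaled_ages])
```

In a real pipeline the mean and standard deviation should be computed on the training data only and then reused to scale the test data.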

<img src="" width="637" height="302">

I also applied a GridSearch around the alpha parameter of the SGDClassifier, but this did not help, at least the way I tried it.

<img src="" width="690" height="450">

I next started varying the training set size. Given that a month's worth of trip data had about a million rows, I went from 10k to 1M.

<img src="" width="678" height="235">

But this didn't look great. I realized a problem I had was that I was not randomly sampling my input data. Since a month-size dataset is around 1.2 million rows ordered by time, a 10,000-row set taken from the top barely dips into the first day. So the smaller datasets have to be chosen by random sampling.
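The difference between taking the first rows and sampling at random can be sketched like this. The data here is synthetic: a list of day-of-month values standing in for 1.2 million time-ordered trips, with the simplifying assumption that every day has the same number of rides.

```python
import random

# Synthetic stand-in for a month of time-ordered trips:
# 1.2 million rows, 40,000 per day, so days run from 1 to 30.
trip_days = [1 + i // 40_000 for i in range(1_200_000)]

def head_rows(rows, n):
    """What I was doing at first: take the first n rows."""
    return rows[:n]

def sample_rows(rows, n, seed=42):
    """Take n rows uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, n)

head_days = set(head_rows(trip_days, 10_000))
sampled_days = set(sample_rows(trip_days, 10_000))
print(len(head_days), "day(s) covered by the first 10,000 rows")
print(len(sampled_days), "day(s) covered by a random 10,000 rows")
```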

After making the sampling randomized, the output below feels like it has a better upward trend, but the trend is still not pronounced enough.

<img src="" width="682" height="235">

I also realized I was being inconsistent in my assessment because I was not using a single holdout set to test. I was actually randomly generating a test set each time. That was really bad. So I created ten models on dataset sizes from 10,000 through 100,000, sampled from September and October 2015, and tested them all on a single holdout dataset taken from November 2015. In this approach, the accuracy results are found using the same holdout set instead of a differently derived test set each time.
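The shape of that experiment can be sketched as below. Everything here is a stand-in: the trips are synthetic, the "model" is a toy most-frequent-destination lookup rather than the classifiers I actually used, and the sizes are smaller. The point is only that the holdout is built once and reused for every training size.

```python
import random
from collections import Counter, defaultdict

def fit_baseline(trips):
    """Toy stand-in model: most frequent destination per start."""
    counts = defaultdict(Counter)
    for start, dest in trips:
        counts[start][dest] += 1
    return {start: c.most_common(1)[0][0] for start, c in counts.items()}

def accuracy(model, trips):
    """Fraction of trips whose destination the model predicts exactly."""
    hits = sum(1 for start, dest in trips if model.get(start) == dest)
    return hits / len(trips)

def make_toy_trips(n, rng):
    """Synthetic (start, destination) pairs over 5 made-up areas."""
    trips = []
    for _ in range(n):
        start = rng.randrange(5)
        # 70% of rides go to the "next" area, the rest anywhere.
        dest = (start + 1) % 5 if rng.random() < 0.7 else rng.randrange(5)
        trips.append((start, dest))
    return trips

rng = random.Random(0)
train_pool = make_toy_trips(20_000, rng)  # stands in for Sept and Oct
holdout = make_toy_trips(5_000, rng)      # stands in for Nov, built once

results = {}
for size in (1_000, 5_000, 10_000):
    model = fit_baseline(rng.sample(train_pool, size))
    results[size] = accuracy(model, holdout)  # same holdout every time
print(results)
```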

<img src="" width="673" height="232"> Although the results were still pretty flat, at least I can trust the consistency of my test method more now.

More in the jupyter notebook

Binarizing the inputs

<img src="" width="445" height="310">

More details in the notebook

Changing my metric one more time

<img src="" width="415" height="67">

The overall reasoning I had here is two-fold. One, I think of the analogy of a search engine, where it is typically acceptable to show someone five results as opposed to the single so-called "I'm Feeling Lucky" result. Of course not every machine learning use case will have the tolerance to take five results as opposed to one, but I think for my particular problem of choice it might be fine. Or at least asking people would help to answer that question.

But I think the main reason I wanted to do this was to better understand whether my classification approach was doing anything at all. Out of these 28 or so neighborhoods, if going from the top 1 result to the top 2 results yields an additional 20 points of accuracy, then I feel a little better about the result making sense.
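A minimal sketch of this "top k" metric: a prediction counts as correct if the true destination is among the k classes with the highest predicted probability. The function name, the neighborhood names and the probabilities are all made up for illustration.

```python
def top_k_accuracy(probabilities, true_labels, class_names, k):
    """Fraction of rows whose true label is among the k most likely classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank classes from most to least likely for this row.
        ranked = sorted(zip(probs, class_names), reverse=True)
        top_k = [name for _, name in ranked[:k]]
        hits += truth in top_k
    return hits / len(true_labels)

# Made-up classes and predicted probabilities for four rides.
classes = ["Chelsea", "East Village", "Midtown"]
predicted = [
    [0.50, 0.30, 0.20],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
    [0.10, 0.30, 0.60],
]
actual = ["Chelsea", "Midtown", "East Village", "Midtown"]

top_1 = top_k_accuracy(predicted, actual, classes, k=1)
top_2 = top_k_accuracy(predicted, actual, classes, k=2)
print(top_1, top_2)
```

With k equal to the number of classes the metric is always 1.0, so the interesting part is how quickly it climbs for small k.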

<img src="" width="334" height="239">

More detail in the notebook

SageMaker approach

I wanted to test drive AWS SageMaker.

What ended up happening

Changes to Model Iterations

To make things slightly easier to understand, especially for debugging purposes, I now turn models into JSON objects that are easier to display.
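As a rough illustration of the idea, a model run can be summarized as a small JSON document. The field names and values below are hypothetical, not the actual schema used in the project.

```python
import json

def describe_model(name, params, training):
    """Return a pretty-printed JSON summary of one model run.

    Hypothetical schema, for illustration only.
    """
    summary = {"model": name, "params": params, "training": training}
    # sort_keys keeps the output stable, which makes diffs easier to read.
    return json.dumps(summary, indent=2, sort_keys=True)

model_json = describe_model(
    "SGDClassifier",
    {"alpha": 0.0001, "standard_scaled": True},
    {"rows": 10_000, "sampling": "random"},
)
print(model_json)
```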

Bad data strikes again

Future Improvements