Pre-processing for this NYC traffic dataset: https://databank.illinois.edu/datasets/IDB-9610843 See also: https://lab-work.github.io/data/
Processed data format targets the Spatial-Temporal Dynamic Network (STDN).
(See: https://github.com/tangxianfeng/STDN and https://arxiv.org/abs/1803.01254)
Chances are, unless you are replicating the (yet unpublished) paper this is for, this is not the code you're looking for! This is not a general-purpose Python library.
This is a command-line Python3 program used to preprocess data from the aforementioned dataset. It splits the island of Manhattan into a w*h grid and discretizes information about taxi trips into n time slots per hour.
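As a rough illustration, the discretization might look like the sketch below. The corner coordinates and the helper names (to_cell, to_slot) are invented for illustration; the real bounding box and logic live in this repository's code.

>>> W, H, N = 10, 20, 4
>>> LON_MIN, LON_MAX = -74.03, -73.92   # assumed west-east extent of the grid
>>> LAT_MIN, LAT_MAX = 40.70, 40.88     # assumed south-north extent
>>> def to_cell(lon, lat):
...     """Map a GPS point to a grid cell, or None if it falls outside."""
...     x = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * W)
...     y = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * H)
...     return (x, y) if 0 <= x < W and 0 <= y < H else None
...
>>> def to_slot(dt, n=N):
...     """Map a datetime to its time-slot index within the month."""
...     return ((dt.day - 1) * 24 + dt.hour) * n + (dt.minute * n) // 60
...
>>> to_cell(-74.00, 40.75)
(2, 5)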
Warning 1: With the default parameters, this code saves ~50GB of data (~1GB per array; ~15MB per array compressed). The 'flow' array takes the most space, roughly w^2 * h^2 * n * 5.7 KB per month. By changing the parameters from the defaults (w=10, h=20, n=4) to w=5, h=10, n=2, the total space required drops to ~2GB.
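As a sanity check on those numbers (assuming 2-byte integer entries, which is an assumption about the dtype):

>>> w, h, n, days = 10, 20, 4, 31
>>> T = days * 24 * n                        # 2976 time slots in the month
>>> fdata_bytes = 2 * T * w*h * w*h * 2 * 2  # shape (2, T, w, h, w, h, 2), 2 bytes each
>>> fdata_bytes / 1e9                        # ~0.95 GB, matching ~1GB per array
0.95232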
Warning 2: Because of the large sizes of the files, data is processed per-month. Some trips start in one month and end in another (e.g. February 28th 2011 to March 1st 2011). This means, if you are starting or restarting data processing (e.g. in April 2013), you need to set the start month to the previous month (e.g. March 2013) and run with the --restart flag.
Warning 3: Because there are many errors in the data, some entries are discarded. See utils.check_valid() for the rules for discarding entries. Entries are discarded if their start times are erroneous or if their trip straight-line (L2) distance and/or delta-t are nonsensical (too short or too fast).
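A hypothetical sketch of the kind of filter this implies (the actual rules and thresholds are in utils.check_valid(); the numbers below are invented for illustration):

>>> def check_valid(dist_km, dt_hours):
...     """Illustrative only: reject nonsensical distances and durations."""
...     if dt_hours <= 0 or dist_km < 0:
...         return False             # zero/negative duration or negative distance
...     return dist_km / dt_hours <= 120.0   # discard impossibly fast trips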
Warning 4: We sample with a grid of 10x20 with n=4 slots per hour, but we train the model on a grid size of 5x10 with n=2 slots per hour. Because the former are integer multiples of the latter, it is easy to downsample the *-data.npz files (see the sketch below). If we want the higher-resolution data, it is already processed and available.
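A minimal downsampling sketch, assuming the integer counts in vdata can simply be summed over 2x2 spatial blocks and pairs of adjacent time slots (fdata can be coarsened the same way along its spatial and time axes, though the split of its first axis between same-slot and cross-slot trips is not preserved exactly by this):

>>> import numpy as np
>>> vdata = np.load("2010-01-data.npz")['vdata']    # (2976, 10, 20, 2, 2)
>>> vlo = vdata.reshape(1488, 2, 5, 2, 10, 2, 2, 2).sum(axis=(1, 3, 5))
>>> vlo.shape                                       # now 5x10 cells, 2 slots/hour
(1488, 5, 10, 2, 2)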
This program loads in CSV files from ../decompressed/FOIL(year)/trip_data_(month).csv. (E.g. ../decompressed/FOIL2010/trip_data_1.csv)
Then, it saves the processed data to (year)-(month)-data.npz. (E.g. 2010-01-data.npz) The formats of vdata (volume data) and fdata (flow data) follow the structure used with the data provided for the STDN. This example from the Python interpreter shows how to load the data:
>>> import numpy as np; data = np.load("2010-01-data.npz")
>>> vdata = data['vdata']; vdata.shape
(2976, 10, 20, 2, 2)
>>> fdata = data['fdata']; fdata.shape
(2, 2976, 10, 20, 10, 20, 2)
>>> trips = data['trips']; trips.shape
(2, 2, 2)
>>> errors = data['errors']; errors.shape
(2,)
Above, the data has 4 time slots per hour (31 days * 24 hours * 4 = 2976 such slots in January), with a grid of size 10x20. Note: vdata and fdata both have an extra axis of size 2. This is to store the trip count and passenger count separately.
For a trip, the starting volume data is stored if the trip starts in Manhattan. The ending volume data is stored if the trip ends in Manhattan. This means a trip can start inside but end outside, or start outside and end inside, and still be counted in the volume data.
E.g. vdata[113, 2, 4, 1, 0] gives the total number of passengers across all taxis in NYC for trips ending during time slot 113 in grid location (2, 4).
For a trip, the flow data is stored only if it starts and ends within Manhattan.
E.g. fdata[1, 117, 2, 4, 3, 5, 1] gives the total number of trips from (2,4) to (3,5) that end in time slot 117 but start in an earlier time slot.
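For instance, to count all trips leaving one cell, the remaining axes can be summed (treating index 0 of the first axis as trips that start and end in the same slot is our reading of the layout, not something stated in the data itself):

>>> # All trips that started in cell (2, 4) and ended anywhere in Manhattan
>>> # during time slot 117, whichever slot they started in:
>>> fdata[:, 117, 2, 4, :, :, 1].sum()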
The 'trips' and 'errors' arrays record statistical information about the trips.
errors: Has one axis: index 0 is the number of invalid trips (per utils.check_valid()); index 1 is the number of unparsable trips.
trips: Has three axes, each of size 2.
E.g. the number of all trips that started in Manhattan = np.sum(trips[0,:,1]).
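A small sketch of pulling those statistics out of a processed month:

>>> data = np.load("2010-01-data.npz")
>>> trips, errors = data['trips'], data['errors']
>>> started_in_manhattan = np.sum(trips[0, :, 1])
>>> n_invalid, n_unparsable = errors    # per the 'errors' layout above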
We intend to merge the resulting data into two large fdata and vdata arrays, spanning Jan 2010 to Dec 2013, with w=5, h=10, n=2.
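A minimal sketch of that merge, assuming every month in the range has already been processed into the (year)-(month)-data.npz naming used above (vdata's time axis is axis 0; fdata's is axis 1):

>>> import numpy as np
>>> files = ["%d-%02d-data.npz" % (y, m)
...          for y in range(2010, 2014) for m in range(1, 13)]
>>> vdata = np.concatenate([np.load(f)['vdata'] for f in files], axis=0)
>>> fdata = np.concatenate([np.load(f)['fdata'] for f in files], axis=1)
>>> np.savez_compressed("2010-2013-data.npz", vdata=vdata, fdata=fdata)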
Run the code with the default settings:
python3.6 main.py -v
After a crash on May 2011, restart the script (per Warning 2, the start month is set back to April):
python3.6 main.py -v -sm 4 -sy 2011 --restart
Get just the data for 2012 (the start is set to the previous month, December 2011, per Warning 2):
python3.6 main.py -v -sm 12 -sy 2011 -ey 2012 -em 12 --restart
Get the data for the default 2010-2013 period, but save it on a 5x10 grid with only 2 slots per hour:
python3.6 main.py -v -x 5 -y 10 -n 2