swangcs / BusGPS

This project aims at the big data challenges for predicting bus arrival time using GPS datasets.

refactoring data pre-processing #9

Closed swangcs closed 4 years ago

swangcs commented 5 years ago

Data pre-processing for one day of data on bus line no. 15 has achieved good results so far. To improve the quality of your work before closing this part:

swangcs commented 5 years ago

I have uploaded my scripts for trip segmentations:

@Ruixinhua @MrTornado24 please feel free to test on other bus lines and other days, and let me know how it goes.

Ruixinhua commented 5 years ago

I tested on bus lines "46", "145" and "15", and found a small bug on line "15": when small_groups are detected, dDistance is not calculated for each group. I fixed this bug in commit 521bf05 and commit 6149f01, and all the code now works fine on my computer.

swangcs commented 5 years ago

@Ruixinhua Well spotted! Good job! Further on this related issue, in commit bfa0e8072de8d45284af18cb99f10ac4a0cb7f72 I reset the "dDistance" value of the first row to "0" when multiple trips occur in one group (split by time_threshhold). Before this reset, the "dDistance" of the first row kept its old value (usually large, since it spans a gap beyond the time threshold), which led to an inaccurate travelled distance calculation. This condition is then used for filtering.
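
A minimal sketch of the reset described above, not the code from the commit. It assumes each row is a dict with a hypothetical "time" field (seconds) alongside "dDistance", and that a new trip starts wherever the gap to the previous row exceeds the threshold; the function names and the `time_threshold` parameter are illustrative only.

```python
def split_and_reset(rows, time_threshold):
    """Split one group of rows into trips at large time gaps.

    The first row of every trip gets dDistance = 0.0, so the jump across
    the gap is not counted as distance travelled within the trip.
    "time" is a hypothetical field name; "dDistance" follows the thread.
    """
    trips = []
    current = []
    prev_time = None
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are left untouched
        if prev_time is None or row["time"] - prev_time > time_threshold:
            if current:
                trips.append(current)
            current = []
            row["dDistance"] = 0.0  # reset the stale value on the first row
        current.append(row)
        prev_time = row["time"]
    if current:
        trips.append(current)
    return trips


def travelled_distance(trip):
    """Sum the per-row distance increments of a single trip."""
    return sum(row["dDistance"] for row in trip)
```

Without the reset, the first row of the second trip would carry the distance accumulated across the gap, and any filter based on travelled distance would see an inflated value.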

swangcs commented 5 years ago

@Ruixinhua @MrTornado24 please briefly check the latest commit up to 7a62c52f782f043c296471fe9d7934f1fa111384

Ruixinhua commented 5 years ago

@swangcs When converting the GPS points to meters, I found that there are too many stopping points around the start location, which increases the cumulative time and will affect the prediction result. filter_gps.py should determine where the bus really starts moving and remove the stopping points around the start position.
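
One possible way to do this, as a rough sketch only and not what filter_gps.py currently does: treat every point within some radius of the first point as dwell time, and keep the track from the last such point onward. The field names "x", "y", "t" and the 50 m default radius are assumptions for illustration.

```python
import math


def trim_start_dwell(points, radius_m=50.0):
    """Drop the stationary points recorded around the start location.

    points: list of dicts with hypothetical keys "x", "y" (meters) and "t".
    Returns the track starting at the last point that is still within
    radius_m of the first point, i.e. the approximate departure.
    Returns [] if the bus never leaves the radius (a design choice here).
    """
    if not points:
        return []
    x0, y0 = points[0]["x"], points[0]["y"]
    for i, p in enumerate(points):
        if math.hypot(p["x"] - x0, p["y"] - y0) > radius_m:
            # Keep one point before the first "moving" point as departure.
            return points[max(i - 1, 0):]
    return []
```

The radius would need tuning per line, since GPS jitter at a terminal can easily reach tens of meters.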

swangcs commented 4 years ago

> @swangcs When converting the GPS points to meters, I found that there are too many stopping points around the start location, which increases the cumulative time and will affect the prediction result. filter_gps.py should determine where the bus really starts moving and remove the stopping points around the start position.

I tend to think this is normal. We only focus on arrival time at bus stops, so deciding the exact departure time is not important.

Overall, data preprocessing looks fine now. One last comment comes from @Ruixinhua 's experiment: the epoch time of departure alone cannot uniquely determine a bus trip for a given bus line.
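
To illustrate the point, here is a small sketch that flags departure times shared by more than one trip on the same line and builds a composite key instead. The thread does not say which extra field should disambiguate trips; "vehicle_id", "line" and "departure_epoch" are hypothetical field names chosen for the example.

```python
from collections import defaultdict


def ambiguous_departures(trips):
    """Return the (line, departure_epoch) pairs shared by several vehicles.

    trips: list of dicts with hypothetical keys "line", "vehicle_id"
    and "departure_epoch".
    """
    vehicles = defaultdict(set)
    for trip in trips:
        vehicles[(trip["line"], trip["departure_epoch"])].add(trip["vehicle_id"])
    return {key: sorted(ids) for key, ids in vehicles.items() if len(ids) > 1}


def trip_key(trip):
    """Composite key; assumes a vehicle starts one trip per timestamp."""
    return (trip["line"], trip["vehicle_id"], trip["departure_epoch"])
```

If `ambiguous_departures` returns anything for a day of data, the departure epoch cannot be used on its own as a trip identifier for that line.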

The branch is merged, so I am closing this issue.