prasadpawarr opened 5 years ago
Hi, I'm working on breaking up the big file by date and route_id.
Initial exploration:
Strategy:
Since it was an extremely large file (1 GB), we (Milind Thombre and Bhushan Pagare) used a data reduction strategy to build our prototype. The file would not even open on our desktops initially.
$ cut pmpml_gpslogs_19.1.19_4pm_to_27.1.19_6pm.csv --fields=1,6,15,17,9,10 -s -d, > c.xlsx
$ head --lines=1000 c.xlsx > d.xlsx
I have analysed the data, and we have the location of each bus every 30 seconds. My idea is that we can calculate the speed of the bus over each 30-second interval by computing the approximate distance between two readings using the great-circle distance approach. With avg speed = total distance of the trip / total time, we can divide the whole route into segments of 100-150 m and calculate the speed for every segment; then, comparing with the average speed, we can say whether that part of the route is choked or not.
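The great-circle idea above can be sketched like this (a minimal sketch; the function names are mine, not from the dataset, and the real pipeline would pull lat-longs and epoch timestamps from the csv columns):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat-long points, in metres."""
    R = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def speed_kmph(lat1, lon1, t1, lat2, lon2, t2):
    """Average speed between two readings; t1, t2 are epoch seconds."""
    dt = t2 - t1
    if dt <= 0:
        return 0.0
    return haversine_m(lat1, lon1, lat2, lon2) / dt * 3.6

# Two readings ~100 m apart, 30 s apart -> roughly 12 km/h.
s = speed_kmph(18.5000, 73.8000, 0, 18.5009, 73.8000, 30)
```

Comparing each segment's speed against the trip average would then flag the choked parts.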
Hi folks, I split the data route-wise, and separately, vehicle-wise.
Here's the code used : https://gist.github.com/answerquest/b6e21a7545f85c64a815dfcb43523d82
I used compression='gzip' in python/pandas pd.read_csv() to read the .tar.gz directly, and chunksize=1000000 to process 1 million rows at a time - that ensured the program didn't run out of memory.
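The chunked-read pattern looks roughly like this (a sketch with a tiny stand-in file and made-up column names; the real run uses the 1 GB log with chunksize=1000000 and the gist linked above):

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzipped stand-in for the real log file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "gpslogs.csv.gz")
with gzip.open(path, "wt") as f:
    f.write("route_id,lat,lon\n37,18.52,73.85\n4,18.50,73.80\n37,18.53,73.86\n")

# chunksize turns read_csv into an iterator of DataFrames, so memory use
# stays bounded no matter how big the file is.
counts = {}
for chunk in pd.read_csv(path, compression="gzip", chunksize=2):
    for route_id, grp in chunk.groupby("route_id"):
        counts[route_id] = counts.get(route_id, 0) + len(grp)
```

Each per-route group can be appended to its own output file inside the loop, which is essentially how the route-wise and vehicle-wise splits were produced.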
Compressed output:
7z a -mx=3 "gpslogs_route_vehicle_split.7z" "by_route/" "by_vehicle/"
Download : https://nikhilvj.co.in/files/pmpml/gpslogs_route_vehicle_split.7z (install 7-Zip to extract this)
I plotted one route (37, which originates near my home) on QGIS, and it was going all around town - it didn't make any sense. I also filtered by trip_id and date, but no dice. One guess is that a vehicle that first plies route 37 is then dispatched on another route, but the route_id reported by the device isn't changing.
So I decided to split the data, separately, by vehicle_id too. The resulting csv's look a bit more sensible on QGIS.
One observation in the per-vehicle data: there's still a lot of redundancy. Even though timestamps are incrementing by some seconds, the lat-long reported is exactly the same (down to many decimals, so there's a very low chance of these being independent readings).
But I'd still advise against directly de-duping this - we'll have to go about it smartly. If the location of a moving vehicle didn't update for, say, 3 minutes, then that too is data. That duration might indicate a stoppage at a stop, a traffic signal or a traffic jam (if there's not much movement since the last reading, the device may be programmed not to send another one). But it might also be a GPS device malfunction / under-performance. (This could be determined by seeing how far away the next recorded location is - is there a jump that can't be explained by physics?) It can yield important feedback about device performance that the transport authorities can zero in on and hold the vendor accountable for, so we shouldn't lose it.
So one way of going about this could be: in a contiguous set A to Z of consecutive timestamps with identical lat-longs, retain A and Z and drop the ones in the middle. Then Z-A (the timestamp column is in epoch format, so simple subtraction gives the interval in seconds) gives us that stoppage / device-inactivity data, and the full dataset is also free of redundancy. I'll leave this for more folks to crack for now.
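The A-to-Z idea can be sketched in pandas roughly as follows (column names here are my assumption, not the exact headers in the log file):

```python
import pandas as pd

# Toy stand-in: one vehicle, readings every 30 s, with a run of three
# identical lat-longs in the middle (a halt or an inactive device).
df = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 1, 1],
    "timestamp":  [0, 30, 60, 90, 120],   # epoch seconds
    "lat": [18.50, 18.51, 18.51, 18.51, 18.52],
    "lon": [73.80, 73.81, 73.81, 73.81, 73.82],
}).sort_values(["vehicle_id", "timestamp"])

# Label runs of consecutive identical lat-longs per vehicle.
key = df[["vehicle_id", "lat", "lon"]]
run = (key != key.shift()).any(axis=1).cumsum()

# Keep only the first (A) and last (Z) reading of each run.
first = ~run.duplicated(keep="first")
last = ~run.duplicated(keep="last")
deduped = df[first | last]

# Z - A per run: the stoppage / inactivity duration in seconds.
dwell = df.groupby(run)["timestamp"].agg(lambda s: s.iloc[-1] - s.iloc[0])
```

Here the middle reading of the three-long run is dropped, and the run's dwell time (60 s) is preserved as its own record.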
Nice, this is the correct approach. The 200 m grid was not the way to go.
Regards Milind
Great analysis!
Unless we study the data thoroughly, no useful report will come out of it. There's no point just hacking around without first understanding it. There are no duplicate records; all records will be required for animating the bus route dynamically.
Also, IMO we should use only FOSS software (preferably MIT-licensed) for plotting the map as well as the animation. The Python libraries are generally OK. I will look at the trackanimation and geopy libraries - does anyone know of others?
Regards Milind Thombre
.gpx is a universal standard for recording tracks with timestamps (also called GPS traces), and thanks to the interest among IT-sector folks in running, hiking and Ladakh trips (:wink:) there are plenty of solutions around for animating them. So that's a good format to convert the data to.
There are multiple trips in these per-vehicle and per-route datasets. Detecting them and removing the in-between halt times at depots would be important for getting discrete tracks of individual trips. Again, the halt times are data too, so a parallel exercise can be documenting them.
Note: splitting the data by date can cause problems for trips that cross midnight, which is why I didn't do that.
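Converting lat/lon/epoch rows to GPX can be done with only the standard library (a minimal sketch of the GPX 1.1 track structure; in practice a library like gpxpy would also work, and the `creator` string here is just a placeholder):

```python
from datetime import datetime, timezone

def rows_to_gpx(rows):
    """rows: iterable of (lat, lon, epoch_seconds) -> GPX 1.1 document string."""
    pts = []
    for lat, lon, ts in rows:
        # GPX times are ISO 8601 in UTC.
        t = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        pts.append(f'      <trkpt lat="{lat}" lon="{lon}"><time>{t}</time></trkpt>')
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<gpx version="1.1" creator="pmpml-split" '
        'xmlns="http://www.topografix.com/GPX/1/1">\n'
        "  <trk>\n    <trkseg>\n"
        + "\n".join(pts)
        + "\n    </trkseg>\n  </trk>\n</gpx>\n"
    )

gpx = rows_to_gpx([(18.52, 73.85, 1548844800), (18.53, 73.86, 1548844830)])
```

One `<trkseg>` per detected trip (splitting at long depot halts) would give the discrete tracks mentioned above.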
Subproblem:
This is streaming data, so we also need to look at a sliding-window protocol (time window) so that old GPS data can be archived offline and only the last 50 reads are used for the animation.
Subproblem:
There's no point in having a desktop-only application, as citizens will only appreciate what they SEE. Anyone here with Android app dev skills (Flutter skills)? A citizen waiting at a bus stop should be able to enter a bus route/number, and the app should show all buses inbound to that stop on the map, as well as the ETA (estimated time of arrival). This will be hugely appreciated by citizens who use the bus service. I don't think it's terribly difficult to do either, once the data is ready for consumption.
Do let me know what you think.
For now, I'm going to continue exploring trackanimation and geopy.
Cheers Milind
Hi @thombrem , it would be good to start a thread elsewhere to discuss live / streaming data. As of now, PMPML isn't giving out that data (especially if there's even a remote chance of problems in it); what we have is archived, historical data for analysis in the hackathon, obtained through a special arrangement.
Live data is a huge subject in itself, and it wouldn't do it justice to take it forward here, where we want to work on archived data. If you could start an issue at the QnA repo and link back here, that would be great. Or feel free to start one at your own repo too. Over there I'll share more details about the exact format etc. of the streaming data.
OK yes, in the context of the hackathon you're right, Nikhil.
I was going to look at the streaming-data issue after the GPS tracking was fully resolved using open-source libraries, but your point is taken.
Thx Milind
Got trackanimation working late last night. I need a Maps API key, as Google Maps is no longer free. Can anyone help?
I have my own cloud account too, but I don't want to put my key in there for obvious reasons.
Does PMC have a Google Maps API key that they can share?
Milind
@thombrem I don't think we can count on something like that ;) I'll encourage folks to move to open-source libraries like Leaflet.js, OpenLayers and Mapbox for web mapping. There's functionally no difference, so why bother with API keys, external dependencies and all, especially if the data is with us and we just want to show it on a map? An OpenStreetMap background, or its derivative flavours from Mapbox etc., is free and just as good; even satellite view is available without charge from sources like ESRI and Bing. I work in Leaflet personally; I might be able to help you migrate if you share the code. Have you converted the data to another form, or are you using the same csv as input? (That's important!)
(btw see this FAQ point: https://github.com/opendatapune/opendatapune.github.io/blob/master/FAQs.md#i-want-to-retain-copyright-over-the-code-i-make ).
Nikhil, trackanimation is open source (Apache 2.0 license), but with a Google Maps dependency for rendering only. Perhaps they chose Google Maps when it was free, but Google has now made it paid beyond a nominal limit.
I'll check in the Python code sometime today after cleanup, optimizations and documentation. I'll also explore the libraries you mentioned. Perhaps there's also a way to point trackanimation at one of these rendering tools, like OpenStreetMap.
I used the csv file as-is on the first pass.
Regards Milind
I found a Python namesake when searching for it on the web, and this one uses Leaflet - wow, somebody should do this!
I have done some detailed work on this dataset. My aim was to create some form of relative speed metric. Here is what I found and was able to do:
1) Removed the columns that are redundant or only give info about the collection process.
2) Removed records with lat-longs that fall outside Pune.
3) The trip_id is connected to the trips file in GTFS. But many times the trip id does not get updated when the vehicle turns back after reaching the destination of its first trip and starts another trip. Therefore I sorted the dataset by the combination of vehicle id and time, and de-duplicated it where two successive readings have the same lat-longs. The method I used helps in removing intermediate readings, and just reduces the computed vehicle speed if the vehicle only moved after 5-6 minutes.
4) This way we get the time lag and distance between successive readings. (I used the root of the squared differences of lat-long as a crude approximation of distance.)
5) Converted the time difference to seconds and scaled the crude distance to bring it to a presentable number.
6) Dropped the observations where the vehicle did not report many readings in a day, and also those where the time difference was more than 5 minutes. This left me with about 2M readings that had distance and time-diff values.
7) Took this to Power BI and used the Mapbox visual to show the lat-longs on a map. Since there were too many points, I rounded off their lat-longs and averaged the speed for each mean lat-long pair. The visual gives results that are OK at first inspection: speeds are lower in the city centre and higher as we move away, and there is congestion at intersections. The visual is not very customizable for showing speed in absolute colour (it cannot colour based on values, only relative colour, so intersections stay red even when they are unblocked at night), so I will try using Folium inside Power BI. 0_logs_cleaning.txt
Here is the link to the visual; I will be doing some more customizations: https://app.powerbi.com/view?r=eyJrIjoiNzlmZDcyYzgtNmMzMi00NWY0LThmNWItMDBmNDEzNTIyMTAzIiwidCI6ImU5NWNiOWQ3LTk4N2YtNGZmMy1iNjliLWVkOGIxOGNmMDBmOSJ9 Let me know if you tried an easier approach or can spot any errors in the logic and calculations. Thanks.
https://github.com/opendatapune/Problem-Statements/wiki/PMPML-GPS-Logs-of-8-days