ropensci / unconf16

rOpenSci's San Francisco hackathon/unconf 2016
http://unconf16.ropensci.org

R package (gtfsr) to make working with GTFS (transit data) feeds easy #17

Open eamcvey opened 8 years ago

eamcvey commented 8 years ago

GTFS is a standard format for transit data (routes, stops, schedules, etc.): https://developers.google.com/transit/gtfs/ (There is also a real-time version of GTFS, which I'm considering out of scope for now.) If it's easy to work with GTFS data in R, that will make it easier to build more sophisticated analysis tools for transit systems on top of this package.

Get feed data into R:

Validate feed and assess data quality (which is often poor):

Provide convenience functions for common tasks:

Facilitate creation of a GTFS feed from within R:

eamcvey commented 8 years ago

The ability to compare two versions of an agency's GTFS feed and be shown the differences could be useful, i.e. to see what changes a transit agency made.
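A minimal sketch of what such a comparison could look like, assuming both feed versions have already been read into data frames. `compare_stops` is a hypothetical helper written for illustration, not a function from any existing package, and it only looks at `stop_id` values in stops.txt:

```r
# Hypothetical helper (not part of any existing package): report stop_ids
# that were added to or removed from stops.txt between two feed versions.
compare_stops <- function(old_stops, new_stops) {
  list(
    added   = setdiff(new_stops$stop_id, old_stops$stop_id),
    removed = setdiff(old_stops$stop_id, new_stops$stop_id)
  )
}

# Toy data standing in for two versions of stops.txt
old_stops <- data.frame(stop_id = c("A", "B", "C"), stringsAsFactors = FALSE)
new_stops <- data.frame(stop_id = c("B", "C", "D"), stringsAsFactors = FALSE)

stop_changes <- compare_stops(old_stops, new_stops)
print(stop_changes)
```

The same set-difference idea could be applied to `route_id` in routes.txt or `trip_id` in trips.txt; detecting changed attributes for ids present in both versions would need a join instead.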

hansthompson commented 8 years ago

I've been working at the intersection of R and GTFS for a bit. I'm glad I came across this! I would be happy to try and push to any part of this project or lead it myself if no one has time.

In Anchorage, AK, we run a script continuously during the working hours of our bus line, People Mover, to generate the protocol buffer that is needed for the GTFS-RT feed.

https://github.com/codeforanchorage/api-realtime-bus

We run build_protobuf.R in the R directory continuously. I'm sure there are many inelegant things in how I wrote it, but I wanted to include it as an example of how we generate the GTFS-RT feed using Dirk E's RProtoBuf package. We are using a feed from the existing vendor service that calculates the delays by stop for us, which takes out some of the brainy part of our project.

Thanks for exposing transitfeeds.com to me. Before I was looking at the GTFS Exchange which is another resource.

I've made a little script to process stops.txt and shapes.txt into sp objects before pushing them to PostGIS, which I think is the best platform for this. Open to anything though. https://gist.github.com/hansthompson/a3d2c710ac8e3584d58. The bits inside this gist that convert them to shapefiles using writeOGR could be useful though.

If PostGIS seems like a good way forward, I would need some help addressing the three concerns I see for this kind of conversion with GTFS, which I list at the top of the gist.

  1. It would be great to use some service to find the best state plane projection (or other) for accurate metric measure of distance instead of WGS84.
  2. I'm not sure what kind of time data type could be used for stop_times.txt that would account for the time of day while also allowing times that run around the clock past 24 hours. (PLEASE SOMEONE WHO KNOWS POSTGRES HELP!)
  3. I'm not yet so good at Postgres admin stuff, so how could this be created temporarily without admin privileges?
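For the second concern, one possible approach is to avoid time-of-day types altogether and store each time as a count of seconds since the start of the service day, since GTFS allows values such as 25:30:00 for trips that run past midnight. A minimal sketch, assuming the times arrive as HH:MM:SS strings; `gtfs_time_to_seconds` is a hypothetical name, not a function from an existing package:

```r
# Hypothetical helper: convert GTFS "HH:MM:SS" strings to seconds after the
# start of the service day. Hours may exceed 23 for trips past midnight.
gtfs_time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly three fields (or is missing) becomes NA
    if (length(p) != 3 || anyNA(p)) return(NA_real_)
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

secs <- gtfs_time_to_seconds(c("08:15:00", "25:30:00"))
print(secs)
```

Values in this form sort and subtract correctly across midnight and fit in an ordinary numeric or integer column.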

On the note of checking errors within the GTFS feed, the Google dash is pretty excellent if you want to throw a GTFS feed against it in testing mode. It would be nice to try it outside the Google platform though. I would like a mapping function using leaflet for the testing that could show the routes and the expected positions of buses at a specified time during the day.

I'm also interested in getting network analysis involved to show the network (maybe in a given time window?), and also in showing the network analysis parameters spatially once it's done. Here's a pretty rough idea: http://akdata.org/misc/gtfs_network.html I'm taking a course on network analysis currently and would love to make this an end-of-semester project that could be generalized to any GTFS feed.

Finally, perhaps outside the borders of this project, there is the idea of creating a delay analysis package that could take the GTFS and the real-time GPS data in some standard format and build a protobuf server to scale real-time updates for Google anywhere there is (A) GTFS and (B) GPS on board.

hansthompson commented 8 years ago

@rustyb has a package called GTFSr that might be a good resource to build off of as well.

https://github.com/rustyb/GTFSr

eamcvey commented 8 years ago

@hansthompson Thanks, I'm checking out GTFSr! And thanks for all the information, I am digesting it. It would be great to be able to build on existing stuff.

eamcvey commented 8 years ago

@hansthompson The link to the gist you provided appears to be broken (or I don't have access?)

hansthompson commented 8 years ago

Sorry. I'm new to Gists. Try this one.

https://gist.github.com/hansthompson/a3d2c710ac8e3584d58c

hansthompson commented 8 years ago

I can't get the GTFSr vignette to compile. If you get it working would you mind sharing a copy?

rustyb commented 8 years ago

Howdy Folks - Thanks for the interest in GTFSr and my apologies for not getting back to you sooner. GTFSr was a wee project for an R course in college.

I've a funny feeling I might not have the actual working version on GitHub. Will dig it out on my machine and get it working again tomorrow.

hansthompson commented 8 years ago

Just wanted to make a plug for a package I started for network analysis of GTFS this weekend.

https://github.com/hansthompson/gtfsnetwork

It will convert the GTFS files into an edge list and do some filtering by time and service id.
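A rough illustration of the edge-list idea, assuming a `stop_times` data frame with the standard `trip_id`, `stop_id`, and `stop_sequence` columns. This is a sketch only, not gtfsnetwork's actual implementation, and `stop_times_to_edges` is a hypothetical name:

```r
# Hypothetical helper: build a directed edge list (from_stop -> to_stop) by
# pairing consecutive stops within each trip of a stop_times table.
stop_times_to_edges <- function(stop_times) {
  st <- stop_times[order(stop_times$trip_id, stop_times$stop_sequence), ]
  n <- nrow(st)
  if (n < 2) {
    return(data.frame(trip_id = character(0), from_stop = character(0),
                      to_stop = character(0), stringsAsFactors = FALSE))
  }
  # Keep only adjacent row pairs that belong to the same trip
  same_trip <- st$trip_id[-n] == st$trip_id[-1]
  data.frame(
    trip_id   = st$trip_id[-n][same_trip],
    from_stop = st$stop_id[-n][same_trip],
    to_stop   = st$stop_id[-1][same_trip],
    stringsAsFactors = FALSE
  )
}

# Toy stop_times table with two trips
stop_times <- data.frame(
  trip_id       = c("t1", "t1", "t1", "t2", "t2"),
  stop_id       = c("A", "B", "C", "C", "A"),
  stop_sequence = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)
edges <- stop_times_to_edges(stop_times)
print(edges)
```

Filtering by time window or service id could then be done on `stop_times` (joined to trips.txt for `service_id`) before building the edges.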

I'm not sure how to write packages for S4 objects though, so I just read in the files as separate data.frames. What are your thoughts on this, @eamcvey and @rustyb?

eamcvey commented 8 years ago

@hansthompson Things like this network analysis are exactly what I hope would be built into/on top of the package I was envisioning. At minimum, the package should make it easy to get GTFS feeds, assess the quality of the data, save it in a useful gtfs object, and make it convenient to do the types of joins that would be most common. I have a start on some of these features that I'll put into a public repo by the end of the week. Then, ideally, getting the data to the starting point for network analysis is very easy, and you can focus on the network part.

hansthompson commented 8 years ago

Cool. I'll look forward to it! What are your thoughts on an rmarkdown-like output of the feed validation, with charts that show when service ids run, maps of the stops, etc.?

Emaasit commented 8 years ago

@eamcvey & @hansthompson Great discussion thus far. I would like to jump in too. I was wondering if the public repo that @eamcvey planned to create was ready. You could outline some specific tasks that we can start working on.

eamcvey commented 8 years ago

Better late than never: the code I've started on is finally in a public repo here: https://github.com/ropenscilabs/gtfsr

I've got the basic functionality to pull feeds from the transitfeeds.com API, put all the feed data into a list of data frames (not yet a class, because I'm not sure what level of validation there should be), and create a validation data frame as part of that list to start characterizing the data quality of the feed. There is more to be done on data validation (checking that the ids in different data frames match up where they should, for example), thinking to do about what the gtfs object should look like (maybe adapting existing code referenced in this discussion), and lots that could be built on top of this. I have a driver file in the repo that I used to test things out, and there are some functions in there I wrote on the fly that should get formalized.
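The id-matching check mentioned above could look something like the following sketch. It assumes the feed is a named list of data frames containing `routes`, `trips`, `stops`, and `stop_times`; `check_feed_ids` is a hypothetical helper for illustration and not the code in the gtfsr repo:

```r
# Hypothetical helper: for each child/parent table pair, count the distinct
# ids in the child table that have no match in the parent table.
# Assumes all four tables are present in the feed list.
check_feed_ids <- function(feed) {
  checks <- list(
    c("trips", "route_id", "routes"),
    c("stop_times", "trip_id", "trips"),
    c("stop_times", "stop_id", "stops")
  )
  rows <- lapply(checks, function(chk) {
    child  <- feed[[chk[1]]][[chk[2]]]
    parent <- feed[[chk[3]]][[chk[2]]]
    data.frame(
      table = chk[1], field = chk[2], parent = chk[3],
      n_unmatched = length(setdiff(child, parent)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy feed in which trips.txt refers to a route that routes.txt lacks
feed <- list(
  routes = data.frame(route_id = "r1", stringsAsFactors = FALSE),
  trips = data.frame(route_id = c("r1", "r2"), trip_id = c("t1", "t2"),
                     stringsAsFactors = FALSE),
  stops = data.frame(stop_id = c("A", "B"), stringsAsFactors = FALSE),
  stop_times = data.frame(trip_id = c("t1", "t1", "t2"),
                          stop_id = c("A", "B", "A"),
                          stringsAsFactors = FALSE)
)
validation <- check_feed_ids(feed)
print(validation)
```

Returning one row per check keeps the result easy to append to a larger validation data frame.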

Emaasit commented 8 years ago

4 Main Purposes of the Package

  1. provides API wrappers for popular public GTFS feed sharing sites,
  2. reads feed data into a gtfs data object,
  3. validates data quality,
  4. provides convenience functions for common tasks

Convenience functions for common tasks may include:

  1. how to calculate fares,
  2. how to search for trips,
  3. how to optimize feed data

SymbolixAU commented 7 years ago

I started to put together my own package to handle the GTFS-realtime feeds - https://github.com/SymbolixAU/gtfsway

It uses the RProtoBuf package to load the .proto file in .onLoad(). Then the gtfs_realtime() function reads the binary result of a GTFS real-time response (although at the time of writing it doesn't do anything with the data; I'm still working on it). For example, the real-time feed for South East Queensland can be downloaded with:

## south east Queensland
url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
response <- httr::GET(url)

If you want, I can make this into a 'formal' function and issue a PR to incorporate it into gtfsr?

rafapereirabr commented 7 years ago

I'm really glad to read this thread and see that more people are interested in using R to do network analysis of GTFS datasets. I hope to contribute more to the project in the future. For now, I'll share a similar initiative using Java, which can bring some useful insights. It was created by Tyler Green.

http://www.tyleragreen.com/blog/2017/03/graphing-transit-systems-part-ii-centrality/

bbrewington commented 7 years ago

@eamcvey Nice work! Do you know if there are plans to bring that package into CRAN?