r-transit / tidytransit

R package for working with GTFS data
https://r-transit.github.io/tidytransit/
150 stars 22 forks source link

Clone read_gtfs from gtfstools #161

Closed polettif closed 3 years ago

polettif commented 3 years ago

@dhersz I looked into the read function implemented in gtfstools after you pointed out the speed increase and took the liberty to adapt it for tidytransit. You did some great work with the implementation there, thanks! I realized that tidytransit's read functions were quite cluttered and a bit over-engineered. I couldn't quite figure out where the main bottleneck was, even after using data.table::fread, I guess there was too much piping going on. As you can see below, it's noticeably faster than the current implementation and comparable to gtfstools.

Anyway, this PR basically clones gtfstools::read_gtfs with some changes to make it work with tidytransit. The main changes besides some refactoring are:

This PR is generally linked to #160 (pinging @mpadge as well) but IMO the speed improvements are valuable just for tidytransit.

devtools::install_github("r-transit/tidytransit@6cc12f23a1a65804a810507974f748754a1420a0")
library(tidytransit)
path = system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit")
microbenchmark::microbenchmark(
  tidytransit.master = tidytransit::read_gtfs(path),
  times = 10, unit = "ms"
)
#> Unit: milliseconds
#>                expr      min       lq     mean   median       uq      max neval
#>  tidytransit.master 1952.626 2286.438 2399.339 2439.211 2603.777 2780.352    10
devtools::install_github("r-transit/tidytransit@58d19b64a611ac9b9bf3a049297bda259c6bc38c")
path = system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit")
microbenchmark::microbenchmark(
  tidytransit.improved = tidytransit::read_gtfs(path),
  gtfstools.read_gtfs = gtfstools::read_gtfs(path),
  times = 10, unit = "ms"
)
#> Unit: milliseconds
#>                  expr    min     lq    mean median     uq      max neval
#>  tidytransit.improved 689.12 710.91  920.35 724.13 1135.54 1454.31    10
#>   gtfstools.read_gtfs 550.12 594.07 1027.07 880.55 1514.77 1686.36    10

Created on 2021-02-10 by the reprex package (v0.3.0)

dhersz commented 3 years ago

@polettif I'm glad you found the implementation useful, and I'm happy to see that tidytransit reading function is now actually faster than gtfstools' :)

Just to let you know, I just happened to spend some time today looking at gtfstools::read_gtfs() again and I changed the function behaviour when facing parsing failures. Previously it would raise a warning and output the warning message detailing where the failure happened. Now it raises an error instead.

You dealt with this in tidytransit using readr::problems(), if I'm not mistaken you'd append a dataframe containing the parsing failures to the final GTFS object as an attribute, but data.table does not have a similar function. I have an issue thread (https://github.com/ipeaGIT/gtfstools/issues/2) detailing my approach a bit further and linking to a data.table::fread() issue as well that may come up in edge cases (but to be honest I haven't faced this bug in a while now, they may have fixed it already).

polettif commented 3 years ago

You dealt with this in tidytransit using readr::problems(), if I'm not mistaken you'd append a dataframe containing the parsing failures to the final GTFS object as an attribute, but data.table does not have a similar function. I have an issue thread (ipeaGIT/gtfstools#2) detailing my approach a bit further and linking to a data.table::fread() issue as well that may come up in edge cases (but to be honest I haven't faced this bug in a while now, they may have fixed it already).

To be honest, I never really used the appended problems df (I'm not sure it even works?), I think the better approach would be to just issue warnings. I usually prefer errors over warnings (warnings are often ignored) but for parsing failures they're better I think. Maybe a very ugly workaround would be using data.table::fread and if there's an error, run readr::read_csv to properly catch warnings.