Closed by polettif 3 years ago
@polettif I'm glad you found the implementation useful, and I'm happy to see that tidytransit's reading function is now actually faster than gtfstools' :)
Just to let you know, I spent some time today looking at gtfstools::read_gtfs() again and changed how the function behaves when it hits parsing failures. Previously it raised a warning whose message detailed where the failure happened. Now it raises an error instead.
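As a rough illustration of the warning-to-error behaviour described above, here is a minimal sketch. The helper name read_strict and its reader argument are hypothetical and not part of gtfstools; the actual implementation may well differ.

```r
# Hypothetical helper: run a reader and promote any warning it raises
# (for example a parsing warning) to an error.
# `reader` defaults to data.table::fread but can be swapped for any
# function that takes a file path as its first argument.
read_strict <- function(path, reader = data.table::fread, ...) {
  tryCatch(
    reader(path, ...),
    warning = function(w) {
      # Re-raise the warning text as an error, naming the offending file
      stop(
        "Parsing failure in '", path, "': ", conditionMessage(w),
        call. = FALSE
      )
    }
  )
}
```

The trade-off is that the first warning aborts the read, so the caller sees only that first failure rather than a full list of problems.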
You dealt with this in tidytransit using readr::problems(): if I'm not mistaken, you append a dataframe containing the parsing failures to the final GTFS object as an attribute. data.table does not have a similar function. I have an issue thread (https://github.com/ipeaGIT/gtfstools/issues/2) detailing my approach a bit further and linking to a data.table::fread() issue that may come up in edge cases (to be honest, I haven't faced this bug in a while now, so they may have fixed it already).
To be honest, I never really used the appended problems df (I'm not sure it even works?). I think the better approach would be to just issue warnings. I usually prefer errors over warnings (warnings are often ignored), but for parsing failures I think warnings are the better fit. Maybe a very ugly workaround would be to use data.table::fread and, if there's an error, run readr::read_csv to properly catch warnings.
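The workaround floated above could look roughly like the following sketch. The function name read_with_fallback and its fast, slow and problems arguments are hypothetical, introduced only for illustration; neither package ships such a helper, and how readr reports parsing problems may vary between versions.

```r
# Hypothetical helper: try the fast reader first; if it signals a warning
# or an error, re-read the file with the slower reader and surface its
# parsing problems as a warning.
read_with_fallback <- function(path,
                               fast = data.table::fread,
                               slow = readr::read_csv,
                               problems = readr::problems) {
  fallback <- function(cond) {
    tbl <- slow(path)
    probs <- problems(tbl)
    if (nrow(probs) > 0) {
      warning(
        nrow(probs), " parsing problem(s) found in '", path, "'",
        call. = FALSE
      )
    }
    tbl
  }
  # Either condition from the fast reader triggers the slow path
  tryCatch(fast(path), warning = fallback, error = fallback)
}
```

The file is read twice only when the fast path fails, so a clean feed keeps the speed of the fast reader.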
@dhersz I looked into the read function implemented in gtfstools after you pointed out the speed increase and took the liberty to adapt it for tidytransit. You did some great work with the implementation there, thanks! I realized that tidytransit's read functions were quite cluttered and a bit over-engineered. I couldn't quite figure out where the main bottleneck was, even after using data.table::fread; I guess there was too much piping going on. As you can see below, it's noticeably faster than the current implementation and comparable to gtfstools.
Anyway, this PR basically clones gtfstools::read_gtfs with some changes to make it work with tidytransit. The main changes besides some refactoring are:
set_dates is used within read_gtfs to replace string date columns with Date objects. This is also a work in progress, and it is necessary since other tidytransit functions depend on dates being Dates.
This PR is generally linked to #160 (pinging @mpadge as well) but IMO the speed improvements are valuable just for tidytransit.
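To make the date conversion mentioned above concrete, here is a minimal base R sketch. It is not tidytransit's set_dates; the helper names parse_gtfs_date and convert_date_columns and the default column list are assumptions for illustration. It relies on GTFS storing dates as YYYYMMDD strings.

```r
# Hypothetical helper: parse GTFS-style "YYYYMMDD" values into Date objects.
# as.character() also covers columns that were read in as integers.
parse_gtfs_date <- function(x) {
  as.Date(as.character(x), format = "%Y%m%d")
}

# Hypothetical helper: convert whichever of the given date columns exist
# in a table, leaving all other columns untouched.
convert_date_columns <- function(tbl,
                                 cols = c("date", "start_date", "end_date")) {
  for (col in intersect(cols, names(tbl))) {
    tbl[[col]] <- parse_gtfs_date(tbl[[col]])
  }
  tbl
}

# Usage with a toy calendar table
calendar <- data.frame(
  service_id = "weekday",
  start_date = "20210101",
  end_date = "20211231",
  stringsAsFactors = FALSE
)
calendar <- convert_date_columns(calendar)
```

Values that do not match the expected format come back as NA rather than raising an error, so a real implementation might want to check for NAs introduced by the conversion.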
Created on 2021-02-10 by the reprex package (v0.3.0)