r-transit / tidytransit

R package for working with GTFS data
https://r-transit.github.io/tidytransit/

put calculated feeds in a sublist of the gtfs_obj #63

Closed. tbuckl closed this issue 5 years ago.

polettif commented 5 years ago

As an idea, we could put all calculated tidytransit data frames and objects in a sublist in gtfs_obj, like gtfs_obj$tidytransit$frequencies or something shorter? Same for gtfs_obj$tidytransit$headways. That way we don't clutter up the main object and we can avoid possible conflicts with already existing filenames. Plus, it's easier to see at once which data frames have been generated.
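A minimal sketch of that layout (the stand-in tables and sublist contents here are made up; only the structure is the point):

# stand-in feed object with a few standard GTFS tables
gtfs_obj <- list (stops = data.frame (), stop_times = data.frame ())

# hypothetical: all derived tidytransit tables live in one sublist,
# so they can never collide with the GTFS file names at the top level
gtfs_obj$tidytransit <- list (
    frequencies = data.frame (), # e.g. result of a frequency calculation
    headways    = data.frame ()  # e.g. result of a headway calculation
)

# one call shows everything that has been generated so far
names (gtfs_obj$tidytransit)
# [1] "frequencies" "headways"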

mpadge commented 5 years ago

This is exactly the issue I'm grappling with in gtfs-router and the timetable data. My current inclination is to have a binary parameter, process_feed = TRUE or something, with a clear explanation of what is added to the actual feed data.
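A rough sketch of that idea (everything here is hypothetical, not the actual gtfsrouter API):

# hypothetical reader with an opt-in processing step
read_feed <- function (path, process_feed = TRUE)
{
    feed <- list (stop_times = data.frame ()) # stand-in for the parsed feed
    if (process_feed)
        feed$timetable <- data.frame ()       # stand-in for the derived timetable
    return (feed)
}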

polettif commented 5 years ago

What's the status on this? While working on the timetable vignette I realized it's easier if we know where to look for calculated data, while also keeping the top level of gtfs_obj clean.

mpadge commented 5 years ago

I implemented a simple local cache with the following code:

# cache object using hash from hash_object
cache_file <- function (hash_object, object, prefix = "")
{
    hash <- paste0 (prefix, digest::digest (hash_object), ".Rds")
    saveRDS (object, file = file.path (tempdir (), hash))
}

# load cached object using hash from hash_object; returns NULL if no cache exists
load_cached_file <- function (hash_object, prefix = "")
{
    hash <- paste0 (prefix, digest::digest (hash_object), ".Rds")
    fname <- file.path (tempdir (), hash)
    ret <- NULL
    if (file.exists (fname))
        ret <- readRDS (fname)
    return (ret)
}
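For example, round-tripping an object through the cache (assuming the digest package is installed):

x <- data.frame (a = 1:3)
cache_file (hash_object = x, object = x, prefix = "tt_")
y <- load_cached_file (hash_object = x, prefix = "tt_")
identical (x, y) # TRUE once the cache file has been written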

BUT the problem is that the gtfs_stops file is so big that calculating a digest of it takes up to a few seconds, making caching no quicker than just re-calculating timetables anyway. (My gtfsrouter timetables are different to yours, but the principle still applies.)

I've currently reverted to a separate gtfs_timetable() function which returns the timetable object; otherwise it's just re-calculated each time. (It never takes more than a couple of seconds anyway.)
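In gtfsrouter that currently looks roughly like this (check the gtfsrouter docs for the exact signatures; the path is a placeholder):

gtfs <- gtfsrouter::extract_gtfs ("feed.zip") # read the raw feed
gtfs <- gtfsrouter::gtfs_timetable (gtfs)     # returns the feed with timetable attached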

polettif commented 5 years ago

Thanks for pointing me to digest, looks like a better approach than the openssl implementation I was using.

Slightly off topic (as I'd like to discuss the gtfs_obj$tidytransit$... approach mentioned above), but wouldn't it be easier to hash the parameters used to generate the object (departure times, stops, ...) instead of the whole object?
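For a sense of the difference (stand-in objects; actual timings depend on feed size):

# small list of the inputs used to generate the derived table
params <- list (start_time = "06:00:00", end_time = "09:00:00")

# stand-in for a large feed table
gtfs_obj <- list (stop_times = data.frame (x = runif (1e6)))

digest::digest (params)   # hashing the parameter list is essentially instant
digest::digest (gtfs_obj) # hashing the whole object scales with its size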

mpadge commented 5 years ago

Unfortunately, my firm belief is that for GTFS data, the whole object has to be hashed because it is highly likely that people will do their own filtering, especially on stop_times. Once that's potentially been done, hashing on anything else may well lead to catastrophe.
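A toy illustration of that failure mode (made-up data): filtering changes the feed but not the parameters, so a parameter-only hash would silently return a stale cached result:

feed <- list (stop_times = data.frame (trip_id = 1:10))
params <- list (start_time = "06:00:00")

h_before <- digest::digest (params)
feed$stop_times <- feed$stop_times [1:5, , drop = FALSE] # user filters the feed
h_after <- digest::digest (params)

identical (h_before, h_after) # TRUE: the cache key misses the change
digest::digest (feed)         # a whole-object hash would pick it up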

tbuckl commented 5 years ago

@polettif On my end I have not done any work on this so far. It still seems like a great idea. Not sure when I'll get to it, but it does seem pretty easy in theory.

It might be a nice change to get in and then release to CRAN within 2 weeks or so?

I can maybe check on it over the next weekend or two.

polettif commented 5 years ago

> On my end I have not done any work on this so far. It still seems like a great idea. Not sure when I'll get to it, but it does seem pretty easy in theory.
>
> It might be a nice change to get in and then release to CRAN within 2 weeks or so?

Ok. I'll start using it on the vignette branch then.

> Unfortunately, my firm belief is that for GTFS data, the whole object has to be hashed because it is highly likely that people will do their own filtering, especially on stop_times. Once that's potentially been done, hashing on anything else may well lead to catastrophe.

I see, good point.