r-transit / gtfsio

Read and Write General Transit Feed Specification (GTFS)
https://r-transit.github.io/gtfsio/
Other
13 stars 3 forks source link

subset functions #14

Closed polettif closed 3 years ago

polettif commented 3 years ago

@dhersz what are the design decisions behind masked subset functions (e.g. $.gtfs)? I don't really see their benefit and it causes some hassle with assigning new auxilliary tables. (Well, I could overwrite them for tidytransit with $.tidygtfs or something like that but I think it's worth discussing for gtfsio)

dhersz commented 3 years ago

Hi @polettif.

I explained some of the decisions behind this function in the The 'gtfs' class discussion. I'll copy the relevant bit at the end of this post, so it doesn't clutter the rest of it.

To be short, the [.gtfs is necessary to preserve the gtfs class when subsetting it with [ (otherwise it would be of class list). The $ and [[ methods should behave exactly like the default methods, but raise warnings if the element you're trying to subset doesn't exist.

What's the issue you're facing when trying to assign new auxiliary tables? I'm able to do it without any problems:

library(gtfsio)

gtfs_path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")
gtfs <- import_gtfs(gtfs_path)

gtfs$oi <- data.frame(a = 1, b = 2)

gtfs$oi
#>   a b
#> 1 1 2

gtfs$oie
#> Warning: This GTFS object doesn't contain the following element(s): 'oie'
#> NULL

AFAIK these methods shouldn't interfere with assignment, only subsetting. But I may very well be wrong.

I like this warning message just to be more specific why you're getting the NULL output. I took this idea from {tibble}, which also raise warnings in such cases:

a <- tibble::tibble(a = 1, b = 2)

a$e
#> Warning: Unknown or uninitialised column: `e`.
#> NULL

b <- data.frame(a = 1, b = 2)

b$e
#> NULL

{gtfsio} also include custom methods for subsetting GTFS objects. The first, which was actually shown above, is that subsetting it with [ preserves the gtfs class ([.default would strip the class from the object):

class(gtfs)
#> [1] "gtfs"

class(gtfs["shapes"])
#> [1] "gtfs"

class(gtfs[c("shapes", "trips")])
#> [1] "gtfs"

The $ and [[ operators were also tweaked to raise warnings when attempting to subset missing elements (this warning aside, they behave exactly like the default operators).:

gtfs["ola"]
#> Warning: This GTFS object doesn't contain the following element(s): 'ola'
#> $<NA>
#> NULL

gtfs$ola
#> Warning: This GTFS object doesn't contain the following element(s): 'ola'
#> NULL

gtfs[["ola"]]
#> Warning: This GTFS object doesn't contain the following element(s): 'ola'
#> NULL

polettif commented 3 years ago

Ah sorry, I missed your explanations on the gtfs class! Also, as it turns out I wasn't creating the gtfs object correctly on the gtfsio branch, so it's not really an issue with additional tables.

However, my issue is this: How should I check if a feed contains a table? tidytransit::set_date_service_table does this via !is.null(gtfs_obj$calendar_dates) which does work but issues an unnecessary warning.

In general I think gtfs should inherit from list and expected/accepted behaviour for list missing list entries is NULL without warnings. Is there any reason not to inherit from list? Granted, S3 classes are not really my strong suit.

dhersz commented 3 years ago

No worries!

However, my issue is this: How should I check if a feed contains a table? tidytransit::set_date_service_table does this via !is.null(gtfs_obj$calendar_dates) which does work but issues an unnecessary warning.

So, I created another discussion that discusses exactly this haha, here's the link: Utility functions to be used across packages (perhaps I should have pinged you there, I thought that you'd be notified anyway since you're and admin of r-transit).

In short, I created some functions to check these cases: assert_files_exist(), assert_fields_exist() and assert_fileds_types() (and the equivalent check_*() functions). These checks are "recursive" (i.e. when checking the field type first you check if the file exists and if the field exists). So let's say, for example, that you need to check if gtfs$calendar_dates$service_id is character or not inside tidytransit::set_date_service_table() (I haven't checked the code, just an example). Then you could use:

library(gtfsio)

gtfs_path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")
gtfs <- import_gtfs(gtfs_path)

assert_fields_types(
  gtfs, 
  "calendar_dates", 
  fields = "service_id", 
  types = "character"
)

You can see what I mean by "recursive" checks below (run_assert() was created just to save some typing):

run_assert <- function() {
  assert_fields_types(
    gtfs, 
    "calendar_dates", 
    fields = "service_id", 
    types = "character"
  )
}

gtfs$calendar_dts <- gtfs$calendar_dates
gtfs$calendar_dates <- NULL
gtfs$calendar_dts[, service_i := factor(service_id)]
gtfs$calendar_dts[, service_id := NULL]

run_assert()
#> Error in assert_files_exist(x, file): The GTFS object is missing the following required element(s): 'calendar_dates'

gtfs$calendar_dates <- gtfs$calendar_dts

run_assert()
#> Error in assert_fields_exist(x, file, fields): The GTFS object 'calendar_dates' element is missing the following required column(s): 'service_id'

gtfs$calendar_dates[, service_id := service_i]

run_assert()
#> Error in assert_fields_types(gtfs, "calendar_dates", fields = "service_id", : The following columns in the GTFS object 'calendar_dates' element do not inherit from the required types:
#>   - 'service_id': requires character, but inherits from factor

gtfs$calendar_dates[, service_id := as.character(service_id)]

run_assert()

In general I think gtfs should inherit from list and expected/accepted behaviour for list missing list entries is NULL without warnings. Is there any reason not to inherit from list? Granted, S3 classes are not really my strong suit.

Having said all of that above, I see that it can be a hassle to change code inside {tidytransit} just because I'm suggesting the use of some functions and because I changed the default behaviour of subsetting functions.

So I'll happily remove this warning if you think it's not adequate. I just thought it could be a cool feature, and since I was already playing with the methods it felt ok to change it, but it didn't occur to me that I could be breaking/changing something you had already written.

These subsetting methods aside, and a custom print() method I describe in the gtfs class discussion as well, a gtfs object behaves exactly like a list.

typeof(gtfs)
#> [1] "list"
polettif commented 3 years ago

So, I created another discussion that discusses exactly this haha, here's the link: Utility functions to be used across packages (perhaps I should have pinged you there, I thought that you'd be notified anyway since you're and admin of r-transit).

I was notified but honestly it's hard to keep track of all the discussions, sorry.

Thanks for your explanation. I do appreciate your assert_* and check_* approach and I'll look into how they could be implemented in tidytransit. They could probably replace the simplistic feed_contains.

So I'll happily remove this warning if you think it's not adequate.

Yes, I'd prefer it to be removed as it's not really expected behavior for a subclass of list.

Another point I think I already brought up elsewhere: I think warnings should be used as rarely as possible, especially in a low level package. Warnings IMO can be ignored too often (i.e. aren't that important). They clutter the console which then leads to users overlooking warnings. If a warning is crucial it should be an error and users should be forced to handle it (modifying input, using different params).

These subsetting methods aside, and a custom print() method I describe in the gtfs class discussion as well, a gtfs object behaves exactly like a list.

Good to know, thanks. I wasn't aware that "implicit classes" are not returned with class():

class(gtfs)
#> [1] "gtfs"
dhersz commented 3 years ago

Cool, I removed the custom methods for $ and [[, and removed the occasional warning from [ in https://github.com/r-transit/gtfsio/commit/27602c840531da8fd9cb84ecfe8f4cbf267d52f8. Now the latter is used only to keep the gtfs class (and any subclasses) to subsetted objects.

That seems to close the issue, please re-open if you still see any problems! Cheers

polettif commented 3 years ago

That seems to close the issue

It does, thank you!