r-transit / gtfsio

Read and Write General Transit Feed Specification (GTFS)
https://r-transit.github.io/gtfsio/
Other
13 stars 3 forks source link

download feeds from urls without zip extension #28

Closed polettif closed 2 years ago

polettif commented 2 years ago

MobilityData provides a file with direct download links for feeds. I'm working on resolving tidytransit#172 by using their csv file.

However, some URLs (~22%) don't end with "zip"

MobilityData.csv = read.csv("https://bit.ly/catalogs-csv")

# skip gtfs realtime data
MobilityData_gtfs = MobilityData.csv[MobilityData.csv$data_type == "gtfs",]

file_extensions = tools::file_ext(MobilityData_gtfs$urls.direct_download)

table(file_extensions)
#> file_extensions
#>           0  ashx  aspx  gtfs phtml   zip   ZIP 
#>   283     2     1     4     2     1  1037     2

Most of them (I guess, I haven't checked all) still provide a zip file with the download. If I skip this line https://github.com/r-transit/gtfsio/blob/984f821c9874228bd07e236f3169464291594905/R/import_gtfs.R#L88

the download and reading in this reprex works fine:

gtfsio::import_gtfs("http://nycferry.connexionz.net/rtt/public/utility/gtfs.aspx")
#> Error: 'path' must have '.zip' extension.

I think gtfsio should be able to handle all possible URLs. There's multiple ways to resolve this:

I don't think skipping the zip check completely is a good way to handle it and we can't really cover all sensible extensions. I thinks the easiest approach is to simply download the file and then check if it can be unzipped. There might be security issues somewhere, I don't know, but if you can't use a URL with import_gtfs you have to download the file somewhere else (browser, utils::download.file, ...) anyways.

dhersz commented 2 years ago

Thanks for bringing up this issue @polettif.

As you mentioned, I don't think we can cover all possible extensions when checking the URL, and I don't like ignoring it altogether either. I like your proposal of downloading the file and then checking if it can be unzipped, but is there a reliable way of doing that? There might be something in {zip} that can help us with that, but I haven't personally used it before, so not sure.

PS: I glanced through the GTFS reference to see what they say about file extensions/zipping feeds. The only "zip" occurrences are:

Term definitions

This section defines terms that are used throughout this document.

  • Dataset - A complete set of files defined by this specification reference. Altering the dataset creates a new version of the dataset. Datasets should be published at a public, permanent URL, including the zip file name. (e.g., https://www.agency.org/gtfs/gtfs.zip).

(...)

File requirements

(...)

  • All dataset files must be zipped together.

So it seems like the reference assumes that feeds must be zipped with .zip extension, but it's not terribly clear. Do you agree with my interpretation? I might open an issue in the reference repo to clarify that (not saying that we shouldn't follow your proposal, I think we should - especially because we are quite flexible on reading non-standard compliant feeds anyway, column types aside -, but just for clarity sake).

polettif commented 2 years ago

I like your proposal of downloading the file and then checking if it can be unzipped, but is there a reliable way of doing that?

My idea was:

Something along those lines. Using tryCatch is not nice and clean but I think it's not possible nor necessary to check if the file is really a zip file before unzipping. Also, using zip::zip_list("file.zip") also involves catching an error so why not just unzip it which should run fine if it's really a zip file.

So it seems like the reference assumes that feeds must be zipped with .zip extension, but it's not terribly clear. Do you agree with my interpretation? I might open an issue in the reference repo to clarify that

Yes, I read it the same way. I don't think a new issue is necessary though, the current wording is clear enough I wouldn't spend time on gtfs reference discussing it... 😄

dhersz commented 2 years ago

Hi @polettif, sorry for taking long to look at this. https://github.com/r-transit/gtfsio/commit/a271bc3ae312bccb6cbacafdf94d4a87c46438f4 should include a fix to this, can you check on your side, please? import_gtfs() should now accept file paths and URLs that don't include a .zip extension for some reason or another but are still valid zip files.

polettif commented 2 years ago

No worries @dhersz. Works like a charm, with meaningful errors, thanks!