remix / partridge

A fast, forgiving GTFS reader built on pandas DataFrames
https://partridge.readthedocs.io
MIT License

Build functionality to pull GTFS from a URL #43

Closed csb19815 closed 5 years ago

csb19815 commented 6 years ago

Downloading and navigating to a given GTFS feed for use in Partridge can be cumbersome and risky due to internet speeds and version control issues. A function to pull a fresh feed from a URL (e.g., from an agency's developer portal) would help with this.

See url2gtfs in https://davidabailey.com/articles/Visualizing-Public-Transportation-Speeds-with-Python for an example.

kuanb commented 6 years ago

"Downloading and navigating to a given GTFS feed for use in Partridge can be cumbersome and risky due to internet speeds and version control issues." ^ I think that is an excellent argument for why one should not rely on a URL endpoint to pull a GTFS feed! :)

Also, I suspect that @invisiblefunnel will cite the "As little as possible" section of this library's philosophy (https://github.com/remix/partridge#philosphy) and suggest that such a feature exist outside of partridge.

csb19815 commented 6 years ago

Great points! I should rephrase: in situations where I'm using partridge for a transformation of a feed that needs to occur each time the feed updates, I care about having the very latest version of that feed.

invisiblefunnel commented 6 years ago

@csb19815 I feel your pain on this workflow. Let me think about it.


@kuanb can you tell me more about this point? Not sure I follow.

I think that is an excellent argument for why one should not rely on a URL endpoint to pull a GTFS feed!

kuanb commented 6 years ago

I could be wrong/misunderstanding, but it seems like, if there's a static endpoint that holds the latest version of a feed, then one could write a simple method of their own to download from that endpoint to a local file path (à la the example pseudocode below).

If the intent of partridge is to focus on the act of converting a zip file into pandas DataFrames, then my hunch would be that adding such a feature to download from a url may fall out of that scope.

The OP's point that internet speeds and version control could pose problems is exactly why one might not want to rely on such an endpoint. If one reads a feed from a local environment, one can, in theory, control the state of that feed - thus ensuring said feed is of a certain vintage, etc. Meanwhile, relying on an external source of truth leaves one vulnerable to the risks the OP listed: a required internet connection, and inconsistent results because the downloaded GTFS feed is not the same as the one returned by previous queries.

Quick example of downloading from an external link before reading the latest feed into partridge (the endpoint URL is made up):

import os
import tempfile

import partridge as ptg
import requests

def download_to_file():
    r = requests.get('http://some.special.endpoint.com/gtfs')
    r.raise_for_status()
    p = os.path.join(tempfile.mkdtemp(), 'gtfs.zip')
    with open(p, 'wb') as f:
        f.write(r.content)
    return p

latest_feed_path = download_to_file()
feed = ptg.feed(latest_feed_path)
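
If reproducibility matters (per the "vintage" point above), one lightweight option alongside a download step like this is to pin a checksum of the downloaded zip and fail loudly when the endpoint starts serving a different feed. A minimal sketch, assuming nothing about partridge itself - the helper names and the pinned digest are hypothetical:

```python
import hashlib

def sha256_of(path):
    # Hash the file in chunks so large feeds don't need to fit in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()

def check_vintage(path, pinned_digest):
    # Raise if the local copy differs from the version originally pinned
    actual = sha256_of(path)
    if actual != pinned_digest:
        raise ValueError(f'feed changed: expected {pinned_digest}, got {actual}')
    return path
```

On the first download you would print and record sha256_of(...) next to the analysis; later runs call check_vintage(...) before handing the path to partridge.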
invisiblefunnel commented 6 years ago

Ya, a lot of this depends on the use case/workflows. Both of you make good points.

invisiblefunnel commented 5 years ago

I've decided against this feature for now. It is a bit out of scope and I want to keep maintenance to a minimum. fwiw I use curl -L -O https://example.com/gtfs.zip from the command line.