pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal

Pulling example datasets from Carto #306

Open stuartlynn opened 8 years ago

stuartlynn commented 8 years ago

Opening this ticket to explore an idea that @ljwolf and I had briefly chatted about. For the example datasets used in pysal, could these be maintained externally, pulled by the library when required, and cached locally? It's really easy to pull a Carto table directly into a pandas DataFrame using our SQL API, so it might be a natural fit to store some of those data sources in Carto.
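
For context, a minimal sketch of what that pull could look like. The account and table names below are placeholders rather than existing resources; the only real piece is that Carto's SQL API can return CSV, which pandas reads directly.

```python
# Minimal sketch: pull a (hypothetical) Carto-hosted example table into pandas.
import pandas as pd
from urllib.parse import urlencode

CARTO_ACCOUNT = "pysal-examples"  # placeholder account name
SQL_API = f"https://{CARTO_ACCOUNT}.carto.com/api/v2/sql"

def carto_table_to_df(table):
    """Return the full contents of a Carto table as a pandas DataFrame."""
    params = urlencode({"q": f"SELECT * FROM {table}", "format": "csv"})
    return pd.read_csv(f"{SQL_API}?{params}")

# e.g. df = carto_table_to_df("south")
```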

This would be similar to the approach scikit-learn takes with fetching its example datasets.

talos commented 8 years ago

I love this idea!

We're doing a lot of ETL work right now in https://github.com/cartodb/bigmetadata, and one of the outputs we support is tables on Carto.

I'd love to do a one-off PoC of hosting on Carto, mainly thinking about the downstream side (getting the data from Carto).

If that looks good, we could think about handling the upstream side as part of our open-source ETL, which would mean easier reproducibility and execution.

ljwolf commented 8 years ago

This would be really cool, and it would definitely make it easier to show functionality. I know I reach for south.dbf out of familiarity... and I don't even know what many of its columns mean because there's no metadata on it.

Most of the time I spent writing cenpy went into figuring out how to automatically scrape and format metadata about census products. Once that's done, the rest should just be writing a simple query builder for the public SQL API, right? I had designs on wrapping common queries with cenpy and adding them to pysal, but just haven't had the time. If there's interest in building an examples.fetch function around Carto, I'm game to contribute/review.
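
To make the scope concrete, here is a rough sense of what such a query builder plus fetch function could look like. Everything here is illustrative: the account name, table and column names, and function signatures are assumptions, not an agreed design.

```python
# Rough sketch of a "simple query builder" for Carto's public SQL API.
import pandas as pd
from urllib.parse import urlencode

def build_query(table, columns=None, where=None, limit=None):
    """Compose a SELECT statement against a Carto-hosted table."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if where:
        query += f" WHERE {where}"
    if limit:
        query += f" LIMIT {limit}"
    return query

def fetch(account, table, **kwargs):
    """Run the query through the public SQL API and return a DataFrame."""
    params = urlencode({"q": build_query(table, **kwargs), "format": "csv"})
    return pd.read_csv(f"https://{account}.carto.com/api/v2/sql?{params}")

# e.g. fetch("pysal-examples", "south", columns=["name", "hr90"], limit=100)
```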

Should the internals be maintained by pysal, though? Maybe a separate, conditional dependency makes more sense, like cartodb-python? I may, in drier times, be wary of the long-run maintenance costs, given how dynamic y'all's systems are :grinning:

ljwolf commented 8 years ago

I mean, is it as simple as pointing pd.read_table to fixed targets?
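
In the simplest case, the fixed targets could literally be a dict of URLs; the ones below are illustrative placeholders, not live endpoints.

```python
# Illustrative: fixed SQL API targets for a couple of example datasets.
import pandas as pd

EXAMPLE_TARGETS = {
    "south": ("https://pysal-examples.carto.com/api/v2/sql"
              "?q=SELECT%20*%20FROM%20south&format=csv"),
    "columbus": ("https://pysal-examples.carto.com/api/v2/sql"
                 "?q=SELECT%20*%20FROM%20columbus&format=csv"),
}

south = pd.read_csv(EXAMPLE_TARGETS["south"])
```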

sjsrey commented 8 years ago

This would be another nice way to collaborate between the two projects.

I agree with @ljwolf that a conditional dependency approach makes the most sense as an initial exploration. We could think about a couple of options: 1) use the implementation to have pysal extend its example datasets to include a Carto-hosted dataset when the conditional import is available, or 2) collaborate on building a new dataset that lives at Carto and use that to test out an implementation.

There are probably other routes to explore as well, but these are some initial thoughts.
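
As a sketch of what the conditional-dependency pattern might look like (the package name and dataset names are placeholders, not a committed design):

```python
# Guard the optional Carto client; only expose remote examples when it exists.
try:
    import cartodb  # or whichever Carto client is adopted; name illustrative
    HAS_CARTO = True
except ImportError:
    HAS_CARTO = False

LOCAL_EXAMPLES = ["columbus", "south"]       # shipped with pysal today
REMOTE_EXAMPLES = ["carto_hosted_dataset"]   # hypothetical Carto-hosted set

def available_examples():
    """List example datasets, including Carto-hosted ones when possible."""
    return LOCAL_EXAMPLES + (REMOTE_EXAMPLES if HAS_CARTO else [])
```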

talos commented 8 years ago

I like the idea of a conditional dependency -- I think this code would be better maintained outside of mainline pysal.

How's this for a roadmap:

  1. Put the most-used pysal datasets (stuff like south.dbf) on Carto. Put together a very simple shell of a Python module that can pull these down (a minimal sketch of such a shell follows this list).
  2. Identify a few new datasets that would be great to have easy access to.
  3. Identify connections between those datasets and data like the ACS, QCEW, LEHD/LODES etc. that are already in the Observatory.
  4. Write a bit of metadata for those new datasets.
  5. Write some functions in the Python module that make the appropriate calls to Carto to pull down datasets that may be a mix of these newly documented datasets and the ACS, QCEW, etc.
  6. Profit (or the pysal equivalent).
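
A minimal sketch of the step-1 shell, with the local caching mentioned at the top of the thread; the account name, dataset names, and cache location are assumptions for illustration only.

```python
# Sketch of a bare-bones example-fetching module with a local cache.
import os
import pandas as pd
from urllib.parse import urlencode

CARTO_ACCOUNT = "pysal-examples"                     # placeholder account
CACHE_DIR = os.path.expanduser("~/.pysal/examples")  # placeholder cache path

def fetch_example(name, refresh=False):
    """Return an example dataset as a DataFrame, downloading it only once."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, f"{name}.csv")
    if refresh or not os.path.exists(cached):
        params = urlencode({"q": f"SELECT * FROM {name}", "format": "csv"})
        url = f"https://{CARTO_ACCOUNT}.carto.com/api/v2/sql?{params}"
        pd.read_csv(url).to_csv(cached, index=False)
    return pd.read_csv(cached)

# e.g. south = fetch_example("south")
```
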
sjsrey commented 8 years ago

A PR to interop with Carto Observatory would be fantastic.