intake / intake_geopandas

An intake plugin for loading datasets with geopandas
BSD 2-Clause "Simplified" License

Load from non-shapefiles #1

Closed ian-r-rose closed 5 years ago

ian-r-rose commented 5 years ago

I'm interested in using intake to load into geopandas dataframes, and am still trying to wrap my head around idiomatic usage. This driver seems to be oriented around loading shapefiles. Geopandas, however, is also able to load GeoJSON files and PostGIS databases. My understanding of intake drivers is that they are oriented around loading from specific kinds of data, and putting the data into one of a small number of containers.

If that is the case, would I want to write a different driver for GeoJSON or PostGIS? Or would we be able to adapt this driver to handle those cases? Could this driver be alternatively named intake_shapefile?

Any thoughts would be greatly appreciated.

jacobtomlinson commented 5 years ago

Thanks for raising this @ian-r-rose.

My understanding is that the plugin reflects the final container (in this case geopandas) and has one or more sources which it can load. This module has the shapefile source implemented, but I would be keen to add more.

The xarray plugin for intake seems to separate out its multiple sources into separate files. https://github.com/intake/intake-xarray/tree/master/intake_xarray.

So, to answer your question, I think we should extend this driver to include sources for GeoJSON and PostGIS.

As most of the functionality is going to be identical we could create an abstract class which implements most of the logic and then just subclass it in order to register the multiple supported sources.
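A minimal sketch of that idea, in plain Python rather than against intake's real base class. Every name here (`GeoSourceBase`, `SOURCES`, `ShapefileSource`, `GeoJSONSource`) is hypothetical and only illustrates the shape of the design: the base class holds the shared logic, and each subclass registers itself under a source name.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a source name to its class.
# (A real plugin would rely on intake's own registration instead.)
SOURCES = {}


class GeoSourceBase(ABC):
    """Shared logic for every source that yields a GeoDataFrame."""

    name = None  # set by each concrete subclass

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every named subclass as soon as it is defined.
        if cls.name is not None:
            SOURCES[cls.name] = cls

    def __init__(self, urlpath, **reader_kwargs):
        self.urlpath = urlpath
        self.reader_kwargs = reader_kwargs
        self._dataframe = None

    @abstractmethod
    def _open_dataset(self):
        """Return a GeoDataFrame; the only part that differs per source."""

    def read(self):
        # Load on first use and cache the result.
        if self._dataframe is None:
            self._dataframe = self._open_dataset()
        return self._dataframe


class ShapefileSource(GeoSourceBase):
    name = "shapefile"

    def _open_dataset(self):
        import geopandas  # imported lazily, only when data is actually read

        return geopandas.read_file(self.urlpath, **self.reader_kwargs)


class GeoJSONSource(GeoSourceBase):
    name = "geojson"

    def _open_dataset(self):
        import geopandas

        return geopandas.read_file(self.urlpath, **self.reader_kwargs)
```

In this sketch both file-based subclasses end up calling `geopandas.read_file`, so they differ only in the name they register under; a source with genuinely different loading logic would override `_open_dataset` with its own call.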

ian-r-rose commented 5 years ago

Thanks for the quick response @jacobtomlinson.

I'm still a bit puzzled about this. Based on flipping through the examples, it looks like the intake-xarray driver is a bit unusual in that it registers a new container type of 'xarray'.

Many of the other examples seem to take data from more specific data sources and put them into more generic containers (so the drivers for spark, parquet, and sql all dump into 'dataframe' containers). So I guess a related question is: is a geodataframe different enough to justify a new container type?

Happy to help with extending this as necessary!

jacobtomlinson commented 5 years ago

Interesting questions. I don't really have an answer. Perhaps @martindurant or @jsignell would have some advice?

martindurant commented 5 years ago

Your basic understanding of Intake's model is correct: we want a limited number of container types, but many drivers. In the case of xarray, there are many drivers which produce the one new type of container. In the case here, it might be reasonable to have a geopandas-specific container, but the more generic dataframe one is probably fine for all practical purposes. This is largely up to the preferences of developers.

The current implementation here essentially calls geopandas.read_file and has the name "shape". I suppose that name could be more specific ("geopandas_shape"?). I don't know geopandas well, so I don't know the range of files that it could read.

When it comes to PostGIS, it seems to me that is indeed a different driver, because it uses a different method (in the same library) with a different set of arguments - and it is conceptually different in what it does. That driver should probably live in this library, though.
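To make the difference in arguments concrete, here is a rough sketch of the two entry points side by side. The wrapper names `read_geo_file` and `read_geo_table` are made up for illustration; `geopandas.read_file` and `geopandas.read_postgis` are the geopandas functions being wrapped, and only a few of their parameters are shown.

```python
def read_geo_file(urlpath, **kwargs):
    """File-based sources: a path or URL is all that is required."""
    import geopandas  # imported lazily, only when data is actually read

    return geopandas.read_file(urlpath, **kwargs)


def read_geo_table(sql, con, geom_col="geom", crs=None):
    """Database sources: a query, a live connection, and the geometry column."""
    import geopandas

    return geopandas.read_postgis(sql, con, geom_col=geom_col, crs=crs)
```

A file source is described by a location alone, while a PostGIS source needs a query plus a connection, which is why the two would not share a single set of driver arguments.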

By the way, intake-postgres does contain some references to GIS and may be worth a look. I don't think it's particularly fit for purpose, as it is based on the ancient postgres-adapter. On the other hand, parsing WKB in python is probably slow, so there may well be a lot of room for improvement - but that can happen within geopandas.

ian-r-rose commented 5 years ago

One point where it seems to me like it might be good to have a geodataframe-specific container: persisting/exporting data. In that case, it is probably better to use one of the OGR-supported drivers for writing to disk, rather than the default of parquet.
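As a rough illustration of what such an export could look like, the sketch below picks an OGR driver from the output file's extension and passes it to `GeoDataFrame.to_file`. The helper names (`ogr_driver_for`, `export_geodataframe`) and the small extension table are assumptions for this example, not part of intake or this plugin.

```python
from pathlib import Path

# Assumed mapping from file extension to OGR driver name (deliberately small).
OGR_DRIVERS = {
    ".shp": "ESRI Shapefile",
    ".geojson": "GeoJSON",
    ".gpkg": "GPKG",
}


def ogr_driver_for(path):
    """Return the OGR driver name for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return OGR_DRIVERS[suffix]
    except KeyError:
        raise ValueError(f"no OGR driver known for {suffix!r}") from None


def export_geodataframe(gdf, path):
    """Write a GeoDataFrame to disk with a driver chosen from the extension."""
    gdf.to_file(path, driver=ogr_driver_for(path))
    return path
```

A dataframe-generic container would have no reason to know about these drivers, which is the argument for a geodataframe-specific one.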