jGaboardi / geopackage_tutorial

Quick tutorial for reading and writing GeoPackages
BSD 3-Clause "New" or "Revised" License
6 stars 0 forks source link

[ENH] Query GeoPackages? #1

Open darribas opened 4 years ago

darribas commented 4 years ago

Fantastic resource! Thanks very much for putting it together. This prompted me a question I've had for a while about GeoPackage as a format. Since it's essentially a SQLite file with some delimited structure, do you know if it's possible/how to query a GeoPackage as a SQL file? At the very least, it should be possible to pull out records matching some attribute condition (e.g. SELECT * WHERE Column_1 == "x"). In an ideal world, you might also be able to do spatial queries (e.g. points within a 500m radious of point x, etc.). Do you know if any of this is possible and if so how? I think it'd make an awesome addition to the resource.

jGaboardi commented 4 years ago

I actually am a complete beginner with geopackages. Any contributions are welcome!

darribas commented 4 years ago

Yeah I have no clue... Maybe @jorisvandenbossche or @ljwolf???

jGaboardi commented 4 years ago

Though not Python related, .gpkg files can easily be queried with Firefox's SQLite Manager. Just drag and drop in the .gpkg and query. With the sample-geopackage.gpkg:

PRAGMA table_info(points)
cid name type notnull dflt_value pk
1 0 fid INTEGER 1   1
2 1 geom POINT 0   0
3 2 ID TEXT 0   0
SELECT * FROM points
  fid geom ID
1 1 71,80,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 one
2 2 71,80,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,240,63,0,0,0,0,0,0,0,64 two
3 3 71,80,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,0,0 three
SELECT * FROM points WHERE ID == "one"
fid geom ID
1 1 71,80,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 one
jGaboardi commented 4 years ago

... though this does seem to be giving strange results in geometry... the geopackage seems to have gotten projected.

martinfleis commented 4 years ago

In theory, it should be possible to query gpkg directly from Geopandas. In gpd.read_file, Fiona passes kwargs as driver-specific parameters that will be interpreted by the OGR library as layer creation or opening options (docs). And GDAL GPKG driver mentions SQL. But I was not able to do so, but I am not sure what should be the exact syntax for that. This below seems to ignore the query.

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))

world.to_file('package.gpkg', layer='world')
cities.to_file('package.gpkg', layer='cities')

gpd.read_file('package.gpkg', layer='world', sql='SELECT * WHERE pop_est > 10000000')

Maybe @sgillies will know.

darribas commented 4 years ago

cool! The GDAL page also mentions a link to Spatialite. It'd be super nice if that allowed you to query from read_file ton only read from disk the rows you really wanted.

martinfleis commented 4 years ago

@darribas spatially you can now filter using bbox to load only part of the geometry within set bounding box (or extent of another geoSeries). That should work with any filetypes. Since geopandas 0.7.0 you'll be able to filer by mask or filter rows (https://github.com/geopandas/geopandas/pull/1160). But figuring out how to query gkpg would be way cooler :).

ljwolf commented 4 years ago

Yeah, I think you should be able to use that query in theory, but I'm not getting any results using that from geopandas or from fiona directly.

I though maybe you needed to force the sql to query against the layer, sql = 'select * from world where pop_est > 10000000', but that still fails.

the fiona documentation suggests that you should use the builtin filter() rather than an attribute filter from OGR, so we could try to add that to geopandas here using similar logic to the bbox filter: The user defines a lambda, say

geopandas.read_file('package.gpkg', ..., filter = lambda feature: feature[
'properties']['pop_est'] > 1000000`)

and we evaluate that lambda against the feature in the input to from_features:

from builtins import filter as filter_items
gdf = GeoDataFrame.from_features(filter_items(f_filt, filter), crs=crs, columns=columns)

We'd still read each feature, but not all at once, since that filter_items call would still be lazy.

sgillies commented 4 years ago

@martinfleis Fiona has no support for OGR's SQL. There are only a few formats that support the attribute indexing that would make those attribute queries much faster than any filter you'd write in Python or do with Pandas. GeoPackage is one, but I'm not inclined to add to change Fiona for this case. Supporting Fiona's existing features is a lot of work already and I'm reluctant to add more.

martinfleis commented 4 years ago

@ljwolf That looks cool. Do you want to move this discussion to geopandas repo? It could work nicely with the PR providing support of row filtering and masking as a bigger update of selective reading capabilities.

darribas commented 4 years ago

Just for my memory, I'll leave here a couple of links I've come across related:

Just a though: given GeoPackages are a special file format in that they can be seen as a file or a db (they're both), would it make sense to make it easy to exploit the db side of it through read_postgis? The workflow being here if you just want to read a whole table you'd use read_file, if you want a more complicated query that potentially involves spatialite, you'd do it through read_postgis?

ljwolf commented 4 years ago

@martinfleis definitely!

brendan-ward commented 4 years ago

Note that the raw bytes stored in the Geopackage for geometries are a header plus WKB, per the spec: http://www.geopackage.org/spec/#gpb_format

This makes them really awkward to use in simple raw SQL queries like the above.

You either need to query them with conversion back to spatial types using a spatial engine that can handle that conversion (looks like both ogr and spatialite do that for you), or you have to handle that conversion yourself outside the database.

In case it is useful, I have a very rudimentary pass at writing geopackages from pygeos geometries here. This might be useful for future versions of geopandas that use pygeos geometries internally, if it seemed appropriate to write custom I/O for geopackages. While it established it is possible to get faster I/O using pygeos and numpy to handle the conversion to the geometry encoding format above, it has a bunch of rough edges (some of which likely need to move down to a C extension / Cython level).