simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

Make it easier to insert geometries, with documentation and maybe code #399

Open simonw opened 2 years ago

simonw commented 2 years ago

In playing with the new SpatiaLite helpers from #385 I noticed that actually populating geometry columns is still a little bit tricky. Here's what I ended up doing:

import httpx, sqlite_utils
db = sqlite_utils.Database("/tmp/spatial.db")
attractions = httpx.get("https://latest.datasette.io/fixtures/roadside_attractions.json?_shape=array").json()
db["attractions"].insert_all(attractions, pk="pk")

# Schema of that table is now:
# CREATE TABLE [attractions] (
#    [pk] INTEGER PRIMARY KEY,
#    [name] TEXT,
#    [address] TEXT,
#    [latitude] FLOAT,
#    [longitude] FLOAT
# )

db.init_spatialite()
db["attractions"].add_geometry_column("point", "POINT")

db.execute("""
    update attractions set point = GeomFromText(
      'POINT(' || longitude || ' ' || latitude || ')', 4326
    )
""")

That last line took some figuring out - especially the need for the SRID of 4326, without which I got this error:

IntegrityError: attractions.point violates Geometry constraint [geom-type or SRID not allowed]

It would be good to document this in more detail, but ideally also to come up with a more obvious pattern for inserting common types of spatial data.

Also related:

simonw commented 2 years ago

The conversions= argument to .insert() and friends is designed to handle this case, but I don't think it's very elegant: https://sqlite-utils.datasette.io/en/stable/python-api.html#converting-column-values-using-sql-functions

db["places"].insert(
    {"name": "Wales", "geometry": wkt},
    conversions={"geometry": "GeomFromText(?, 4326)"},
)
simonw commented 2 years ago

I wonder what the most developer-friendly way to insert geometry data into SpatiaLite is?

From https://www.gaia-gis.it/gaia-sins/spatialite-sql-latest.html it looks like these are the main options:

Interesting that some accept an SRID and others do not - presumably GeomFromGeoJSON() always uses SRID=4326?

simonw commented 2 years ago

I can't seem to get GeomFromGeoJSON() to work - example:

https://calands.datasettes.com/calands?sql=select+IsValid%28SetSRID%28GeomFromGeoJSON%28%27%7B%0D%0A++++++++%22type%22%3A+%22Point%22%2C%0D%0A++++++++%22coordinates%22%3A+%5B%0D%0A++++++++++-94.921875%2C%0D%0A++++++++++45.460130637921004%0D%0A++++++++%5D%0D%0A++++++%7D%27%29%2C+4326%29%29

select IsValid(SetSRID(GeomFromGeoJSON('{
        "type": "Point",
        "coordinates": [
          -94.921875,
          45.460130637921004
        ]
      }'), 4326))

Returns -1 suggesting the geometry is not valid. Just doing this (with or without that SetSRID() function) returns null:

select SetSRID(GeomFromGeoJSON('{
        "type": "Point",
        "coordinates": [
          -94.921875,
          45.460130637921004
        ]
      }'), 4326)
simonw commented 2 years ago

Wow, it was the newlines that broke it! This works fine:

select AsWKT(SetSRID(GeomFromGeoJSON('{"type": "Point","coordinates": [-94.921875,45.460130637921004]}'), 4326))

https://calands.datasettes.com/calands?sql=select+AsWKT%28SetSRID%28GeomFromGeoJSON%28%27%7B%22type%22%3A+%22Point%22%2C%22coordinates%22%3A+%5B-94.921875%2C45.460130637921004%5D%7D%27%29%2C+4326%29%29

And removing SetSRID() returns exactly the same result:

https://calands.datasettes.com/calands?sql=select+AsWKT%28GeomFromGeoJSON%28%27%7B%22type%22%3A+%22Point%22%2C%22coordinates%22%3A+%5B-94.921875%2C45.460130637921004%5D%7D%27%29%29
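Since literal newlines were the problem, one way to avoid them from Python is to serialize the GeoJSON with `json.dumps()`, which emits a compact single-line string by default (a sketch; the variable names are illustrative):

```python
import json

geojson = {
    "type": "Point",
    "coordinates": [-94.921875, 45.460130637921004],
}

# json.dumps() produces a single-line string with no embedded newlines,
# which GeomFromGeoJSON() appears to require
compact = json.dumps(geojson)
# compact could then be bound as a ? parameter to GeomFromGeoJSON(?)
```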

simonw commented 2 years ago

I'm trying to think of ways to make this nicer from the perspective of someone calling the .insert() or .insert_all() methods against a table that has geometry columns.

One option would be for the code to introspect the table (if it exists) before running the insert, looking for any geometry columns.

This introspection isn't easy! The table schema just gives you "name_of_column" point or similar - to figure out the SRID and suchlike you need to consult the geometry_columns table, I think - which throws a 500 error on https://calands.datasettes.com/calands/geometry_columns for some reason. Also does the shape of that table change between SpatiaLite versions?

Assuming we can introspect the table, what would we do with that information? We could add code that detects if the user attempted to pass GeoJSON objects and automatically inserts a GeomFromGeoJSON() function call - but detecting GeoJSON is a bit weird, and GeoJSON also isn't necessarily the nicest format for populating e.g. latitude/longitude points.

Maybe we just support the simplest possible case: a tuple of floats, which we assume is latitude, longitude (or should we expect longitude, latitude, the eternal debate?) - if those are used against a geometry table (especially a point table) we assume they are coordinates that need to be converted using GeomFromText('POINT(....

Not crazy about either of these ideas. Is there something better?
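For the introspection route, here's a rough sketch of what consulting geometry_columns might look like, using a hypothetical helper. The metadata table's exact shape may differ between SpatiaLite versions, so this is illustrative only:

```python
import sqlite3

def geometry_column_info(conn, table):
    # Hypothetical helper: look up geometry columns for a table in
    # SpatiaLite's geometry_columns metadata table. The column layout
    # sketched here may not match every SpatiaLite version.
    rows = conn.execute(
        "select f_geometry_column, geometry_type, srid "
        "from geometry_columns where f_table_name = ?",
        (table,),
    ).fetchall()
    return {name: {"type": gtype, "srid": srid} for name, gtype, srid in rows}
```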

simonw commented 2 years ago

Useful thoughts on Twitter regarding making coordinate pairs easy and more complex shapes possible:

https://twitter.com/dbreunig/status/1490099303888547843

That is exactly where I was going: two modes.

  1. Heuristics and assumptions to get coordinates as a pair (in tuple) or as columns (look for lat, lon, latitude, longitude, etc).
  2. GIS mode with projections, polys, etc

Make it easy for people with csvs of coordinates. If you're using Geojson or shp files, you have to specify.

simonw commented 2 years ago

Here's an idea for an API design:

geojson_geometry = {} # ... GeoJSON goes here
db["places"].insert(
    {"name": "Wales", "geometry": geojson_geometry},
    geojson="geometry"
)

That geojson= parameter takes either a single column name or an iterable of column names. Any column in that list is expected to be a compatible geometry and the correct conversion functions will be applied.

That solves for GeoJSON, but it's a bit ugly. Should I add wkt= and maybe even kml= and gml= and so-on too? Definitely not, that's way too many ugly and inscrutable new parameters.

More importantly: if I want to support the following how would I do it?

db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)}
)

Here I want to provide a (latitude, longitude) pair and have it inserted correctly into a point column.

Could do this, but again it's messy:

db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)},
    point="point"
)

And again, what about those (longitude, latitude) people?

simonw commented 2 years ago

Maybe I should leave this entirely up to documented patterns in the conversions={} dictionary?

But even that's not ideal for the co-ordinate case. Consider the following:

db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)},
    conversions={"point": "GeomFromText(?, 4326)"},
)

The challenge here is that the SpatiaLite function GeomFromText() expects a WKT string, which looks like this:

POINT(-0.118092 51.509865)

The existing conversions= mechanism doesn't support applying Python code to convert the (lat, lon) tuple to that value.

It doesn't even support passing a Python tuple as a ? parameter - so I don't think I could come up with a SQL string that would do the right thing here either.
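The Python-side step that conversions= cannot currently express is tiny, which makes the gap more frustrating. As a sketch, with a hypothetical helper name, the missing conversion is just:

```python
def latlon_to_wkt(pair):
    # Convert a (latitude, longitude) tuple into the WKT form that
    # GeomFromText() expects - note WKT uses longitude-first order
    lat, lon = pair
    return "POINT({} {})".format(lon, lat)
```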

simonw commented 2 years ago

So maybe back to that earlier idea where the code introspects the table, figures out that "point" is a geometry column of type POINT, then applies the necessary conversions to the raw Python data?

That feels overly-complicated to me, especially since nothing else in the .insert() method currently relies on table introspection.

simonw commented 2 years ago

Another idea: introduce a helper function transform pattern, something a bit like this:


transformer = make_transformer({
  "point": lambda pair: "POINT({} {})".format(pair[1], pair[0])
})

db["places"].insert_all(
    transformer([{"name": "London", "point": (51.509865, -0.118092)}]),
    conversions={"point": "GeomFromText(?, 4326)"},
)

The make_transformer(...) function builds an object that can work as a wrapping iterator, applying those transform functions to everything in the sequence that it wraps.

So the above code would handle converting (lat, lon) to POINT(lon lat) - then the conversions= applies GeomFromText.

Naming is a challenge here: .transform() and .convert() and conversions= all have existing meanings within the sqlite-utils Python library.

It's also a bit of a messy way of solving this. It's not exactly a smooth API for inserting a bunch of lat/lon coordinate pairs!
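A minimal sketch of what make_transformer() could look like (hypothetical, not part of the current API):

```python
def make_transformer(fns):
    # Hypothetical: build a wrapping generator that applies the given
    # per-column functions to each record in a sequence, leaving other
    # columns untouched
    def transformer(records):
        for record in records:
            yield {
                key: fns[key](value) if key in fns else value
                for key, value in record.items()
            }
    return transformer
```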

simonw commented 2 years ago

Note that GeoJSON itself uses (longitude, latitude) so I should probably stick to that order here too. https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.1

simonw commented 2 years ago

Here's the definitive guide to latitude, longitude vs. longitude, latitude: https://macwright.com/lonlat/

Which is right?

Neither. This is an opinion with no right answer. Geographical tradition favors lat, lon. Math and software prefer lon, lat.

I asked on Twitter here: https://twitter.com/simonw/status/1490148001569906688

simonw commented 2 years ago

Another idea - my favourite option so far:

from sqlite_utils.utils import LongitudeLatitude

db["places"].insert(
    {
        "name": "London",
        "point": (-0.118092, 51.509865)
    },
    conversions={"point": LongitudeLatitude},
)

Here LongitudeLatitude is a magical value which does TWO things: it sets up the GeomFromText(?, 4326) SQL function, and it handles converting the (51.509865, -0.118092) tuple into a POINT({} {}) string.

This would involve a change to the conversions= contract - where it usually expects a SQL string fragment, but it can also take an object which combines that SQL string fragment with a Python conversion function.

Best of all... this resolves the lat, lon vs. lon, lat dilemma because you can use from sqlite_utils.utils import LongitudeLatitude OR from sqlite_utils.utils import LatitudeLongitude depending on which you prefer!

simonw commented 2 years ago

Thinking about types. The type of the conversions parameter right now is a bit lazy:

conversions: Optional[dict] = None,

That becomes:

Optional[Dict[str, Union[str, Conversion]]]

Where Conversion is an abstract base class which expects implementations to have a .sql() -> str and a .convert(value) -> str method.
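A sketch of what those shapes might look like, following the .sql() / .convert(value) contract described above (hypothetical implementation, not the library's actual code):

```python
from abc import ABC, abstractmethod

class Conversion(ABC):
    # Hypothetical abstract base class pairing a SQL fragment with a
    # Python-side value conversion
    @abstractmethod
    def sql(self) -> str: ...

    @abstractmethod
    def convert(self, value) -> str: ...

class LongitudeLatitude(Conversion):
    # Accepts a (longitude, latitude) pair; emits WKT plus the
    # GeomFromText(?, 4326) SQL wrapper
    def sql(self) -> str:
        return "GeomFromText(?, 4326)"

    def convert(self, value) -> str:
        longitude, latitude = value
        return "POINT({} {})".format(longitude, latitude)
```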

eyeseast commented 2 years ago

I like the idea of having stock conversions you could import. I'd actually move them to a dedicated module (call it sqlite_utils.conversions or something), because it's different from other utilities. Maybe they even take configuration, or they're composable.

from sqlite_utils.conversions import LongitudeLatitude

db["places"].insert(
    {
        "name": "London",
        "lng": -0.118092,
        "lat": 51.509865,
    },
    conversions={"point": LongitudeLatitude("lng", "lat")},
)

I would definitely use that for every CSV I get with lat/lng columns where I actually need GeoJSON.
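A hedged sketch of that configurable flavour, assuming the conversion object gets access to the whole row rather than a single value (a different contract from the one discussed above; purely illustrative):

```python
class LongitudeLatitude:
    # Hypothetical configurable conversion: reads the named lng/lat
    # columns from a whole row and emits a GeoJSON Point
    def __init__(self, lng_column, lat_column):
        self.lng_column = lng_column
        self.lat_column = lat_column

    def convert(self, row):
        return {
            "type": "Point",
            "coordinates": [row[self.lng_column], row[self.lat_column]],
        }
```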

simonw commented 2 years ago

That example you have there is really neat - I like the idea that they can also be used to populate completely new columns that are derived from the other column inputs.

eyeseast commented 2 years ago

All this said, I don't think it's unreasonable to point people to dedicated tools like geojson-to-sqlite. If I'm dealing with a bunch of GeoJSON or Shapefiles, I need something to read those anyway (or I need to figure out virtual tables). But something like this might make it easier to build those libraries, or standardize the underlying parts.

simonw commented 2 years ago

I wonder if there are any interesting non-geospatial canned conversions that it would be worth including?

simonw commented 2 years ago

Yeah, having this be a general purpose mechanism which has a few canned examples for handling geospatial stuff is a lot neater than having a mechanism for this that's exclusive to SpatiaLite.

eyeseast commented 2 years ago

I wonder if there are any interesting non-geospatial canned conversions that it would be worth including?

Off the top of my head:

Some of this is easy enough with SQL functions, some is easier in Python. Maybe that's where having pre-built classes gets really handy, because it saves you from thinking about which way it's implemented.

chris48s commented 2 years ago

Interesting that some accept an SRID and others do not - presumably GeomFromGeoJSON() always uses SRID=4326?

The ewkt/ewkb ones don't accept an SRID because EWKT encodes the SRID in the string, so you would do this with a WKT string:

GeomFromText('POINT(529090 179645)', 27700)

but for ewkt it would be

GeomFromEWKT('SRID=27700;POINT(529090 179645)')

The specs for KML and GeoJSON specify a Coordinate Reference System for the format.

GML can specify the SRID in the XML at feature level, e.g.:

<gml:Point srsName="EPSG:27700">
  <gml:coordinates>529090, 179645</gml:coordinates>
</gml:Point>

There are a few more obscure formats in there, but broadly I think it is safe to assume an SRID param exists on the function for cases where the SRID is not implied by or specified in the input format.

simonw commented 2 years ago

I wonder if I could implement the above such that this also works:

db["places"].insert(
    {
        "name": "London",
        "point": LongitudeLatitude(-0.118092, 51.509865)
    }
)

This feels like a very natural way to work with single inserts.

The challenge is writing the code inside .insert_all() such that it can handle these special objects in the input column values in addition to them being passed in conversions=.

I'm feeling very good about this direction in general though, it feels like it takes the existing but not particularly elegant conversions= mechanism and upgrades it to be far more useful, while maintaining backwards compatibility.
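A sketch of how .insert_all() might pull such objects out of an input row (hypothetical helper and value object, assuming the object carries its own coordinates and exposes sql()/convert() methods):

```python
class LongitudeLatitude:
    # Hypothetical value object: holds a (longitude, latitude) pair and
    # knows both the SQL wrapper and the WKT conversion for it
    def __init__(self, longitude, latitude):
        self.longitude = longitude
        self.latitude = latitude

    def sql(self):
        return "GeomFromText(?, 4326)"

    def convert(self):
        return "POINT({} {})".format(self.longitude, self.latitude)


def split_record(record):
    # Hypothetical: separate plain values from conversion-style objects
    # found among the record's values, so the latter can be routed
    # through the conversions= machinery
    plain, conversions = {}, {}
    for key, value in record.items():
        if hasattr(value, "sql") and hasattr(value, "convert"):
            plain[key] = value.convert()
            conversions[key] = value.sql()
        else:
            plain[key] = value
    return plain, conversions
```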

simonw commented 2 years ago

Moving the design of this new Conversion subclass mechanism to:

eyeseast commented 2 years ago

Coming back to this. I was about to add a utility function to datasette-geojson to convert lat/lng columns to geometries. Thankfully I googled first. There's a SpatiaLite function for this: MakePoint.

select MakePoint(longitude, latitude) as geometry from places;

I'm not sure if that would work with conversions, since it needs two columns, but it's an option for tables that already have latitude, longitude columns.

chrislkeller commented 1 year ago

Using this thread and some other resources I managed to cobble together a couple of sqlite-utils lines to add a geometry column for a table that already has a lat/lng column.

# add a geometry column
sqlite-utils add-geometry-column [db name] [table name] geometry --type POINT --srid 4326

# add a point for each row to geometry column
sqlite-utils --load-extension=spatialite [db name] 'update [table name] SET Geometry=MakePoint(longitude, latitude, 4326);'