simonw opened 2 years ago
The `conversions=` argument to `.insert()` and friends is designed to handle this case, but I don't think it's very elegant: https://sqlite-utils.datasette.io/en/stable/python-api.html#converting-column-values-using-sql-functions
```python
db["places"].insert(
    {"name": "Wales", "geometry": wkt},
    conversions={"geometry": "GeomFromText(?, 4326)"},
)
```
I wonder what the most developer-friendly way to insert geometry data into SpatiaLite is?
From https://www.gaia-gis.it/gaia-sins/spatialite-sql-latest.html it looks like these are the main options:
- `GeomFromText( wkt String [ , SRID Integer] )`
- `GeomFromWKB( wkbGeometry Binary [ , SRID Integer] )`
- `GeomFromKml( KmlGeometry String )`
- `GeomFromGML( gmlGeometry String )`
- `GeomFromGeoJSON( geoJSONGeometry String )`
- `GeomFromEWKB( ewkbGeometry String )`
- `GeomFromEWKT( ewktGeometry String )`
- `GeomFromFGF( fgfGeometry Binary [ , SRID Integer] )`
- `GeomFromTWKB( twkbGeometry BLOB [ , SRID Integer] )`
- `GeomFromGPB( geom GPKG Blob Geometry )` - GeoPackage format
- `GeomFromExifGpsBlob( image BLOB )`
Interesting that some accept an SRID and others do not - presumably `GeomFromGeoJSON()` always uses SRID=4326?
I can't seem to get `GeomFromGeoJSON()` to work - example:
```sql
select IsValid(SetSRID(GeomFromGeoJSON('{
    "type": "Point",
    "coordinates": [
        -94.921875,
        45.460130637921004
    ]
}'), 4326))
```
Returns `-1`, suggesting the geometry is not valid. Just doing this (with or without that `SetSRID()` function) returns `null`:
```sql
select SetSRID(GeomFromGeoJSON('{
    "type": "Point",
    "coordinates": [
        -94.921875,
        45.460130637921004
    ]
}'), 4326)
```
Wow, it was the newlines that broke it! This works fine:
```sql
select AsWKT(SetSRID(GeomFromGeoJSON('{"type": "Point","coordinates": [-94.921875,45.460130637921004]}'), 4326))
```
And removing `SetSRID()` returns exactly the same result.
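Given that newlines were the culprit, one workaround when the GeoJSON starts life as a Python dictionary is to serialize it with `json.dumps()`, which produces a compact single-line string by default (a sketch, not part of sqlite-utils):

```python
import json

# Build the GeoJSON as a Python dictionary, then serialize it with
# json.dumps() - the default output is a single line with no embedded
# newlines, which sidesteps the GeomFromGeoJSON() failure seen above.
geometry = {
    "type": "Point",
    "coordinates": [-94.921875, 45.460130637921004],
}
compact = json.dumps(geometry)
print(compact)
# The resulting string can then be bound to GeomFromGeoJSON(?) as a
# regular SQL parameter.
```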
I'm trying to think of ways to make this nicer from the perspective of someone calling the `.insert()` or `.insert_all()` methods against a table that has geometry columns.
One option would be for the code to introspect the table (if it exists) before running the insert, looking for any geometry columns.
This introspection isn't easy! The table schema just gives you `"name_of_column" point` or similar - to figure out the SRID and suchlike you need to consult the `geometry_columns` table, I think - which throws a 500 error on https://calands.datasettes.com/calands/geometry_columns for some reason. Also, does the shape of that table change between SpatiaLite versions?
Assuming we can introspect the table, what would we do with that information? We could add code that detects if the user attempted to pass GeoJSON objects and automatically inserts a `GeomFromGeoJSON()` function call - but detecting GeoJSON is a bit weird, and GeoJSON also isn't necessarily the nicest format for populating e.g. latitude/longitude points.
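To make the "detecting GeoJSON is a bit weird" point concrete, here is a rough sketch of what such a heuristic might look like - the function name and exact rules are invented for illustration:

```python
def looks_like_geojson(value):
    # Hypothetical heuristic: a dict with a "type" key plus one of the
    # keys that GeoJSON geometry/feature objects carry. It illustrates
    # why detection feels awkward - it is guesswork, not a guarantee.
    if not isinstance(value, dict):
        return False
    return "type" in value and any(
        key in value
        for key in ("coordinates", "geometries", "features", "geometry")
    )
```

For example, `looks_like_geojson({"type": "Point", "coordinates": [0, 1]})` passes, while an arbitrary dict that happens to have a `"type"` key does not.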
Maybe we just support the simplest possible case: a tuple of floats, which we assume is `latitude, longitude` (or should we expect `longitude, latitude`, the eternal debate?) - if those are used against a geometry table (especially a point table) we assume they are coordinates that need to be converted using `GeomFromText('POINT(...`.
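The conversion step itself is trivial - a sketch of the tuple-to-WKT transformation, assuming `(latitude, longitude)` input (the helper name is hypothetical):

```python
def point_wkt(pair):
    # Assumes the tuple is (latitude, longitude); WKT puts longitude
    # (the x value) first, so the order is swapped on the way out.
    latitude, longitude = pair
    return "POINT({} {})".format(longitude, latitude)

print(point_wkt((51.509865, -0.118092)))
# POINT(-0.118092 51.509865)
```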
Not crazy about either of these ideas. Is there something better?
Useful thoughts on Twitter regarding making coordinate pairs easy and more complex shapes possible: https://twitter.com/dbreunig/status/1490099303888547843

> That is exactly where I was going: two modes.
>
> - Heuristics and assumptions to get coordinates as a pair (in a tuple) or as columns (look for lat, lon, latitude, longitude, etc).
> - GIS mode with projections, polys, etc.
>
> Make it easy for people with CSVs of coordinates. If you're using GeoJSON or shp files, you have to specify.
Here's an idea for an API design:
```python
geojson_geometry = {}  # ... GeoJSON goes here

db["places"].insert(
    {"name": "Wales", "geometry": geojson_geometry},
    geojson="geometry"
)
```
That `geojson=` parameter takes either a single column name or an iterable of column names. Any column in that list is expected to be a compatible geometry, and the correct conversion functions will be applied.
That solves for GeoJSON, but it's a bit ugly. Should I add `wkt=` and maybe even `kml=` and `gml=` and so on too? Definitely not - that's way too many ugly and inscrutable new parameters.
More importantly: if I want to support the following how would I do it?
```python
db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)}
)
```
Here I want to provide a `(latitude, longitude)` pair and have it inserted correctly into a `point` column.
Could do this, but again it's messy:
```python
db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)},
    point="point"
)
```
And again, what about those `(longitude, latitude)` people?
Maybe I should leave this entirely up to documented patterns in the `conversions={}` dictionary?
But even that's not ideal for the co-ordinate case. Consider the following:
```python
db["places"].insert(
    {"name": "London", "point": (51.509865, -0.118092)},
    conversions={"point": "GeomFromText(?, 4326)"},
)
```
The challenge here is that the SpatiaLite function `GeomFromText()` expects a WKT string, which looks like this:

```
POINT(-0.118092 51.509865)
```
The existing `conversions=` mechanism doesn't support applying Python code to convert the `(lat, lon)` tuple to that value. It doesn't even support passing a Python tuple as a `?` parameter - so I don't think I could come up with a SQL string that would do the right thing here either.
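That limitation is easy to demonstrate with Python's standard `sqlite3` module - binding a tuple as a single `?` parameter raises an error:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
error = None
try:
    # A tuple is not one of the types SQLite can bind as a parameter
    # (None, int, float, str, bytes), so this fails before any SQL runs.
    # The exact exception class has varied between Python versions.
    conn.execute("select ?", ((51.509865, -0.118092),))
except (sqlite3.InterfaceError, sqlite3.ProgrammingError) as ex:
    error = ex
print("binding failed:", error)
```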
So maybe back to that earlier idea where the code introspects the table, figures out that `"point"` is a geometry column of type POINT, then applies the necessary conversions to the raw Python data? That feels overly complicated to me, especially since nothing else in the `.insert()` method currently relies on table introspection.
Another idea: introduce a helper function transform pattern, something a bit like this:
```python
transformer = make_transformer({
    "point": lambda pair: "POINT({} {})".format(pair[1], pair[0])
})
db["places"].insert_all(
    transformer([{"name": "London", "point": (51.509865, -0.118092)}]),
    conversions={"point": "GeomFromText(?, 4326)"},
)
```
The `make_transformer(...)` function builds an object that can work as a wrapping iterator, applying those transform functions to everything in the sequence that it wraps. So the above code would handle converting `(lat, lon)` to `POINT(lon lat)` - then the `conversions=` applies `GeomFromText`.
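A minimal sketch of what that `make_transformer()` helper could look like - this is an illustration of the idea, not an existing sqlite-utils API:

```python
def make_transformer(transforms):
    # Returns a callable that wraps any iterable of row dictionaries,
    # lazily applying each transform function to its matching column
    # and passing every other column value through unchanged.
    def transformer(rows):
        for row in rows:
            yield {
                key: transforms[key](value) if key in transforms else value
                for key, value in row.items()
            }
    return transformer

transformer = make_transformer({
    "point": lambda pair: "POINT({} {})".format(pair[1], pair[0])
})
rows = list(transformer([{"name": "London", "point": (51.509865, -0.118092)}]))
print(rows)
# [{'name': 'London', 'point': 'POINT(-0.118092 51.509865)'}]
```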
Naming is a challenge here: `.transform()` and `.convert()` and `conversions=` all have existing meanings within the `sqlite-utils` Python library.
It's also a bit of a messy way of solving this. It's not exactly a smooth API for inserting a bunch of lat/lon coordinate pairs!
Note that GeoJSON itself uses `(longitude, latitude)`, so I should probably stick to that order here too: https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.1
Here's the definitive guide to `latitude, longitude` vs. `longitude, latitude`: https://macwright.com/lonlat/

> Which is right?
>
> Neither. This is an opinion with no right answer. Geographical tradition favors lat, lon. Math and software prefer lon, lat.
I asked on Twitter here: https://twitter.com/simonw/status/1490148001569906688
Another idea - my favourite option so far:
```python
from sqlite_utils.utils import LongitudeLatitude

db["places"].insert(
    {
        "name": "London",
        "point": (-0.118092, 51.509865)
    },
    conversions={"point": LongitudeLatitude},
)
```
Here `LongitudeLatitude` is a magical value which does TWO things: it sets up the `GeomFromText(?, 4326)` SQL function, and it handles converting the `(-0.118092, 51.509865)` tuple into a `POINT({} {})` string.
This would involve a change to the `conversions=` contract - where it usually expects a SQL string fragment, it could also take an object which combines that SQL string fragment with a Python conversion function.
Best of all... this resolves the `lat, lon` vs. `lon, lat` dilemma, because you can use `from sqlite_utils.utils import LongitudeLatitude` OR `from sqlite_utils.utils import LatitudeLongitude` depending on which you prefer!
Thinking about types. The type of the `conversions` parameter right now is a bit lazy:

```python
conversions: Optional[dict] = None,
```

That becomes:

```python
Optional[Dict[str, Union[str, Conversion]]]
```

Where `Conversion` is an abstract base class which expects implementations to have a `.sql() -> str` and a `.convert(value) -> str` method.
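A sketch of what that `Conversion` base class and a `LongitudeLatitude` implementation might look like - the method names follow the proposal above, but everything else is illustrative, including the choice to instantiate the class rather than pass it directly:

```python
from abc import ABC, abstractmethod

class Conversion(ABC):
    @abstractmethod
    def sql(self) -> str:
        "Return the SQL fragment to use, with a ? placeholder for the value."

    @abstractmethod
    def convert(self, value) -> str:
        "Convert the Python value into the string bound to that ? parameter."

class LongitudeLatitude(Conversion):
    def sql(self) -> str:
        return "GeomFromText(?, 4326)"

    def convert(self, value) -> str:
        longitude, latitude = value
        return "POINT({} {})".format(longitude, latitude)

lonlat = LongitudeLatitude()
print(lonlat.sql())                            # GeomFromText(?, 4326)
print(lonlat.convert((-0.118092, 51.509865)))  # POINT(-0.118092 51.509865)
```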
I like the idea of having stock conversions you could import. I'd actually move them to a dedicated module (call it `sqlite_utils.conversions` or something), because it's different from other utilities. Maybe they even take configuration, or they're composable.
```python
from sqlite_utils.conversions import LongitudeLatitude

db["places"].insert(
    {
        "name": "London",
        "lng": -0.118092,
        "lat": 51.509865,
    },
    conversions={"point": LongitudeLatitude("lng", "lat")},
)
```
I would definitely use that for every CSV I get with lat/lng columns where I actually need GeoJSON.
That example you have there is really neat - I like the idea that they can also be used to populate completely new columns that are derived from the other column inputs.
All this said, I don't think it's unreasonable to point people to dedicated tools like `geojson-to-sqlite`. If I'm dealing with a bunch of GeoJSON or Shapefiles, I need something to read those anyway (or I need to figure out virtual tables). But something like this might make it easier to build those libraries, or standardize the underlying parts.
I wonder if there are any interesting non-geospatial canned conversions that it would be worth including?
Yeah, having this be a general purpose mechanism which has a few canned examples for handling geospatial stuff is a lot neater than having a mechanism for this that's exclusive to SpatiaLite.
> I wonder if there are any interesting non-geospatial canned conversions that it would be worth including?
Off the top of my head:

- `urlparse`

Some of this is easy enough with SQL functions, some is easier in Python. Maybe that's where having pre-built classes gets really handy, because it saves you from thinking about which way it's implemented.
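As a sketch of what that `urlparse` canned conversion might do, using only the standard library - the function name and output shape are invented for illustration:

```python
import json
from urllib.parse import urlparse

def url_components(value):
    # Hypothetical non-geospatial conversion: break a URL string into
    # its components, serialized as JSON so the result can be stored
    # in a single text column.
    parsed = urlparse(value)
    return json.dumps({
        "scheme": parsed.scheme,
        "netloc": parsed.netloc,
        "path": parsed.path,
        "query": parsed.query,
    })

print(url_components("https://datasette.io/plugins?q=geo"))
```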
> Interesting that some accept an SRID and others do not - presumably `GeomFromGeoJSON()` always uses SRID=4326?
The EWKT/EWKB ones don't accept an SRID because EWKT encodes the SRID in the string, so you would do this with a WKT string:

```sql
GeomFromText('POINT(529090 179645)', 27700)
```

but for EWKT it would be:

```sql
GeomFromEWKT('SRID=27700;POINT(529090 179645)')
```
The specs for KML and GeoJSON specify a Coordinate Reference System for the format. GML can specify the SRID in the XML at feature level, e.g.:

```xml
<gml:Point srsName="EPSG:27700">
    <gml:coordinates>529090, 179645</gml:coordinates>
</gml:Point>
```
There's a few more obscure formats in there, but broadly I think it is safe to assume an SRID param exists on the function for cases where the SRID is not implied by or specified in the input format.
I wonder if I could implement the above such that this also works:
```python
db["places"].insert(
    {
        "name": "London",
        "point": LongitudeLatitude(-0.118092, 51.509865)
    }
)
```
This feels like a very natural way to work with single inserts.
The challenge is writing the code inside `.insert_all()` such that it can handle these special objects in the input column values, in addition to them being passed in `conversions=`.
I'm feeling very good about this direction in general though - it feels like it takes the existing but not particularly elegant `conversions=` mechanism and upgrades it to be far more useful, while maintaining backwards compatibility.
Moving the design of this new `Conversion` subclass mechanism to:
Coming back to this. I was about to add a utility function to datasette-geojson to convert lat/lng columns to geometries. Thankfully I googled first. There's a SpatiaLite function for this: `MakePoint()`.
```sql
select MakePoint(longitude, latitude) as geometry from places;
```
I'm not sure if that would work with `conversions`, since it needs two columns, but it's an option for tables that already have latitude and longitude columns.
Using this thread and some other resources I managed to cobble together a couple of sqlite-utils lines to add a geometry column for a table that already has a lat/lng column.
```shell
# add a geometry column
sqlite-utils add-geometry-column [db name] [table name] geometry --type POINT --srid 4326

# add a point for each row to geometry column
sqlite-utils --load-extension=spatialite [db name] 'update [table name] SET Geometry=MakePoint(longitude, latitude, 4326);'
```
In playing with the new SpatiaLite helpers from #385 I noticed that actually populating geometry columns is still a little bit tricky. Here's what I ended up doing:
That last line took some figuring out - especially the need for the SRID of `4326`, without which I got an error.

It would be good to document this in more detail, but ideally also to come up with a more obvious pattern for inserting common types of spatial data.
Also related:

- #398
- #79