systemed / tilemaker

Make OpenStreetMap vector tiles without the stack
https://tilemaker.org/
Other
1.42k stars 228 forks source link

support GeoJSON lines format #651

Closed cldellow closed 6 months ago

cldellow commented 7 months ago

Described at https://stevage.github.io/ndgeojson/; each feature is given its own line, rather than being wrapped in a FeatureCollection.

This format is slightly easier to generate and plays a bit nicer with source control diffs. It's also possible to parallelize the parsing of such a file, although I haven't tried to do that in this PR.

The Daylight Map Distribution uses this format for their lake centrelines: https://daylightmap.org/2023/03/20/water-labels.html

I've also got a not-ready-for-primetime branch of tilemaker that exposes a GeoJSON() function to Lua, which I'm using to generate .geojsonl files for further processing (e.g., to use on subsequent runs as inputs to tilemaker's Intersects functions, or to use as input when deriving an SQLite autosuggest database)

systemed commented 6 months ago

Neat idea!

It's also possible to parallelize the parsing of such a file, although I haven't tried to do that in this PR.

I guess the major win would be moving the boost::asio::post into the file reading loop. That way we wouldn't need to construct the deque of JSON docs, which will be pretty trivial in the lake centreline case (small file) but could be useful in cases of very large files.

Maybe worth detecting .(geo)jsonseq extensions too? That appears to be what's being used by the Daylight file and is in https://datatracker.ietf.org/doc/html/rfc7464.

cldellow commented 6 months ago

Good feedback! Added support for files ending in jsonseq and parallelized the reading.

A possible area for future improvement: be able to prune prior to building the MultiPolygon and correcting it. I think it'd be safe to create a naive Box using the largest/smallest lat/lons to do a test to definitively rule out (but not rule in) whether a geometry should be loaded.

cldellow commented 6 months ago

Reading RFC 7464 and RFC 8142 a bit more closely, they also contemplate some behaviours that this PR does not currently support:

  1. use ASCII record separator (RS, 0x1e) as a mandatory separator
  2. permit internal newlines (e.g. when the JSON has been pretty-printed)
  3. support invalid JSON entries by ignoring them (vs failing)

I am skeptical that (2) and (3) are desirable in practice. (2) implies a tripling of file size and (3) implies you're fine ignoring corrupt data. (1) seems reasonable, but I wonder if it's actually used in practice, or if the path of least resistance tends to lead folks to newlines anyway.

I don't have a lot of context about how the geo world uses this file format. Looking at the Daylight GeoJSON file, it does not use the RS character.

tl;dr - this PR isn't in full compliance with the RFC, but that's probably OK? Can always revisit later if it turns out people actually use those features.

systemed commented 6 months ago

tl;dr - this PR isn't in full compliance with the RFC, but that's probably OK? Can always revisit later if it turns out people actually use those features.

I'd be surprised if there's more than half a dozen GeoJSON files in existence using RS as a record separator!

That's perfect - thank you.