visgl / loaders.gl

Loaders for big data visualization. Website:
https://loaders.gl

WKBLoader #767

Closed kylebarron closed 4 years ago

kylebarron commented 4 years ago

Well-known binary (WKB) is a binary geometry encoding. WKB encodes only geometries and doesn't store attributes. It's used in databases such as PostGIS and as the geometry storage format inside GeoPackage. It's also being discussed as the internal storage format for a "GeoArrow" specification. WKB is defined starting on page 62 of the OGC Simple Features specification.

It's essentially a binary representation of WKT. For common geospatial types including (Multi) Point, Line, and Polygon, there's a 1:1 correspondence between WKT/WKB and GeoJSON. WKT and WKB also support extended geometry types, such as Curve, Surface, and TIN, which don't have a correspondence to GeoJSON.
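To make the correspondence concrete, here is a minimal sketch of the same point as WKT (`POINT (1 2)`), WKB, and GeoJSON. A WKB record starts with a one-byte byte-order flag, then a 32-bit geometry type code, then the coordinates as 64-bit doubles. The names `parseWKBHeader` and `WKBHeader` are hypothetical and only for illustration; they are not loaders.gl API.

```typescript
// Sketch only: parseWKBHeader and WKBHeader are hypothetical names, not loaders.gl API.
interface WKBHeader {
  littleEndian: boolean;
  geometryType: number; // 1 = Point, 2 = LineString, 3 = Polygon, ...
}

function parseWKBHeader(view: DataView, offset = 0): WKBHeader {
  // Byte 0 is the byte-order flag: 0 = big endian, 1 = little endian.
  const littleEndian = view.getUint8(offset) === 1;
  // Bytes 1..4 hold the geometry type code as an unsigned 32-bit integer.
  const geometryType = view.getUint32(offset + 1, littleEndian);
  return {littleEndian, geometryType};
}

// WKB for the WKT `POINT (1 2)`, little endian.
const pointWKB = new Uint8Array([
  0x01, // byte order: little endian
  0x01, 0x00, 0x00, 0x00, // geometry type 1 = Point
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x3f, // x = 1.0
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40 // y = 2.0
]);

const pointView = new DataView(pointWKB.buffer);
const pointHeader = parseWKBHeader(pointView);
const pointX = pointView.getFloat64(5, pointHeader.littleEndian);
const pointY = pointView.getFloat64(13, pointHeader.littleEndian);

// The same geometry expressed as GeoJSON.
const pointGeoJSON = {type: 'Point', coordinates: [pointX, pointY]};
```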


We currently use a fork of wellknown for WKT parsing, but wellknown only supports WKT and not WKB.

The only permissive JS library I've seen to parse WKB is wkx, which parses WKT and WKB (as well as EWKT and EWKB, which include spatial reference identifiers on each record). It uses Node buffers, which requires a polyfill in the browser. I commented a month ago asking about interest in a PR to use DataViews instead of Node Buffers, but haven't received a response from the maintainer yet. The project does also use an older prototype-based code style.

Performance

There's momentum towards greater use of binary data transport in vis.gl projects. Keeping decoded WKB geometries in typed arrays could make WKB parsing faster than the existing alternatives.

wkx appears to parse WKB to GeoJSON, which presumably incurs overhead both from creating many JS objects and from passing them back to the main thread.

We couldn't avoid a pass over the WKB record entirely, but we could copy entire chunks of WKB data at a time into a Float64Array with minimal processing (possibly reversing byte endianness if needed, since WKB can be encoded in either big or little endianness).
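A rough sketch of that idea for a single 2D LineString record, assuming no Z/M values and no EWKB SRID prefix. `readLineStringCoords` is a hypothetical helper, not an existing loaders.gl function. When the record's byte order matches the host's, the coordinate bytes are copied in one bulk `slice()`; otherwise each double is read through `DataView` with the record's byte order.

```typescript
// Hypothetical helper, not loaders.gl API: decode the coordinates of a 2D WKB LineString.
function readLineStringCoords(buffer: ArrayBuffer): Float64Array {
  const view = new DataView(buffer);
  const littleEndian = view.getUint8(0) === 1;
  const geometryType = view.getUint32(1, littleEndian);
  if (geometryType !== 2) {
    throw new Error(`Expected LineString (2), got ${geometryType}`);
  }
  const nPoints = view.getUint32(5, littleEndian);
  const start = 9; // 1 byte order + 4 type + 4 point count
  const nValues = nPoints * 2;

  // Detect the host byte order by inspecting how a 16-bit integer is laid out.
  const hostLittleEndian = new Uint8Array(new Uint16Array([1]).buffer)[0] === 1;

  if (littleEndian === hostLittleEndian) {
    // Fast path: slice() copies the bytes in bulk. The copy is needed anyway because
    // offset 9 is not 8-byte aligned, so a Float64Array view cannot start there.
    return new Float64Array(buffer.slice(start, start + nValues * 8));
  }

  // Slow path: byte orders differ, so decode each double individually.
  const coords = new Float64Array(nValues);
  for (let i = 0; i < nValues; i++) {
    coords[i] = view.getFloat64(start + i * 8, littleEndian);
  }
  return coords;
}

// Usage: build a big-endian WKB LineString (30 10, 10 30, 40 40) and decode it.
const lineBuffer = new ArrayBuffer(9 + 6 * 8);
const lineView = new DataView(lineBuffer);
lineView.setUint8(0, 0); // byte order flag: big endian
lineView.setUint32(1, 2, false); // geometry type 2 = LineString
lineView.setUint32(5, 3, false); // three points
[30, 10, 10, 30, 40, 40].forEach((v, i) => lineView.setFloat64(9 + i * 8, v, false));
const lineCoords = readLineStringCoords(lineBuffer);
```

The result is a flat `[x0, y0, x1, y1, ...]` array that could be transferred between threads without creating per-vertex JS objects.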

ibgreen commented 4 years ago

@kylebarron Great writeup!

(Since you already took the time to write this, we should consider incorporating it into the WKB docs. It would be great to have such a little format summary, if not for all, then at least for some of our loaders.)

Some guidance:

(possibly reversing byte endianness if needed, since WKB can be encoded in either big or little endianness).

FWIW DataView.getFloat64() etc support an optional endianness parameter: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/DataView/getFloat64

kylebarron commented 4 years ago
  • It is OK to replace the current WKT implementation. Most of the work went into module scaffolding, setting up test cases, etc., and we can reuse that.

I don't think it will necessarily be helpful to replace the current WKT implementation. WKT is "comparable" only in the sense that it has a 1:1 correspondence with WKB; writing our own WKB parser won't necessarily make WKT parsing any easier. For example, to parse the following text you need to tokenize the spaces, parentheses, and commas:

POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
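To illustrate the kind of text handling involved, here is a deliberately narrow sketch that handles only a single-ring 2D `POLYGON`. `parseSimplePolygonWKT` is a hypothetical name, and this is not how wellknown or wkx is implemented; a real WKT parser also has to cope with holes, Multi* types, Z/M values, and EMPTY geometries.

```typescript
// Hypothetical helper for illustration; handles only a single-ring 2D POLYGON.
function parseSimplePolygonWKT(wkt: string): number[][] {
  // Match the keyword and the double parentheses, capturing the ring body.
  const match = /^\s*POLYGON\s*\(\(([^()]*)\)\)\s*$/i.exec(wkt);
  const body = match ? match[1] : undefined;
  if (body === undefined) {
    throw new Error('Not a single-ring POLYGON');
  }
  // Commas separate vertices; whitespace separates the numbers within a vertex.
  return body.split(',').map((pair) => pair.trim().split(/\s+/).map(Number));
}

const polygonRing = parseSimplePolygonWKT('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))');
```

None of this string splitting and number parsing is shared with a WKB reader, which only reads fixed-width integers and doubles.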
  • While good to keep an open channel to the maintainers, I think we should not hesitate to fork the wkx code base.
  • WKB is an important format for loaders.gl, so we want to be able to make quick changes.

Yes I agree. Also the WKB specification really isn't too large.

  • If we fork, we should also be able to make an AsyncIterator based version that parses in batches without too much effort.

Yes, I agree as well, though I still have to do more research on this.

FWIW DataView.getFloat64() etc support an optional endianness parameter: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/DataView/getFloat64

Oh I'd seen that but it didn't fully click... When you call DataView.getFloat64(), it always extracts the number into the host operating system's endianness? So getFloat64 will automatically convert a big-endian double to little endian?
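On the endianness question, a small sketch may help. `getFloat64(byteOffset, littleEndian)` returns an ordinary JS number; the second argument describes the byte order of the stored bytes, not the host's, so the result is the same on any machine. If the argument is omitted, the bytes are read as big endian.

```typescript
// Write 1.5 as a big-endian double, then read the same 8 bytes back both ways.
const endianView = new DataView(new ArrayBuffer(8));
endianView.setFloat64(0, 1.5, false); // false = store as big endian

// Reading with the matching flag recovers the value, whatever the host byte order is.
const readAsBigEndian = endianView.getFloat64(0, false);

// Reading with the wrong flag interprets the bytes in reverse order and yields a different number.
const readAsLittleEndian = endianView.getFloat64(0, true);

console.log(readAsBigEndian, readAsLittleEndian);
```

So a WKB reader only needs to pass the record's byte-order flag through to each `DataView` call; no manual byte swapping is required on that path.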

kylebarron commented 4 years ago
  • If we fork, we should also be able to make an AsyncIterator based version that parses in batches without too much effort.

WKB encodes individual records. Given this, AsyncIterator support in the WKBLoader itself doesn't seem that useful, since you'd be yielding incomplete portions of a single geometry. It seems more useful to focus on AsyncIterator support for higher-level loaders that call WKBLoader internally.

ibgreen commented 4 years ago

I don't think it will necessarily be helpful to replace the current WKT implementation.

My thinking is that if we were forking the wkx module and it had common code for the two cases, then it could make sense.

Agree that if we are not handling the streaming case in this loader, then it probably doesn't matter.

kylebarron commented 4 years ago

What about having a separate BinaryToGeoJSON converter, the inverse of the current GeoJSONtoBinary? This would mean that each geometry loader could choose the more appropriate and performant output type, and converting could then happen if necessary?

That means that each geometry loader can target a single output format, and we avoid creating a binary and a GeoJSON output for every loader?

In this case, the WKB loader would output only binary, and the converter would apply if necessary.
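As a rough sketch of what such a converter could look like for one geometry type, assuming a flat positions array plus a per-vertex size: the `BinaryLineString` shape and `binaryLineToGeoJSON` name below are hypothetical, and the actual loaders.gl binary format may differ.

```typescript
// Hypothetical shapes for illustration; the real loaders.gl binary format may differ.
interface BinaryLineString {
  positions: Float64Array; // flat [x0, y0, x1, y1, ...]
  size: number; // values per vertex (2 for x/y, 3 for x/y/z)
}

interface GeoJSONLineString {
  type: 'LineString';
  coordinates: number[][];
}

function binaryLineToGeoJSON(binary: BinaryLineString): GeoJSONLineString {
  const {positions, size} = binary;
  const coordinates: number[][] = [];
  // Regroup the flat array into one nested [x, y, ...] array per vertex.
  for (let i = 0; i < positions.length; i += size) {
    coordinates.push(Array.from(positions.subarray(i, i + size)));
  }
  return {type: 'LineString', coordinates};
}

const lineGeoJSON = binaryLineToGeoJSON({
  positions: new Float64Array([30, 10, 10, 30, 40, 40]),
  size: 2
});
```

With this split, the loader itself only produces typed arrays, and the per-vertex object allocation happens only for callers that ask for GeoJSON.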

ibgreen commented 4 years ago

Maybe. Though in some cases (KML) GeoJSON can only hold a subset of the data in the original format.

kylebarron commented 4 years ago

For the WKBLoader specifically, it seems like a good idea to focus performance on exporting binary arrays, and then modularize the conversion to GeoJSON so it can be used in the future if necessary.

ibgreen commented 4 years ago

Yes but since it only handles one primitive and we are not streaming, it gets a bit silly - we'd need it to have a function to top off an existing array.

That is probably what you were trying to say above.

But it does make this loader kind of unique.
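One possible reading of "top off an existing array" is an accumulator that a higher-level loader owns and that each parsed WKB record appends to. The sketch below is only an illustration under that assumption; `PositionAccumulator` is a hypothetical name and not part of loaders.gl.

```typescript
// Hypothetical accumulator, not loaders.gl API: appends coordinates to a growing typed array.
class PositionAccumulator {
  private data = new Float64Array(16);
  private used = 0;

  append(coords: Float64Array): void {
    const needed = this.used + coords.length;
    if (needed > this.data.length) {
      // Grow geometrically so repeated appends stay cheap on average.
      const grown = new Float64Array(Math.max(needed, this.data.length * 2));
      grown.set(this.data.subarray(0, this.used));
      this.data = grown;
    }
    this.data.set(coords, this.used);
    this.used = needed;
  }

  // Return a copy trimmed to the values actually written.
  toArray(): Float64Array {
    return this.data.slice(0, this.used);
  }
}

// Usage: append the coordinates of two separately parsed geometries.
const accumulator = new PositionAccumulator();
accumulator.append(new Float64Array([30, 10, 10, 30]));
accumulator.append(new Float64Array([40, 40]));
const allPositions = accumulator.toArray();
```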