Tracker: GIS Category Loaders / Flat Binary Array support

ibgreen commented 4 years ago

The loaders in GIS category, specifically MVTLoader could benefit from being able to parse data into flat binary arrays.

[x] Define how these arrays would look for various feature types (see discussion below)
[ ] Create helper function that converts geojson to such arrays.

kylebarron commented 4 years ago

If I'm not mistaken, the current flow of the MVTLoader when used with the Deck.gl MVTLayer is:

Load geometries and rescale to 0-1 tile extents within the MVTLoader worker; return a list of GeoJSON objects
In the MVTLayer, take those GeoJSON objects, create a series of TypedArrays given the provided props, and upload to the GPU.

So in order to speed this up, I see two options:

A general GeoJSON to TypedArray function that runs in the worker, so that the process would be load geometries -> GeoJSON objects -> TypedArrays; then send those TypedArrays back to the main thread.
A function tied specifically to the MVTLoader that prevents the need for creation of intermediate GeoJSON objects.

I wrote up a quick prototype of the latter, specifically taking predefined LineString features straight from the @mapbox/vector-tile Tile object and adding their coordinates and attributes to an object of TypedArrays in the format that PathLayer expects.

Code

```js // used to filter specified features within each layer const geometryIndices = { waterway: [0, 2, 4] } // used to override default styling properties const properties = { waterway: { 0: { color: [100, 150, 200], width: 1 } } } function toPathLayer({tile, geometryIndices, properties}) { let startIndexCounter = 0; const startIndices = [0]; const positions = [] const colors = [] const widths = [] for (const layerName in geometryIndices) { const layerIndices = geometryIndices[layerName] for (const index of layerIndices) { const feature = tile.layers[layerName].feature(index) const geometry = feature.loadGeometry() // Add geometry let nVertices = 0 for (const point of geometry[0]) { positions.push(point.x) positions.push(point.y) nVertices += 2 } startIndexCounter += nVertices; startIndices.push(startIndexCounter); // Find properties for this feature let color = DEFAULT_COLOR; let width = DEFAULT_WIDTH; if (properties[layerName] && properties[layerName][index]) { if (properties[layerName][index].color) { color = properties[layerName][index].color } if (properties[layerName][index].width) { width = properties[layerName][index].width } } // Add properties to arrays for (let i = 0; i < nVertices; i++) { colors.push.apply(color); widths.push(width) } } } // data object ready to be passed to PathLayer return { length: startIndices.slice(-1)[0], startIndices: startIndices, positionFormat: 'XY', attributes: { getPath: Float32Array.from(positions), getColors: Uint8ClampedArray.from(colors), getWidths: Float32Array.from(widths) } } } ```

Remarks:

In my prototype, I'm trying to additionally generate styling attributes, in which case, information about how each feature is to be styled needs to already exist. I saw you mention previously that a function can't be serialized through postMessage to the worker, so attribute generation generally might not be possible.
I don't think it's possible to know the number of coordinates without looking at each feature individually, and thus it wouldn't be possible to initialize a TypedArray of positions to the correct length.
How closely should TypedArray responses align with Deck.gl formats?

From https://github.com/uber/deck.gl/pull/3935#issuecomment-598745848

Converting general geojson (with different feature types) to a flat array leads to a rather complex flat array (many different coordinate array structures). Generating a bunch of different flat arrays optimization for the different feature types could also be an option.

Yes, I'm working under the assumption that a separate flat array would be created for each type of geometry. This would make it easier to pass each response type to a different Deck.gl layer without any further processing.

ibgreen commented 4 years ago

A function tied specifically to the MVTLoader that prevents the need for creation of intermediate GeoJSON objects.

While I agree this avoiding intermediate GeoJson object creation is the long-term desirable setup, given the number of moving parts here, it could be worth not dealing with that issue initially.

I would start with creating a generic function that converts geojson (already parsed/extracted) to binary arrays.

If you then call that function in the worker loader, you will be able to pass back typed arrays instead of javascript data structure, avoiding postMessage serialization and deserialization, already a major win.

You can then test the resulting typed arrays with deck.gl and make sure everything works.

We can finally go back to each loader that returns geojson formatted data and see if there is a way to avoid the creation of intermediary geojson inside the loader.

You can see where I am going with this: avoiding the intermediate GeoJson requires custom work for each loader and is really the final touch when this all works end-to-end.

ibgreen commented 4 years ago

PS - love your expandable code section... Neat trick!

How closely should TypedArray responses align with Deck.gl formats?

The deck.gl binary attribute support have been evolving for a while now. It is maturing but is still quite deck.gl specific (i.e. the key are called getPositions, not positions)

Given that loaders.gl is strongly advertised as being framework independent, I would make a function that returns a very general set of typed arrays and offer a function that converts to most current deck.gl binary data object. Maybe similar to the point cloud example?

ibgreen commented 4 years ago

I don't think it's possible to know the number of coordinates without looking at each feature individually, and thus it wouldn't be possible to initialize a TypedArray of positions to the correct length.

While this may not be the very most optimized version for deck.gl, the most natural way to represent this in binary IMHO is with multiple arrays, that contain the indices.

Points:

positions: Float32Array|Float64Array 2 (or 3) values per point

Lines:

pathIndices: Uint32 start vertex index for each line
positions: Float32Array|Float64Array 2 (or 3) values per vertex

Polygons

polygonIndices: Uint32Array- start offset indices for each complex polygon
primitivePolygonIndices: Uint32Array - start offset indices for each simple polygon/hole
positions: Float32Array|Float64Array 2 (or 3) values per point

In my prototype, I'm trying to additionally generate styling attributes, in which case, information about how each feature is to be styled needs to already exist. I saw you mention previously that a function can't be serialized through postMessage to the worker, so attribute generation generally might not be possible.

Your idea of optionally extracting numerically values properties into typed arrays is very neat!
You may also want an objectId typed array that gives the index of each vertex into the original arrays. Beyond helping with the array split, it also lets you flatten MultiPoint, MultiPolygon etc map to multiple primitives, while still allowing them to refer back to the original indices.

kylebarron commented 4 years ago

Ok, here's a stab at it. Essentially filling data arrays while iterating over an array of Features, and then converting the arrays to typed arrays. I also put it in a repo here, where I was testing with some geojson datasets used in deck examples.

Code

```js function featuresToArrays({ features }) { const points = { positions: [], objectIds: [] }; const lines = { pathIndices: [0], positions: [], objectIds: [] }; const polygons = { polygonIndices: [0], primitivePolygonIndices: [0], positions: [], objectIds: [] }; let featureCounter = 0; for (const feature of features) { const geometry = feature.geometry; if (geometry.type === "Point") { points.objectIds.push(featureCounter); points.positions.push(...geometry.coordinates); } else if (geometry.type === "MultiPoint") { points.objectIds.push(featureCounter); points.positions.push(...geometry.coordinates.flat()); } else if (geometry.type === "LineString") { lines.objectIds.push(featureCounter); const index = lines.positions.push(...geometry.coordinates.flat()); lines.pathIndices.push(index); } else if (geometry.type === "MultiLineString") { lines.objectIds.push(featureCounter); for (const line of geometry.coordinates) { const index = lines.positions.push(...line.flat()); lines.pathIndices.push(index); } } else if (geometry.type === "Polygon") { polygons.objectIds.push(featureCounter); let linearRingCounter = 0; for (const linearRing of geometry.coordinates) { const index = polygons.positions.push(...linearRing.flat()); polygons.primitivePolygonIndices.push(index); if (linearRingCounter === 0) polygons.polygonIndices.push(index); linearRingCounter++; } } else if (geometry.type === "MultiPolygon") { polygons.objectIds.push(featureCounter); for (const polygon of geometry.coordinates) { let linearRingCounter = 0; for (const linearRing of polygon) { const index = polygons.positions.push(...linearRing.flat()); polygons.primitivePolygonIndices.push(index); if (linearRingCounter === 0) polygons.polygonIndices.push(index); linearRingCounter++; } } } featureCounter++; } return { points: { positions: Float32Array.from(points.positions), objectIds: Uint32Array.from(points.objectIds) }, lines: { pathIndices: Uint32Array.from(lines.pathIndices), positions: Float32Array.from(lines.positions), objectIds: Uint32Array.from(lines.objectIds) }, polygons: { polygonIndices: Uint32Array.from(polygons.polygonIndices), primitivePolygonIndices: Uint32Array.from( polygons.primitivePolygonIndices ), positions: Float32Array.from(polygons.positions), objectIds: Uint32Array.from(polygons.objectIds) } }; } ```

Remarks:

objectIds points to the index of geometry in the original features array. So the polygon whose coordinates are those from polygonIndices[0] to polygonIndices[1] is the original feature feature[objectIds[0]].
How to deal with split 2D/3D coordinates? The spec on coordinates only says there must be between 2-3 coordinates per point, so it seems permissible to have some coordinates that are only 2D and some that are 3D. There could be an argument positionFormat, and if its value is XYZ then every coordinate added to positions could be length-checked, and 0 or NaN is filled if the coordinates are only of length 2.
Doesn't handle GeometryCollections
I tried to search for an existing GeoJSON test suite, like mvt-fixtures exists for vector tiles, but didn't find one, so this isn't well-tested.

ibgreen commented 4 years ago

Yes, this looks awesome just along the lines I was thinking!

(micro-nit: not sure I ever saw a better use case for a switch statement...)

If you look at the loaders.gl module structure, you will see that some loader categories have a their own module with helper classes (@loaders.gl/tables for the tabular loaders, @loaders.gl/tiles for the 3D tile loaders, ...). I feel this would be a perfect time to create a @loaders.gl/gis or @loaders.gl/geospatial module to provide such helper functions for what we currently call the GIS category loaders.

Setting up a new module is straightforward but requires a bit of scaffolding. Let me know if you are interested in doing this, if not, if you are interested in contributing the code above if we scaffold this for you.

You could also start by adding this as an extra experimental (underscore) export to the mvt loader module.

ibgreen commented 4 years ago

For 2D/3D coordinates I think this should be an input parameter to your conversion routine.

At maturity, you probably want to provide a sniffer/detector function that scans the geojson first

detecting if any 3D coordinates are present
detecting which properties are present so that you can add columns for these (may not be in all features)
counting the lengths of various arrays so you can allocate the typed arrays up front instead of building up temporary JS arrays.
etc...

ibgreen commented 4 years ago

About objectIds - For best compatibility with shader code (shader process one vertex at a time without access to any other data), the objectIds array should be the same length as positions (divided by 2 or 3 of course). I.e. each vertex should have its objectId available by same index in that array

kylebarron commented 4 years ago

Setting up a new module is straightforward but requires a bit of scaffolding. Let me know if you are interested in doing this, if not, if you are interested in contributing the code above if we scaffold this for you.

I can try to scaffold; I'll see how it goes.

At maturity, you probably want to provide a sniffer/detector function that scans the geojson first

I wasn't sure if it would be faster to iterate through features twice: once to scan, detect issues, and measure number of coordinates to pre-allocate typed arrays, and once to add the coordinates to the typed arrays; or a single pass where intermediate arrays were created and then typed arrays are created at the end.

detecting which properties are present so that you can add columns for these (may not be in all features)

Right now I don't touch properties at all, because it seems like typed arrays for properties would only be useful for specific deck.gl classes. Since properties could be arbitrary types, I'm not sure the best way to handle them.

About objectIds - For best compatibility with shader code (shader process one vertex at a time without access to any other data), the objectIds array should be the same length as positions (divided by 2 or 3 of course). I.e. each vertex should have its objectId available by same index in that array

Ok, this makes sense. The value of objectIds should still point to the index of the feature, right? Not a vertex within the feature?

ibgreen commented 4 years ago

I can try to scaffold; I'll see how it goes.

Great, don't overwork it, just put something up, I'll land your PR asap and make any necessary cleanup.

I wasn't sure if it would be faster to iterate through features twice

Always hard to predict performance for sure, but my guess is that 2-pass is on the average equal or faster once you factor in that you can essentially avoid any temporary object creation, increasing memory pressure etc, and overall usability/robustness will be better.

Also the conversion logic will be simpler when you don't have to worry about suddenly finding an unexpected 3D coordinate etc - especially once you start adding more features such as props.

Since properties could be arbitrary types, I'm not sure the best way to handle them.

Yes, unless we add significant complexity it probably only makes sense to support props that are always numbers (think price, elevation, ...). It seems to me that numbers could be supported fairly easily: if you have a sniffer function, you could scan for property types which are always numbers and offer these. Like the objectIds it may make most sense to make this match the length of the positions array so that they can easily be used in shaders.

You could have filter options for props to avoid creating such arrays if not needed.

Ok, this makes sense. The value of objectIds should still point to the index of the feature, right? Not a vertex within the feature?

Yes you want 1) a unique objectId for all vertices that belong to a logical grouping. 2) since this id can be returned by the picking process, you can also make it serve as a meaningful index.

The candidates for this index are the index in the original geojson array or the index into the "filtered" point/path/polygon-only array. You could create both, if not I vote for the original geojson index as you suggest since it is most likely to be meaningful to the end user.

kylebarron commented 4 years ago

It might be a few days before I get to it, so I wanted to keep track of the ideas from #703

[ ] Add option to MVTLoader, JSONLoader(?) for this to actually be called from within the load worker.
[ ] Add typescript definitions
[ ] Should numericKeys be separated by geometry type, e.g. for if only one geometry type contains a certain numeric property. https://github.com/uber-web/loaders.gl/pull/703#discussion_r399522939
[ ] Small updates from last comments https://github.com/uber-web/loaders.gl/pull/703#pullrequestreview-383313826
[ ] Missing values for positions/numeric properties. Currently arrays are initialized to 0, so for mixed 2D/3D input or numeric properties that don't exist for that feature, the de facto missing value is 0. We could initialize Float typed arrays to NaN, so that unfilled items stay as NaN.

kylebarron commented 4 years ago

It looks like the JSONLoader isn't workerized? Is this because usually serializing/deserializing the JSON objects across threads would take longer than the parsing itself?

Now if parsing and JSON -> binary arrays can happen on a worker thread, it would be nice to add (optional) support for workers to the JSONLoader.

Also, it might be worth it to add .geojson as a file type the JSONLoader supports.

ibgreen commented 4 years ago

It looks like the JSONLoader isn't workerized? Is this because usually serializing/deserializing the JSON objects across threads would take longer than the parsing itself?

Yes. There are a few reasons:

we didn't have your binary converter
we have a streaming JSON loader that offers great performance in a different way and allows larger than \<add browser specific limit> GB files (but that would be harder to put on a worker).

it would be nice to add (optional) support for workers to the JSONLoader. Also, it might be worth it to add .geojson as a file type the JSONLoader supports.

First, what should a generic JSONLoader return? Probably not a geojson specific binary bundle...
We could add an options.json.binaryGeoJSON, but that feels rather special purpose.
Maybe better to add a GeoJSONLoader that registers the .geojson file type, and belongs to the GIS category, and has options.geojson.binary that activates the binary conversion?
Ultimately I expect we'd want all our GIS category loaders to support this binary format.

Just ideating here, there are probably quite a few more wrinkles, so please keep the discussion going...

kylebarron commented 4 years ago

Good points.

Since JSON is arbitrary without any data schema, while GeoJSON has a well-defined schema, I think you're right to suggest that JSONLoader and GeoJSONLoader be two separate modules. I presume GeoJSONLoader can reuse all of JSONLoader.

On the binary format itself, I've been reading more about Arrow, and it's quite exciting. While the GeoJSON to binary code is still new, should there be discussion about the binary format stability? If Deck.gl and Loaders.gl support Arrow more directly in the future, it might be worthwhile to at some point have a "GeoJSON to Arrow" utility, where maybe each geometry type is its own Arrow Table. Is it of interest to support conversion to two separate binary formats? If not, is it ok to start with the current binary arrays and move to Arrow when it's more supported?

Also, I think it's worthwhile to start thinking about how these binary arrays could be integrated directly into deck.gl. E.g. as directly supported by the GeoJSONLayer and MVTLayer. Accessors would have to be applied across both string objects and/or numeric property arrays.

ibgreen commented 4 years ago

Since JSON is arbitrary without any data schema, while GeoJSON has a well-defined schema, I think you're right to suggest that JSONLoader and GeoJSONLoader be two separate modules. I presume GeoJSONLoader can reuse all of JSONLoader.

Yes. I am certainly suggesting two separate loaders though they could go into the same module. But since jsonloader is generic, it seems unfair that every JSONLoader user should need to import geojson specific "bloatware", so probably two modules :)

it might be worthwhile to at some point have a "GeoJSON to Arrow" utility

Yes. That is the plan!

I believe that we can simply create a generic converter from maps of {size, value, ...} accessor objects to arrow tables. I believe this can cover the actual conversion. It is somewhat fiddly to get started with Arrow, I could help you get up to speed.

There is a tables category module for tables and an arrow module for Arrow specific code, we should be able to start working in those.

should there be discussion about the binary format stability?

If you mean the maps of {size, value, ...} "accessor objects" that you already generate, they are common across luma.gl deck.gl and loaders.gl, and documented in luma.gl - and maybe in the mesh category in loaders.gl, so yes they should be stable (however current definition does not support your nested numeric props and prop objects so we may need to expand the definition a bit, or flatten your structure).

"GeoJSON to Arrow" utility, where maybe each geometry type is its own Arrow Table

Yes, loading geojson into three arrow tables make sense, just like we are currently loading them into three binary object maps (also note that some data sources, while technically geojson, only contain one type of geojson primitives meaning only one table is used, we may be able to offer some options to simplify such usage).

Also, I think it's worthwhile to start thinking about how these binary arrays could be integrated directly into deck.gl. E.g. as directly supported by the GeoJSONLayer and MVTLayer.

Yes. Note that even though we generate binary attributes, those cannot always be used directly by the GPU. They still need to be tesselated, at least for polygons (I believe some normalization may also be needed for paths). The good news is that a lot of the binary groundwork has already been done in deck, so I recommend we progress a bit more here, then we can start dealing with that.

kylebarron commented 4 years ago

Yes. I am certainly suggesting two separate loaders though they could go into the same module. But since jsonloader is generic, it seems unfair that every JSONLoader user should need to import geojson specific "bloatware", so probably two modules :)

Yes, simple JSON loading shouldn't have to deal with this bloat... Also that means we can workerize only the GeoJSONLoader.

I believe that we can simply create a generic converter from maps of {size, value, ...} accessor objects to arrow tables. I believe this can cover the actual conversion. It is somewhat fiddly to get started with Arrow, I could help you get up to speed.

Interesting. I think I overlooked the potential for keeping so much of the existing code and generating Arrow columns from the typed array output. So my question about binary format stability was mainly based on thinking much code would need to be rewritten for Arrow, and then whether both output formats were desired.

The one other question is when converting to Arrow whether it's desired to include string properties, so that all properties can be included in the Arrow Table. I haven't looked too deep into how strings are encoded; are they always encoded as a Dictionary type?

Yes, loading geojson into three arrow tables make sense, just like we are currently loading them into three binary object maps (also note that some data sources, while technically geojson, only contain one type of geojson primitives meaning only one table is used, we may be able to offer some options to simplify such usage).

True. Also a GeoJSONLoader probably should correctly parse geometries that aren't wrapped in a FeatureCollection. The current GeoJSON to binary code assumes the Feature array input conforms correctly to the spec.

Yes. Note that even though we generate binary attributes, those cannot always be used directly by the GPU. They still need to be tesselated, at least for polygons (I believe some normalization may also be needed for paths). The good news is that a lot of the binary groundwork has already been done in deck, so I recommend we progress a bit more here, then we can start dealing with that.

Ah I didn't realize the tessellation would be computed on CPU and not GPU? From the PathLayer docs, I think that if the binary attributes are correctly formed, you should be able to bypass any further normalization.

So what's the next step here? Create a GeoJSONLoader and support add a binary conversion option to that and the MVTLoader?

ibgreen commented 4 years ago

The one other question is when converting to Arrow whether it's desired to include string properties, so that all properties can be included in the Arrow Table. I haven't looked too deep into how strings are encoded; are they always encoded as a Dictionary type?

No. There is a Utf8 type. Each row will have a different number of bytes. Because of this, string columns are not usable in GPU shaders, but certainly can be accessed on the CPU, so if we want to preserve as much data as we can in an arrow table we can offer the option to include string columns.

So what's the next step here? Create a GeoJSONLoader and support add a binary conversion option to that and the MVTLoader?

Yes GeoJsonLoader is a good start. For orthogonality, we should probably support this feature (binary output and worker loaders) for all GIS category loaders (includes at least WKTLoader and KMLLoader, though not 100% sure if the latter will currently run on a worker due to the hacky XMLLoader). We need to think about the right combo of options. I suppose a starting point is geojson: true and binary: true?

ibgreen commented 3 years ago

Mostly implemented in 3.0

visgl / loaders.gl

Tracker: GIS Category Loaders / Flat Binary Array support #685