If I'm not mistaken, the current flow of the `MVTLoader` when used with the Deck.gl `MVTLayer` is:

- `MVTLoader`: parse the tile in a worker; return a list of GeoJSON objects.
- `MVTLayer`: take those GeoJSON objects, create a series of `TypedArray`s given the provided props, and upload to the GPU.

So in order to speed this up, I see two options:

- A GeoJSON-to-`TypedArray` conversion function that runs in the worker, so that the process would be load geometries -> GeoJSON objects -> TypedArrays; then send those TypedArrays back to the main thread.
- A function tied specifically to the `MVTLoader` that prevents the need for creation of intermediate GeoJSON objects.

I wrote up a quick prototype of the latter, specifically taking predefined `LineString` features straight from the `@mapbox/vector-tile` `Tile` object and adding their coordinates and attributes to an object of `TypedArray`s in the format that `PathLayer` expects.
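For illustration, a minimal sketch (not the actual prototype; the field names here are hypothetical) of what such an object of TypedArrays for LineString features might look like:

```js
// Hypothetical flat output for two LineString features.
// positions: x,y pairs for all paths, concatenated.
// pathIndices: vertex offset at which each path starts.
const binaryLines = {
  positions: new Float32Array([0, 0, 1, 1, 2, 2, 10, 10, 11, 11]),
  pathIndices: new Uint32Array([0, 3, 5]), // path 0 = vertices 0..2, path 1 = vertices 3..4
  // optional per-vertex styling attribute generated in the worker
  widths: new Float32Array([1, 1, 1, 2, 2])
};
```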
Remarks:

- In my prototype, I'm trying to additionally generate styling attributes, in which case, information about how each feature is to be styled needs to already exist. I saw you mention previously that a function can't be serialized through `postMessage` to the worker, so attribute generation generally might not be possible.
- I don't think it's possible to know the number of coordinates without looking at each feature individually, and thus it wouldn't be possible to initialize a `TypedArray` of positions to the correct length.

From https://github.com/uber/deck.gl/pull/3935#issuecomment-598745848:
> Converting general geojson (with different feature types) to a flat array leads to a rather complex flat array (many different coordinate array structures). Generating a bunch of different flat arrays optimized for the different feature types could also be an option.
Yes, I'm working under the assumption that a separate flat array would be created for each type of geometry. This would make it easier to pass each response type to a different Deck.gl layer without any further processing.
> A function tied specifically to the MVTLoader that prevents the need for creation of intermediate GeoJSON objects.
While I agree that avoiding intermediate GeoJSON object creation is the long-term desirable setup, given the number of moving parts here, it could be worth not dealing with that issue initially.
I would start with creating a generic function that converts geojson (already parsed/extracted) to binary arrays.
If you then call that function in the worker loader, you will be able to pass back typed arrays instead of JavaScript data structures, avoiding `postMessage` serialization and deserialization, which is already a major win.
You can then test the resulting typed arrays with deck.gl and make sure everything works.
We can finally go back to each loader that returns geojson formatted data and see if there is a way to avoid the creation of intermediary geojson inside the loader.
You can see where I am going with this: avoiding the intermediate GeoJson requires custom work for each loader and is really the final touch when this all works end-to-end.
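A rough sketch of that flow, with hypothetical helper names (`geojsonToBinary` and `parseMVTToGeoJSON` are placeholders, not existing loaders.gl functions):

```js
// Runs inside the worker. Only TypedArrays (cheap to structured-clone, or
// transferable) cross the postMessage boundary back to the main thread.
import {geojsonToBinary} from './geojson-to-binary';       // hypothetical helper
import {parseMVTToGeoJSON} from './parse-mvt-to-geojson';  // placeholder for the existing parsing

export async function parseMVT(arrayBuffer, options = {}) {
  const features = parseMVTToGeoJSON(arrayBuffer, options); // GeoJSON features as today
  // If binary output is requested, convert before leaving the worker so the
  // intermediate GeoJSON objects never have to be serialized.
  return options.binary ? geojsonToBinary(features) : features;
}
```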
PS - love your expandable code section... Neat trick!
> How closely should TypedArray responses align with Deck.gl formats?
The deck.gl binary attribute support has been evolving for a while now. It is maturing but is still quite deck.gl specific (i.e. the keys are called `getPositions`, not `positions`).
Given that loaders.gl is strongly advertised as being framework independent, I would make a function that returns a very general set of typed arrays and offer a function that converts to most current deck.gl binary data object. Maybe similar to the point cloud example?
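As an illustration of that split, a small adapter could map a generic set of arrays onto deck.gl-style accessor keys. This is only a sketch: the generic field names are hypothetical, and the deck.gl-side shape should be checked against the current deck.gl binary-data docs.

```js
// Generic, framework-independent arrays for paths (hypothetical shape) ...
const genericLines = {
  positions: new Float32Array([0, 0, 1, 1, 2, 2, 10, 10, 11, 11]),
  pathIndices: new Uint32Array([0, 3, 5]),
  positionSize: 2
};

// ... converted into a deck.gl-flavored binary data object, where attributes
// are keyed by accessor name (getPath) rather than by a generic name.
function toDeckGlPathData({positions, pathIndices, positionSize}) {
  return {
    length: pathIndices.length - 1, // number of paths
    startIndices: pathIndices,
    attributes: {
      getPath: {value: positions, size: positionSize}
    }
  };
}
```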
> I don't think it's possible to know the number of coordinates without looking at each feature individually, and thus it wouldn't be possible to initialize a TypedArray of positions to the correct length.
While this may not be the very most optimized version for deck.gl, the most natural way to represent this in binary IMHO is with multiple arrays that contain the indices, with a separate set of arrays for points, lines, and polygons (a sketch is below).
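The per-type layouts could look roughly like this (a sketch of the general idea, not the exact structures proposed here):

```js
// Points: every point is a single vertex, so one coordinate array is enough.
const points = {
  positions: new Float32Array([0, 0, 1, 2, 3, 4])
};

// Lines: coordinates plus an index array marking where each path starts.
const lines = {
  positions: new Float32Array([/* all path vertices, flattened */]),
  pathIndices: new Uint32Array([/* start vertex of each path */])
};

// Polygons: coordinates plus indices for both polygon starts and ring starts
// (rings are needed to tell outer boundaries from holes).
const polygons = {
  positions: new Float32Array([/* all polygon vertices, flattened */]),
  polygonIndices: new Uint32Array([/* start vertex of each polygon */]),
  ringIndices: new Uint32Array([/* start vertex of each ring */])
};
```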
> In my prototype, I'm trying to additionally generate styling attributes, in which case, information about how each feature is to be styled needs to already exist. I saw you mention previously that a function can't be serialized through postMessage to the worker, so attribute generation generally might not be possible.
One suggestion: add an `objectId` typed array that gives the index of each vertex into the original arrays. Beyond helping with the array split, it also lets you flatten MultiPoint, MultiPolygon etc. that map to multiple primitives, while still allowing them to refer back to the original indices.

Ok, here's a stab at it. Essentially filling data arrays while iterating over an array of `Features`, and then converting the arrays to typed arrays. I also put it in a repo here, where I was testing with some geojson datasets used in deck examples.
Remarks:

- `objectIds` points to the index of the geometry in the original `features` array. So the polygon whose coordinates are those from `polygonIndices[0]` to `polygonIndices[1]` is the original feature `features[objectIds[0]]`.
- An option such as `positionFormat` could be added, and if its value is `XYZ` then every coordinate added to `positions` could be length-checked, with `0` or `NaN` filled in if the coordinates are only of length 2.
- `GeometryCollection`s aren't handled.
- I looked for a fixtures repository like the one `mvt-fixtures` provides for vector tiles, but didn't find one, so this isn't well-tested.

Yes, this looks awesome, just along the lines I was thinking!
(micro-nit: not sure I ever saw a better use case for a `switch` statement...)
If you look at the loaders.gl module structure, you will see that some loader categories have their own module with helper classes (`@loaders.gl/tables` for the tabular loaders, `@loaders.gl/tiles` for the 3D tile loaders, ...). I feel this would be a perfect time to create a `@loaders.gl/gis` or `@loaders.gl/geospatial` module to provide such helper functions for what we currently call the GIS category loaders.
Setting up a new module is straightforward but requires a bit of scaffolding. Let me know if you are interested in doing this, and if not, whether you are interested in contributing the code above if we scaffold it for you.
You could also start by adding this as an extra experimental (underscore) export to the mvt loader module.
- For 2D/3D coordinates I think this should be an input parameter to your conversion routine.
- At maturity, you probably want to provide a sniffer/detector function that scans the geojson first, e.g. detecting which `properties` are present so that you can add columns for these (may not be in all features).
- About `objectIds` - For best compatibility with shader code (shaders process one vertex at a time without access to any other data), the `objectIds` array should be the same length as `positions` (divided by 2 or 3 of course). I.e. each vertex should have its `objectId` available at the same index in that array.
> Setting up a new module is straightforward but requires a bit of scaffolding. Let me know if you are interested in doing this, and if not, whether you are interested in contributing the code above if we scaffold it for you.
I can try to scaffold; I'll see how it goes.
> At maturity, you probably want to provide a sniffer/detector function that scans the geojson first
I wasn't sure if it would be faster to iterate through `features` twice: once to scan, detect issues, and measure the number of coordinates to pre-allocate typed arrays, and once to add the coordinates to the typed arrays; or to do a single pass where intermediate arrays are created and then typed arrays are created at the end.
> detecting which properties are present so that you can add columns for these (may not be in all features)
Right now I don't touch properties at all, because it seems like typed arrays for properties would only be useful for specific deck.gl classes. Since properties could be arbitrary types, I'm not sure of the best way to handle them.
> About `objectIds` - For best compatibility with shader code (shaders process one vertex at a time without access to any other data), the `objectIds` array should be the same length as `positions` (divided by 2 or 3 of course). I.e. each vertex should have its `objectId` available at the same index in that array.
Ok, this makes sense. The value of `objectIds` should still point to the index of the feature, right? Not a vertex within the feature?
> I can try to scaffold; I'll see how it goes.
Great, don't overwork it, just put something up, I'll land your PR asap and make any necessary cleanup.
> I wasn't sure if it would be faster to iterate through features twice
Always hard to predict performance for sure, but my guess is that 2-pass is on average equal or faster once you factor in that you can essentially avoid any temporary object creation (which increases memory pressure etc.), and overall usability/robustness will be better.

Also the conversion logic will be simpler when you don't have to worry about suddenly finding an unexpected 3D coordinate etc., especially once you start adding more features such as props.
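A minimal sketch of the two-pass approach for the positions array only (helper names are hypothetical, and real code would also handle the different geometry types):

```js
// Pass 1: count vertices so the TypedArray can be allocated once, up front.
// Pass 2: fill the preallocated array, creating no intermediate JS arrays.
function featuresToPositions(features, positionSize = 2) {
  let vertexCount = 0;
  for (const feature of features) {
    vertexCount += countVertices(feature.geometry); // hypothetical helper
  }

  const positions = new Float32Array(vertexCount * positionSize);
  let i = 0;
  for (const feature of features) {
    forEachCoordinate(feature.geometry, coord => {  // hypothetical helper
      positions[i++] = coord[0];
      positions[i++] = coord[1];
      if (positionSize === 3) {
        positions[i++] = coord.length > 2 ? coord[2] : 0;
      }
    });
  }
  return positions;
}
```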
> Since properties could be arbitrary types, I'm not sure of the best way to handle them.
Yes, unless we add significant complexity it probably only makes sense to support props that are always numbers (think price, elevation, ...). It seems to me that numbers could be supported fairly easily: if you have a sniffer function, you could scan for property types which are always numbers and offer these. Like the objectIds, it may make the most sense to make this match the length of the positions array so that they can easily be used in shaders.
You could have filter options for props to avoid creating such arrays if not needed.
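For example, numeric property columns could mirror the positions/objectIds indexing, with a filter option to skip unneeded columns (names here are illustrative only):

```js
// Assume 5 vertices total across all features.
const vertexCount = 5;

// One column per numeric property, one entry per vertex, so the arrays can be
// used directly as GPU attributes alongside positions and objectIds.
const numericProps = {
  elevation: new Float32Array(vertexCount),
  price: new Float32Array(vertexCount)
};

// A hypothetical filter option to avoid building columns that aren't needed:
// geojsonToBinary(features, {numericPropKeys: ['elevation']});
```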
> Ok, this makes sense. The value of objectIds should still point to the index of the feature, right? Not a vertex within the feature?
Yes, you want: 1) a unique objectId for all vertices that belong to a logical grouping; 2) since this id can be returned by the picking process, you can also make it serve as a meaningful index.

The candidates for this index are the index in the original geojson array or the index into the "filtered" point/path/polygon-only array. You could create both; if not, I vote for the original geojson index as you suggest, since it is most likely to be meaningful to the end user.
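To make the indexing concrete, a small worked example (values are made up):

```js
// Two polygons: features[2] contributes 4 vertices, features[5] contributes 3.
const positions = new Float32Array(7 * 2);          // x,y per vertex (left zeroed here)
const polygonIndices = new Uint32Array([0, 4, 7]);  // vertex range of each polygon
const objectIds = new Uint32Array([2, 2, 2, 2, 5, 5, 5]); // per vertex: original feature index

// The polygon spanning vertices polygonIndices[0]..polygonIndices[1] came from
// features[objectIds[polygonIndices[0]]], i.e. features[2]; a picked vertex k
// maps back the same way via features[objectIds[k]].
```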
It might be a few days before I get to it, so I wanted to keep track of the ideas from #703:

- `numericKeys` could be separated by geometry type, e.g. if only one geometry type contains a certain numeric property. https://github.com/uber-web/loaders.gl/pull/703#discussion_r399522939
- Prefill numeric property arrays with `NaN`, so that unfilled items stay as `NaN`.

It looks like the `JSONLoader` isn't workerized? Is this because usually serializing/deserializing the JSON objects across threads would take longer than the parsing itself?
Now if parsing and JSON -> binary arrays can happen on a worker thread, it would be nice to add (optional) support for workers to the `JSONLoader`. Also, it might be worth it to add `.geojson` as a file type the `JSONLoader` supports.
> It looks like the JSONLoader isn't workerized? Is this because usually serializing/deserializing the JSON objects across threads would take longer than the parsing itself?
Yes. There are a few reasons.
> it would be nice to add (optional) support for workers to the JSONLoader. Also, it might be worth it to add .geojson as a file type the JSONLoader supports.
- What would the `JSONLoader` return? Probably not a geojson specific binary bundle...
- We could add an option like `options.json.binaryGeoJSON`, but that feels rather special purpose.
- Maybe a separate `GeoJSONLoader` that registers the `.geojson` file type, and belongs to the GIS category, and has `options.geojson.binary` that activates the binary conversion?

Just ideating here, there are probably quite a few more wrinkles, so please keep the discussion going...
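A rough sketch of what that could look like, following the general shape of a loaders.gl loader object (the field values and option names below are tentative, not a final API):

```js
import {geojsonToBinary} from './geojson-to-binary'; // hypothetical helper

// Tentative GeoJSONLoader sketch (not a final API).
export const GeoJSONLoader = {
  name: 'GeoJSON',
  extensions: ['geojson'],
  mimeTypes: ['application/geo+json'],
  category: 'geometry', // the GIS category
  text: true,
  parseTextSync: parseGeoJSONSync,
  parse: async (arrayBuffer, options) =>
    parseGeoJSONSync(new TextDecoder().decode(arrayBuffer), options),
  options: {
    geojson: {
      binary: false // opt in to GeoJSON -> flat TypedArray conversion
    }
  }
};

function parseGeoJSONSync(text, options = {}) {
  const geojson = JSON.parse(text);
  const features = geojson.type === 'FeatureCollection' ? geojson.features : [geojson];
  return options.geojson && options.geojson.binary ? geojsonToBinary(features) : features;
}
```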
Good points.
Since JSON is arbitrary without any data schema, while GeoJSON has a well-defined schema, I think you're right to suggest that `JSONLoader` and `GeoJSONLoader` be two separate modules. I presume `GeoJSONLoader` can reuse all of `JSONLoader`.
On the binary format itself, I've been reading more about Arrow, and it's quite exciting. While the GeoJSON to binary code is still new, should there be discussion about the binary format stability? If Deck.gl and Loaders.gl support Arrow more directly in the future, it might be worthwhile to at some point have a "GeoJSON to Arrow" utility, where maybe each geometry type is its own Arrow Table. Is it of interest to support conversion to two separate binary formats? If not, is it ok to start with the current binary arrays and move to Arrow when it's more supported?
Also, I think it's worthwhile to start thinking about how these binary arrays could be integrated directly into deck.gl. E.g. as directly supported by the `GeoJSONLayer` and `MVTLayer`. Accessors would have to be applied across both string objects and/or numeric property arrays.
> Since JSON is arbitrary without any data schema, while GeoJSON has a well-defined schema, I think you're right to suggest that JSONLoader and GeoJSONLoader be two separate modules. I presume GeoJSONLoader can reuse all of JSONLoader.
Yes. I am certainly suggesting two separate loaders, though they could go into the same module. But since `JSONLoader` is generic, it seems unfair that every `JSONLoader` user should need to import geojson specific "bloatware", so probably two modules :)
> it might be worthwhile to at some point have a "GeoJSON to Arrow" utility
Yes. That is the plan!
I believe that we can simply create a generic converter from maps of `{size, value, ...}` accessor objects to arrow tables. I believe this can cover the actual conversion. It is somewhat fiddly to get started with Arrow, I could help you get up to speed.

There is a `tables` category module for tables and an `arrow` module for Arrow specific code, we should be able to start working in those.
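A very rough sketch of such a converter, using a recent apache-arrow JS API (`tableFromArrays`; the Arrow JS API at the time of this thread differed). Multi-component attributes (size 2 or 3 positions) would need fixed-size-list columns and are left out here:

```js
import {tableFromArrays} from 'apache-arrow';

// Convert single-component accessor objects ({size: 1, value: TypedArray})
// into Arrow columns; size > 1 attributes are skipped in this sketch.
function accessorsToArrowTable(accessors) {
  const columns = {};
  for (const [name, accessor] of Object.entries(accessors)) {
    if (accessor.size === 1) {
      columns[name] = accessor.value;
    }
  }
  return tableFromArrays(columns);
}

// Usage
const table = accessorsToArrowTable({
  elevation: {size: 1, value: new Float32Array([10, 12, 9])},
  objectIds: {size: 1, value: new Uint32Array([0, 0, 1])},
  positions: {size: 2, value: new Float32Array([0, 0, 1, 1, 2, 2])} // skipped here
});
console.log(table.numRows); // 3
```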
> should there be discussion about the binary format stability?

If you mean the maps of `{size, value, ...}` "accessor objects" that you already generate, they are common across luma.gl, deck.gl and loaders.gl, and documented in luma.gl (and maybe in the mesh category in loaders.gl), so yes, they should be stable (however the current definition does not support your nested numeric props and prop objects, so we may need to expand the definition a bit, or flatten your structure).
"GeoJSON to Arrow" utility, where maybe each geometry type is its own Arrow Table
Yes, loading geojson into three arrow tables makes sense, just like we are currently loading them into three binary object maps (also note that some data sources, while technically geojson, only contain one type of geojson primitive, meaning only one table is used; we may be able to offer some options to simplify such usage).
> Also, I think it's worthwhile to start thinking about how these binary arrays could be integrated directly into deck.gl. E.g. as directly supported by the GeoJSONLayer and MVTLayer.
Yes. Note that even though we generate binary attributes, those cannot always be used directly by the GPU. They still need to be tessellated, at least for polygons (I believe some normalization may also be needed for paths). The good news is that a lot of the binary groundwork has already been done in deck, so I recommend we progress a bit more here, then we can start dealing with that.
> Yes. I am certainly suggesting two separate loaders, though they could go into the same module. But since `JSONLoader` is generic, it seems unfair that every `JSONLoader` user should need to import geojson specific "bloatware", so probably two modules :)
Yes, simple JSON loading shouldn't have to deal with this bloat... Also that means we can workerize only the GeoJSONLoader.
> I believe that we can simply create a generic converter from maps of `{size, value, ...}` accessor objects to arrow tables. I believe this can cover the actual conversion. It is somewhat fiddly to get started with Arrow, I could help you get up to speed.
Interesting. I think I overlooked the potential for keeping so much of the existing code and generating Arrow columns from the typed array output. So my question about binary format stability was mainly based on thinking that much code would need to be rewritten for Arrow, and then whether both output formats were desired.
The one other question is whether, when converting to Arrow, it's desired to include string properties, so that all properties can be included in the Arrow Table. I haven't looked too deeply into how strings are encoded; are they always encoded as a Dictionary type?
> Yes, loading geojson into three arrow tables makes sense, just like we are currently loading them into three binary object maps (also note that some data sources, while technically geojson, only contain one type of geojson primitive, meaning only one table is used; we may be able to offer some options to simplify such usage).
True. Also, a `GeoJSONLoader` probably should correctly parse geometries that aren't wrapped in a `FeatureCollection`. The current GeoJSON to binary code assumes the `Feature` array input conforms correctly to the spec.
> Yes. Note that even though we generate binary attributes, those cannot always be used directly by the GPU. They still need to be tessellated, at least for polygons (I believe some normalization may also be needed for paths). The good news is that a lot of the binary groundwork has already been done in deck, so I recommend we progress a bit more here, then we can start dealing with that.
Ah, I didn't realize the tessellation would be computed on the CPU and not the GPU? From the `PathLayer` docs, I think that if the binary attributes are correctly formed, you should be able to bypass any further normalization.
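For reference, the binary-attribute pattern in the PathLayer docs looks roughly like this (exact prop support depends on the deck.gl version, so treat this as a sketch):

```js
import {PathLayer} from '@deck.gl/layers';

// Two paths supplied as flat binary attributes. `_pathType: 'open'` tells the
// layer the paths need no further normalization, so that step is skipped.
const layer = new PathLayer({
  id: 'binary-path-layer',
  data: {
    length: 2,
    startIndices: new Uint32Array([0, 3]), // path 0: vertices 0-2, path 1: vertices 3-4
    attributes: {
      getPath: {
        value: new Float64Array([
          -122.42, 37.77, -122.41, 37.78, -122.4, 37.79,
          -74.0, 40.7, -74.01, 40.71
        ]),
        size: 2
      }
    }
  },
  _pathType: 'open',
  getWidth: 10,
  widthUnits: 'pixels'
});
```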
So what's the next step here? Create a `GeoJSONLoader` and add a binary conversion option to that and the `MVTLoader`?
> The one other question is whether, when converting to Arrow, it's desired to include string properties, so that all properties can be included in the Arrow Table. I haven't looked too deeply into how strings are encoded; are they always encoded as a Dictionary type?
No. There is a `Utf8` type. Each row will have a different number of bytes. Because of this, string columns are not usable in GPU shaders, but certainly can be accessed on the CPU, so if we want to preserve as much data as we can in an arrow table we can offer the option to include string columns.
> So what's the next step here? Create a GeoJSONLoader and add a binary conversion option to that and the MVTLoader?
Yes, `GeoJsonLoader` is a good start. For orthogonality, we should probably support this feature (binary output and worker loaders) for all GIS category loaders (this includes at least `WKTLoader` and `KMLLoader`, though I'm not 100% sure the latter will currently run on a worker due to the hacky XMLLoader). We need to think about the right combo of options. I suppose a starting point is `geojson: true` and `binary: true`?
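As a usage sketch of that option combination (the option names and namespacing follow the suggestions in this thread and are not a settled API):

```js
import {load} from '@loaders.gl/core';
import {MVTLoader} from '@loaders.gl/mvt';
// GeoJSONLoader as proposed above (hypothetical at this point).

async function loadBinary() {
  const tileUrl = 'https://example.com/tiles/14/2620/6332.mvt'; // placeholder URL

  // GeoJSON file -> flat TypedArrays instead of Feature objects
  const binaryGeojson = await load('data.geojson', GeoJSONLoader, {
    geojson: {binary: true}
  });

  // MVT tile -> geojson-style coordinates, returned as binary arrays
  const binaryTile = await load(tileUrl, MVTLoader, {
    mvt: {geojson: true, binary: true}
  });

  return {binaryGeojson, binaryTile};
}
```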
Mostly implemented in 3.0
The loaders in the GIS category, specifically the `MVTLoader`, could benefit from being able to parse data into flat binary arrays.