sofwerx / cdb2-concept

CDB modernization

Tiling Consensus #11

Open ryanfranz opened 3 years ago

ryanfranz commented 3 years ago

Question from the slack channel that got no response outside the tiling group:

I want to bring up three questions that the tiling scheme (GnosisGlobalGrid) that Jerome and I used on the whiteboard leads to:

  1. The tiling scheme does not lead to integer latitude and longitude boundaries. For example, a tile might be 1.40625 degrees on a side (i.e. no geocell sized tiles). Is this a problem for any use case?

  2. At each finer LOD, most tiles have a normal quadtree split. But at the poles, these tiles split differently. This can be thought of as introducing a new "zone" or block/tile size at each pole for every finer/higher LOD. I could see problems for a visualization/simulation if a system uses a virtual image mapping (u,v from 0,1) for terrain imagery, which allows decoupling terrain tiles from imagery tiles. Does this cause any problems for anyone's use cases?

  3. Is this tiling scheme consistent with the new focus on creating a repository CDB?
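A minimal sketch of the arithmetic behind question 1, assuming 8 top-level tiles of 90°×90° that split as a quadtree at each finer LOD (the exact GnosisGlobalGrid parameters may differ):

```python
# Rough sketch of the tile-size arithmetic behind question 1.
# Assumes 8 top-level tiles of 90 deg x 90 deg (4 columns x 2 rows)
# that split as a quadtree at each finer LOD; the real
# GnosisGlobalGrid parameters may differ.

def tile_edge_degrees(lod: int) -> float:
    """Edge length in degrees of a tile at the given LOD."""
    return 90.0 / (2 ** lod)

for lod in range(8):
    print(lod, tile_edge_degrees(lod))
```

At LOD 6 this yields 90/64 = 1.40625°, the non-integer boundary size mentioned in the question.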

I would really like someone from CAE simulation (or other realtime visualization group) to evaluate this proposed tiling scheme. We (FlightSafety) should be able to use it and have talked about some of the implications it causes with our particular system and how we use the CDB. Since the tiling scheme could impact multiple other areas of the CDB standard, I think that this is one of the first items that really needs to be nailed down and correct in a logical model. Thanks!

ryanfranz commented 3 years ago

I want to be clear that I view this tiling proposal as proposed for all use cases (except for the CDB Foundation). This is not an M&S only tiling scheme, but one that should address many of the issues that caused problems between the CDB 1.x tiling scheme and other use cases. If someone has a reason that this tiling approach does not work for a use case (as opposed to a particular device), please let me know so that we can address it and come to a compromise.

I still think that using fewer tiling schemes will increase the correlation and interoperability of CDB X.

@freemanjay - You mentioned that different devices might need different tiling schemes. I assume that is regardless of the use case. Was that a proposal for each device/system having a different CDB export process and profile/extension?

cnreediii commented 3 years ago

@ryanfranz and @freemanjay . In the last couple of OGC Vector Tile Pilots, the same tiling/LoD scheme worked just fine on desktops, smart phones, and other mobile devices. This was true in either fully connected or DDIL environments. I suspect lessons learned by MapBox, Google, and others mean that a consistent tiling approach - with LoDs, of course - can be used across devices.

freemanjay commented 3 years ago

@cnreediii @ryanfranz

I see five top-level -- maybe there are more -- "uses" to address WRT enterprise data: 1) transport, 2) storage, 3) updates, 4) visualization, and 5) analysis. These "uses" of enterprise data of course range from low-power mobile to cloud to mission command systems to high-performance image generators. For the purpose of this thread I define enterprise data as correlated rasters, vectors, and 3D models.

Tiling itself is a complex word. To me, tiling has some relationship to LODs. LODs are a form of compression and generalization, typically based on a tile of data. It seems to me that when we discuss tiling, there is a mixture of LODs and gridsets being discussed. For the purpose of this thread I define tiling as a gridset, which could be regular or irregular. I recommend we make a clear distinction between tiling and LODs.

Fundamentally a concept of tiling will have pros/cons for each of these five uses of enterprise data. Visualization is just ONE of the five uses WRT enterprise data. CDB -- as it stands today -- is a visualization-first design optimization based on a regular gridset. Other standards -- like 3D Tiles -- work on either a regular or irregular gridset. CDB -- as it stands today -- is extremely inefficient for performing analytics based on its tiling and LOD concepts. For the purpose of this thread I define analytics as using rasters, vectors, and 3D objects to solve a problem of some type. In CDB today, the data is poorly structured to answer a question such as "where are all the gas stations that are 10 miles or more from an interstate road" within the state of Florida. For analytics WRT vector data, the state of the art is NOT tiling and layers of data -- the state of the art is object-based GEOINT.

My recommendation is that CDB X have tiling profiles that best align to the use cases: 1) transport, 2) storage, 3) updates, 4) visualization, and 5) analysis. I am not suggesting that each use case have its own tiling profile; I imagine transport and storage could share one, and updates and analysis could potentially share one. Visualization would likely be its own tiling profile; however, there are groups within the visualization use case that would dispute regular vs. irregular gridsets. Similarly, I recommend that CDB X have optimization profiles that best align to the use cases.

To answer your direct question: Was that a proposal for each device/system having a different CDB export process and profile/extension?

No -- it was a recommendation that we have a set of tiling concepts that optimize the use cases of 1) transport, 2) storage, 3) updates, 4) visualization, and 5) analysis.

jerstlouis commented 3 years ago

@freemanjay @ryanfranz @cnreediii

In the OGC 2D Tile Matrix Set standard, the LODs are defined as part of a Tile Matrix Set description, because different numbers of tile rows and columns are specified at each TileMatrix (LOD) of the set to cover the area. The 2DTMS is a regular grid, but it allows coalescing tiles in some rows (e.g. to accommodate the smaller real-world area of polar regions on a plate carrée projection).
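An illustrative sketch of the coalescing idea (the numbers and helper are made up for illustration, not the normative TMS definition):

```python
# Illustrative sketch of coalescing in a variable-width tile matrix
# (not the normative OGC 2D TMS definition). In a plate carree grid,
# rows near the poles can merge adjacent columns so each stored tile
# covers a similar real-world area.

def distinct_tiles_in_row(columns: int, coalesce: int) -> int:
    """Number of stored tiles in a row when `coalesce` adjacent
    columns share one tile (coalesce must divide columns)."""
    assert columns % coalesce == 0
    return columns // coalesce

# e.g. a matrix with 8 columns: equatorial rows keep 8 tiles,
# a polar row with coalescence factor 4 stores only 2.
print(distinct_tiles_in_row(8, 1))  # 8
print(distinct_tiles_in_row(8, 4))  # 2
```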

My opinion is that all 5 of those use cases (transport, storage, updates, visualization and analysis) can greatly benefit from TMS-based tiling, and even more so if a consistent TMS is used.

But this does not prevent an application that prefers another data model (whether a different tiling scheme or no tiling at all) from adopting it, while still being able to import from and export to such a TMS-based tiled data store (e.g. through OGC API - Tiles or a GeoPackage-based CDB X).

Tiling and recombining the data to a non-tiled data store is a simple enough process, even on vector features, especially if the data is tiled to an axis-aligned grid (as with the proposed TMS tiling) and if artificial vertices introduced at tile boundaries are marked as such.
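A minimal sketch of the "mark artificial vertices" idea (the helper is hypothetical, not part of the proposal): split a segment at a vertical tile boundary and flag the introduced vertex so re-merging can drop it later.

```python
# Minimal sketch (hypothetical helper, not from the proposal) of
# clipping a 2D segment at a vertical tile boundary and marking the
# vertex introduced at the cut, so recombining tiles can remove it.

def clip_at_boundary(p0, p1, x_edge):
    """Split segment p0->p1 at x == x_edge.
    Vertices are (x, y, artificial) triples."""
    (x0, y0), (x1, y1) = p0, p1
    if (x0 - x_edge) * (x1 - x_edge) >= 0:
        return [(x0, y0, False), (x1, y1, False)]  # no crossing
    t = (x_edge - x0) / (x1 - x0)
    y = y0 + t * (y1 - y0)
    # the boundary vertex is flagged artificial in both halves
    return [(x0, y0, False), (x_edge, y, True), (x1, y1, False)]

print(clip_at_boundary((0.0, 0.0), (2.0, 2.0), 1.0))
```

Dropping the flagged vertices on recombination restores the original geometry without floating-point guesswork.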

Tiling from a transport or storage perspective facilitates access to the desired resolution and/or Area of Interest for the visualization or analysis use cases. Tiling for updates allows updating a small part of a much larger detailed feature, and allows retrieving delta change-sets by redownloading only the affected tiles. Tiling for analysis makes it easy to parallelize processing if a process can be performed on a tile-by-tile basis (or if work on individual tiles can be recombined to provide an overall output).

Note that in the proposed vector tiles attributes extension, a single attributes table is suggested with one record per feature (not one database per tile as in CDB 1.x), so that only the geometry itself is tiled; this should greatly facilitate querying the data.

cnreediii commented 3 years ago

@freemanjay - WRT your statement, "Tiling itself is a complex word. To me, tiling has some relationship to LODs. LODs are a form of compression and generalization typically based on a tile of data." First - and apologies - but tiling can actually be very simple. No need to conflate LoDs with how space is tessellated. So compression, generalization, or any other algorithmic processing of the content in a tile (or referenced to a tile) is again a separate discussion from the actual tiling model and tessellation approach. The OGC TC just approved a new OGC Abstract Specification for a Conceptual model for tiling space and a logical model "profile" for tiling 2D Euclidean space. The model is quite simple. Where complexity may enter is when additional concepts are used to extend the logical model to cover such concepts as LoD, layer, symbology, and so forth.

As to "In CDB today, the data is poorly structure to answer a question such as "where are all the gas stations that are 10 miles or more from an interstate road" within the state of Florida." I do not think that this is an issue with the tiling model but rather the physical implementation of the tiling model based on the initial CDB requirements and use cases. In an historical example from a system I helped design and implement back in the mid 1980s, this analytical search would have taken seconds - not bad for GIS software circa 1990. The systems used a variety of spatial indices against a tiled data store stored as a set of folders and files in a UNIX file system. This works when algorithms are designed and implemented to consider that the content is tiled. So I know that analytics in a tiled geospatial data store can be very, very fast!
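A sketch of the idea behind that 1980s design: a tiled file-system layout is itself a spatial index, because a tile address can be computed directly from coordinates rather than searched for. The path convention here is invented for illustration, not CDB 1.x naming.

```python
# Sketch of the "spatial index as a file-system layout" idea: a tile
# address is computed directly from coordinates, so locating content
# needs no search. Path naming is hypothetical, not CDB 1.x naming.
import math

def tile_path(lat: float, lon: float, lod: int, edge0: float = 90.0) -> str:
    """Folder path of the tile containing (lat, lon) at the given LOD."""
    edge = edge0 / (2 ** lod)
    row = math.floor((lat + 90.0) / edge)
    col = math.floor((lon + 180.0) / edge)
    return f"LOD{lod:02d}/R{row}_C{col}"

# e.g. a point near Orlando, Florida at LOD 3 (11.25 deg tiles)
print(tile_path(28.5, -81.4, 3))
```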

kevinbentley commented 3 years ago

Carl,

> So I know that analytics in a tiled geospatial data store can be very, very fast!

That misses the point entirely. You can certainly tile things and index the tiles; I have also written software that does that. The challenge is making something that works with existing tools and technologies.

There are many design choices you have to make when tiling vectors, and none of them are ideal. That's why spatial indexes like r-trees are effective: they accommodate the irregular nature of vector data. Chopping up vector data is possible, but nobody is going to convince me that it's preferable to a spatial index. I've had to reassemble vectors (and deal with the floating-point precision issues that come with that) too many times to believe that. We can debate how software 'could' be implemented, but that is wasting time and effort.

If I want to do analytics as Jay described, I'm most likely doing that with existing GIS software and/or libraries. We should be designing a system that works with current technology. CDB in its current state, with tiled shapefiles, does not make analytics like that effective. We should not be considering any form of storage for CDB vectors that isn't already supported by 2-3 COTS or FOSS applications.


ryanfranz commented 3 years ago

@freemanjay - The tiling proposal was to break the world into 8 tiles, which gets rid of the largest issue with CDB tiling: there were more than 40,000 top-level tiles in the world. Each of these tiles can be successively split as a quadtree (unless the tile touches a pole). And if a device needs a single level, or a region of the world, that is easy to extract with this tiling. So for your example of finding gas stations: if you know what level (map scale) interstate roads are in, then it is easy to find those roads, buffer them, and find gas station features that intersect the buffer. Not a fast process, depending on the size of the search area, but you wouldn't even need to reconnect the roads to do this. And an rtree doesn't gain you anything in this case.
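A toy sketch of that buffer test, in planar units with made-up data (real code would project coordinates and read features from tiles):

```python
# Toy sketch of the buffer test described above, in planar units with
# invented data (real code would project coordinates and read tiles).
import math

def dist_point_segment(p, a, b):
    """Distance from point p to segment a-b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

roads = [((0, 0), (10, 0))]            # one interstate segment
stations = {"A": (1, 2), "B": (5, 12)}
# stations 10 units or more from every road segment
far = {name for name, p in stations.items()
       if all(dist_point_segment(p, a, b) >= 10 for a, b in roads)}
print(far)
```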

One other thing that came up here is a "new" set of use cases that hasn't been discussed: 1) transport, 2) storage, 3) updates, 4) visualization, and 5) analysis. So how does tiling affect these?

  1. Transport of data (I assume via OGC Web Services) would necessitate either delivering tiles (Vector Tiles, maybe) or a custom query that could be generated from tiles or not. Either way, nothing we have talked about has focused on transport mechanisms or structures, which would be much different from a storage CDB. (Just to be clear, I think we should address this, but the storage format needs to be done first.)
  2. Storage of data is what CDB 1.x addresses, and what I assumed we were working toward. This is needed for all of the CDB profiles in the powerpoint, except maybe CDB-F (which might not look like CDB in the end anyway)
  3. Updates are easier to address if your data is tiled, so that only part of the CDB needs to be updated. I am not sure how to "update" a dataset that is a monolithic rtree, without just providing the entire dataset to the user again.
  4. Visualization - I feel like we are trying to provide reasonable compromises to the visualization case to handle tiling in CDB-T/G/A
  5. Analysis - This is the case that I understand the least. Maybe this is the one case that can't be tiled, but it seems like AI/ML should be able to handle data types that are tiled too.
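The updates point (3) can be sketched simply: hash each tile's payload and redownload only tiles whose hash changed. Tile keys and payloads below are invented for illustration.

```python
# Sketch of tile-level delta updates (use case 3): hash each tile's
# payload and redownload only tiles whose hash changed. Tile keys
# and payloads are invented for illustration.
import hashlib

def tile_hashes(tiles: dict) -> dict:
    """Map tile key -> SHA-256 hex digest of its payload."""
    return {key: hashlib.sha256(data).hexdigest()
            for key, data in tiles.items()}

old = tile_hashes({"0/0/0": b"roads v1", "0/0/1": b"rivers v1"})
new = tile_hashes({"0/0/0": b"roads v2", "0/0/1": b"rivers v1"})
changed = [k for k in new if old.get(k) != new[k]]
print(changed)  # only these tiles need to be re-sent
```

With a monolithic rtree, by contrast, there is no comparably cheap way to ship a delta.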

@kevinbentley - I keep hearing over and over that loading vector data has to be done with pre-existing tools and technology, so the input data can't be tiled. But at the same time, the assumption keeps being made that all these (COTS or FOSS) applications can handle very large and potentially very deep rtree data structures. Maybe there are several applications that I don't know about, but the only applications that come to mind are spatial database management systems. And I can't imagine that they don't support rtree structures that can find features across multiple tiles.

All - So I get that this group isn't going to see eye-to-eye on tiling vector data even if that data can't possibly fit into memory. So here are a few questions that maybe we can answer:

ccbrianf commented 3 years ago

@freemanjay Tiling is the method by which data is organized into spatial chunks. It does not necessarily have to define the format of the data inside those chunks, such as regular versus irregular grid. It is often related to LoD simply as a way to manage chunk size (and thus usually resolution, as a means to bound that) versus spatial coverage area. In that respect, I don't think we can separate the concepts easily.

In my view, your five use cases are really at most four. Updates affect the storage, and I don't think you really want to convert in and out of a profile just to do a storage update. I can see transport being subtly different from storage, but likely still very similar, except with area-of-interest and resolution bounds (something like tiling ;-) ). Analysis and visualization are two things you want to do to the storage, either directly or after it has been transported, and frequently updated. If you are willing to live with frequent conversion for frequent updates, you can certainly define more optimal formats that might differ for analysis versus visualization. I'm not entirely convinced that the conversion burden is desired or required though, because I still believe a single best-compromise format is the most reasonable approach.

"where are all the gas stations that are 10 miles or more from an interstate road" within the state of Florida, CDB 1.X approach:

  1. Load the coarsest RoadNetwork LoD tile(s) covering Florida, progressing finer until you find the interstate you are looking for. Load adjacent tiles if the feature crosses an edge and has a junction id, to complete the feature. (Junction ids don't suffer from floating-point reassembly accuracy issues, unless you're actually concerned about that level of feature resolution.)
  2. Determine the spatial accuracy/map scale of the interstate road you need to test against, then load that set of RoadNetwork LoD tiles for the geographic locations that the coarser LoD tile(s) indicate contain your interstate
  3. Load all the finest man made point feature LoD tiles within 10 miles of that feature (I presume you actually meant "or less" above, because if you really meant "or more", then that's not well supported by much of any format ;-) ) in parallel and search for gas stations within range

Not the most efficient algorithm or data organization for the task, but certainly workable at least, and I'm sure we can radically improve upon that without significantly limiting other use cases. @jerstlouis describes one such possibility, and it might be that both tiled and monolithic feature attribute tables are required, with multiple indexing methods for the former, to support all use cases efficiently while trying to normalize the data as much as possible. I think both of these approaches could even meet with @kevinbentley's requirements for existing application support, and my determinism requirements too, if carefully designed. But I concede some of the latter might push that envelope slightly. I don't think it's fair to say we can't do that in any way for this design, but I agree it would be nice to minimize that whenever possible.
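The coarse-to-fine descent in step 1 can be sketched in a few lines, assuming a quadtree addressing of (lod, row, col) tiles (the addressing convention is illustrative, not the CDB 1.x one):

```python
# Sketch of the coarse-to-fine descent in step 1: given a tile
# address (lod, row, col) in a quadtree scheme, enumerate the child
# tiles to load at the next finer LOD. Addressing is illustrative.

def children(lod, row, col):
    """The four quadtree children of a tile at the next finer LOD."""
    return [(lod + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]

print(children(2, 1, 3))
```

A search only descends into children whose coarse tile indicated the feature of interest, which bounds the number of tiles touched.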

ccbrianf commented 3 years ago

@kevinbentley and @freemanjay Kevin, you seem to be the de-facto chair of the vector tiling group. AFAIK, we haven't met since the initial group formation to discuss things. One thing that might help us make progress is to define the available menu of formats that meets your objective: "We should not be considering any form of storage for CDB vectors that don't already have support by 2-3 COTS or FOSS applications." Do we have one? This isn't my area of expertise. I'm hoping the menu is slightly larger than just GeoPackage, MVT and GeoJSON?

cnreediii commented 3 years ago

All: We have been having a really good discussion on this topic. However, I am thinking we need to perhaps defer discussion on whether to tile or not to tile and let the experiments run to completion. I do believe that there is agreement that spatial partitioning is needed regardless, such as the Oracle spatial partitioning capability, which uses spatial indexing to bin groups of features without clipping. (https://docs.oracle.com/database/121/SPATL/using-partitioned-spatial-indexes.htm#SPATL586).
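A minimal sketch of binning without clipping (the partition scheme is invented for illustration, not Oracle's): each feature lands in exactly one partition, chosen here by its bounding-box center, and its geometry is never cut.

```python
# Sketch of "bin without clipping": each feature is assigned to
# exactly one partition (here, by bbox center) and its geometry is
# never cut. Partition scheme is illustrative, not Oracle's.

def partition_of(bbox, edge=10.0):
    """Grid cell (col, row) containing the center of a
    (minx, miny, maxx, maxy) bounding box."""
    minx, miny, maxx, maxy = bbox
    cx, cy = (minx + maxx) / 2, (miny + maxy) / 2
    return (int(cx // edge), int(cy // edge))

print(partition_of((12.0, 3.0, 18.0, 7.0)))  # -> (1, 0)
```

Queries then consult the index to know which partitions to scan, and features are read back whole, with no reassembly step.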

Personally, whether data are tiled and clipped or not, I think that spatial indexing should be part of CDB-X. Spatial indexing would facilitate all use cases, such as analytics, by enabling quick determination of which tiles/partitions and features are required for processing a query without touching the content or traversing any LoD structure.