Deterministic Performance and CDB

sofwerx / cdb2-concept

CDB modernization

0 stars 1 forks source link

Deterministic Performance and CDB #8

Open cnreediii opened 4 years ago

cnreediii commented 4 years ago

Another issue. We need consensus on what we define determinism as. This definition will be documented in the ER.

From the Computer Science wiki:

Determinism -

Simply put, the same operation performed across different nodes should return the same result. ... Determinism also refers to the fact that the same operation replayed on a different node at a different point in time should also produce the same results.[2]

In computer science, a deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states. Deterministic algorithms are by far the most studied and familiar kind of algorithm, as well as one of the most practical, since they can be run on real machines efficiently.

Let the discussion begin!

This is an important discussion as this may be (or may not be) a primary CDB 2.0 design objective. Seems to me in the modelling and simulation world, we really do want determinism. Otherwise, how do we defend the results from any modelling/simulation work flow??

ccbrianf commented 4 years ago

CDB 1.X defines determinism mostly in Volume 0: 6.9.3. Improved Client-device Robustness/Determinism in terms of loading/paging consistently fixed, or maximally sized (in bytes) LoD-tiles, independent of spatial location or LoD/resolution for all datasets. But the use of the term is often expanded to apply to other aspects of loading/paging performance and consistency, such as the following previously listed in #5:

Knowing exactly where to find the type and resolution of data for the geographic region an application needs without (much) filtering or searching, either on disk or in memory.
A guaranteed upper bound and generally constant Lod-tile size (in bytes, or some equivalent approximation such as points or verticies) of that data bundle, tailored to a commonly agreed upon "maximal processing sized quanta" unit so that resource availability can be assured and memory fragmentation is eliminated due to the fixed size unit. Ideally, the LoD-tile structure will be chosen such that the size limit is rarely a practical concern for typical data, except when the production tool packs LoDs for more efficient and consistent tile sizing (GS Model example) (Vector example).
A good approximation for the amount of time it will take to retrieve and process the LoD-tile information, which should be relatively uniform across spatial regions, and LoDs of the same data type.

christianmorrow commented 4 years ago

While current ('legacy?') CDB pays homage to determinism (I choose these words very deliberately) is CDB truly deterministic? Does the format's original design provide a reliable predictor of system performance? Maybe, maybe not.

What does CDB I/O actually look like? @ccbrianf, to what degree have you instrumented your run-time applications to analyze data access performance and patterns?

I am unaware of any specific correlation between CDB folder/file/tile structure and how data is actually stored on a disk drive. There is no 'CDB-fs'...with spatially aware physical data layout on a hard drive(s) or 'CDBdefrag' that physically optimizes the layout, with an eye to to 'normalizing' any particular CDB I/O operation.

When we copy a CDB repository from one disk to another (not a mirror, just using a file system 'copy' utility) the resulting copy will NOT necessarily yield the same I/O performance in operation as the source.

In a constant state of change, certainly over the past 15 years are:

CPU primary and secondary cache size
Bandwidth between RAM and CPUs
Size and speed of system RAM
Data transfer rates between disk and RAM
Size and speed of cache memory / read-ahead buffers in hard disks
Storage capacity of hard disk drives
Data transfer rate between CPU/system memory and GPU memory
Alternatives to hard disk drives -- SSD (SATA and M.2 interfaces)
GPU memory size and speed
Effective latency and transfer rates from remote storage over networks

As any of these variables are changed, the entire caching formula changes! What might constitute a deterministic outcome on one computer system, becomes a very different outcome on another. Has anyone accurately modeled CDB read performance, based on empirical measurement of these (and more) sub-systems' actual behaviors? And then, run the model against an existing CDB repository or a simulated repository? Then, compare modeled results with observed total system behavior?

Similarly the strategy for storing and reading data changes with these variables. If we're retrieving data from storage on a device with a large 'read ahead' buffer, we can mitigate disk seek overhead by 'clumping' related data together, and aligning the data with the buffer size - then strategically reading large blocks, even if we only need some of the data. If we're requesting data over a low-medium speed serial network, and we have little or no local storage (hand held device) then the way data is stored, retrieved and transmitted would be very different.

Achieving deterministic performance is more likely a function of analysis and platform optimization rather than intrinsically expressed in a storage format -- unless the format itself is tightly constrained to a tightly constrained hardware configuration!

In the real world, system performance, and its suitability for a particular purpose is governed by the quantity of content - which is derived from design decisions as to data resolution and density of features, with a particular target platform in mind. Regardless of how data is stored, these are always the practical drivers.

In summary -- if CDB-X is to be widely consumed by multiple platforms, we should give considerable thought to abstracting CDB-X away from a presumptive so-called 'deterministic optimized storage strategy' and give much thought to 'storage optimization profiles' directed at classes of platforms. I can think of a few top level classes:

fixed location device with local storage
fixed location device with networked storage
networked mobile device
stand-alone mobile device

Lastly, as a community, we should focus on tools and methods for predicting and measuring performance implications of different system configurations and storage strategies.

ccbrianf commented 4 years ago

With respect to the criteria I listed above, if built according to specification, yes, the current CDB is about as deterministic as you can get in these areas (please notice that I did not say optimal though).

Yes, we have extensively instrumented, parallelized, and optimized our current CDB application's file IO patterns and performance using recent technology tools and tweaks to get every once of performance out of certain OS's and hardware platforms. How data is stored on a disk drive (or more likely a RAID system) is of much less concern at the standard specification level than what are the IO calls that will have to be made to the OS by the application, which is a function of the file format. Certainly fragmentation, stripe size, RAID level, storage device technology, etc. are all important with respect to absolute system level application performance, but no specification should mandate that level of implementation. Please show me any one application, regardless of the disk technology and topology behind it, where a large number of (even buffered) read system calls outperforms a single call to retrieve the same data (unless they are sequential overlapping 64k calls on a brain dead older version of Windows SMB sharing if you want to get that detailed).

Most world wide datasets like CDB are of a size that they are still stored on some type of spinning disk platform due to pure cost scalability factors. That is in the process of change, but it doesn't materially change the relative performance of how CDB structures its data on disk. Solid state technologies certainly make the cost of an IO much less, but there is still a fixed overhead cost, even on the CPU with respect to the number of system calls necessary to retrieve the data, and also all the way through to the storage media and back.

What hasn't changed is the relative order of magnitude relations of performance: CPU cache is faster than RAM, RAM is faster than disk (no matter what type), local disk is faster than network disk, and retrieving compressed data from disk and uncompressing it, either on the fly during use, or into RAM, using performance sensitive decompression algorithms, or in some cases trans-coding to hardware native compression types such as GPU texture compression, is still the best way to get all the performance possible out of a certain set of hardware. I disagree that the entire formula changes, but generally only the coefficients for each component.

The outcome on any system for a particular dataset access is still deterministic if the file format design rules have correctly accounted for determinism. Obviously, the exact magnitude of performance is relative to all those factors you describe, but the maximal memory size, for instance, is absolutely deterministic.

If you store only the data you need together in a single contiguous file buffer and read it all at once, you can't be more optimal on any of the architectures you contrast above. If your data is scattered all over in the file format, it isn't going to perform as well regardless. These are simple principals that are always true. If the specification mandates scattered data, there is no way I can fix that without simply creating my own derivative format that is no long compliant with plug and play runtime interchange across my application domain, which was one of CDB's main goals. It also defeats the entire point of having sliced and diced data if it's not also contiguously located together as a single unit in the file format.

CDB 1.X solves your real world quantity of data problem with multi-resolution LoDs, of spatially co-located contiguous file data, with a bounded memory size. Your application then chooses, based on either a-prior i analysis of the platform, assumptions based on runtime probing of the platform, or continuously averaged actual performance data (which for a deterministic design should be relatively uniform across spatial regions and LoDs as I stated previously) what quantity of content it can afford to access/process and at what rate.

You seem to be confusing determinism, which is about minimizing standard deviation of performance factors such as memory allocation size, access latency, processing time, etc. with absolute system performance. You also don't seem to have much experience actually profiling CDB IO across a number of use cases, or with the concept of plug and play interoperability of a standard repository format. However, I have always stated that in the latter case if there is consensus that different application domains or runtime environments require significant deviations in storage format, it is fine to have profiles for them as long as we all recognize what we are giving up. So in that respect I'm in agreement with you. My personal opinion is that other than the maximal memory size quanta unit, all the factors I discussed with respect to determinism are generally shared by all domains of application that would prefer multi-resolution tiled data access. Bench-marking prototypes across multiple system architectures and application domains is still a must, but I'll be very surprised if that doesn't mostly prove my points if someone actually goes to that level of "why" analysis.

It would be helpful if you would speculate on how you would change the storage format profile for each one of your proposed system architectures. Other than possibly changing the memory size quanta unit and selecting what levels of LoD tile information to store locally in some cases, I don't see much difference in needs, and you haven't gotten to the details of what you would change other than abstractly asserting they would be different.

cnreediii commented 4 years ago

Whoa - Let's drop back to what a good definition for "determinism" is for 1.) computer science and then 2.) for CDB. In computer science (or mathematics), Determinism - the same operation performed across different nodes should return the same result. ... Determinism also refers to the fact that the same operation replayed on a different node at a different point in time should also produce the same results. In computer science, a deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states. Deterministic algorithms are by far the most studied and familiar kind of algorithm, as well as one of the most practical, since they can be run on real machines efficiently.

Now, much of the discussion so far on CDB and determinism (I believe) has been conflating determinism with performance. While they could have a strong relationship, they are different concepts. For example, if I have a number of CDB requirements that state that a tile shall no more than X coordinates (or features) AND that all content for a tile shall be stored as contiguous disc blocks, these rules in a sense define a simple deterministic algorithm that indeed has real performance implications.

Maybe not the best example - but we do need to separate the discussions/concepts.

Also, I believe that the CDB 1.x file naming and storage model maps very nicely to infrastructures beyond UNIX and MS systems - such as to HADOOP/MapReduce.

And of course how one structures their storage probably should be a separate discussion from the CDB tile/LoD/filenaming model discussion.

jerstlouis commented 4 years ago

@cnreediii I believe the discussion is (and always was) specifically about deterministic performance, rather than deterministic results (though that might also be part of CDB's design goals). :) The title of this issue should reflect this.

cnreediii commented 4 years ago

:-) Indeed - However, the more general definition of determinism/deterministic is still applicable to out CDB discussions. All the rules (requirements) about storage and file naming in my mind can be implemented as a simple set of algorithms that always produce the same results.

Have a good weekend!

ryanfranz commented 4 years ago

I agree with the focus being on deterministic performance. If pure performance was needed, something like glTF would be fine, data packaged and ready for rendering. But to handle the SOCOM use case, we need to preserve the data repository use of CDB, while trying to keep as much determinism to the data as we can.

So I will try to throw some real world examples in here to see if this helps the discussion. This example is more focused on limiting the number of I/O calls needed to load data (one of Brian's focus areas), but can also apply to consistent data sizes.

In CDB 1.x, it takes one I/O operation to either find the data or tell that it doesn't exist. So best case is one I/O call to load data into memory. Worst case, it might take 4 I/O calls to load all the data from a CDB 1.x shapefile.
If CDB 1.x (or CDB X ) stores data in a zip container (for example), it might take one I/O to read the contents and one to read the data if the zip file is large enough, maybe one if the whole file is small enough to load into memory.
If CDB X stores data in a GeoPackage, there are good and poor ways of storing that data. During the GeoPackage Interoperability Experiment, we discovered that the number of tables, the size of the tables, the presence or lack of additional index tables, and even whether the data was string or integer affected performance and query speed. So for a decent sized GeoPackage, there might be multiple I/O operations to find the table, run a query to find the row(s) and load the data. Some of this can be cached, but if one is doing these operations on large GeoPackage files across a large area, you still end up with a number of IO calls.

It seems that moving to "modern" formats, with very useful new capabilities, can make it more difficult to configure for deterministic performance. In most cases, the new capabilities are really useful, so I am not saying to scrap the new stuff. Just be aware that it is sometimes more work to deal with the tradeoffs of modern formats that a standard needs to address. It has to be designed into the system. It seems easy to create a GeoPackage that would give awful performance. It took a bunch of experimentation in the Interoperability Experiment to design a GeoPackage that could operate as fast as a shapefile. If you care about performance of a system, then at some point we need to design for it.

I know that the tiling coverage group is recommending grouping a number of LODs together in a GeoPackage to optimize the file and table size and thus how fast data can be located and loaded, rather than trying to put every LOD and tile into a single GeoPackage. The grouping size would be configurable in the CDB for the use case, so that each case can handle the tradeoffs associated with larger or smaller files and tables.

One more example. Jpeg2000 imagery is horrible for decompression performance at the latency and throughput we need. But it is deterministic. There is a single I/O call to load the file or not find it. The file name allows us to not have to search a container to find data. There is a maximum data size (3MB uncompressed, so less than that). Given these characteristics, we can design hardware platforms to load faster, and/or modify the software to load the data much sooner than we otherwise would. Determinism allows us to accommodate the relatively slow decompression speeds. But if there was an unknown amount of searching needed to find the file, or the file didn't have a bound on the size, it becomes much harder to handle and possibly impossible to design a real-time graphics solution for. Great hardware will not make up for poor data layout and organization.

ccbrianf commented 4 years ago

@cnreediii, with all due respect, we can't resort to an outside definition of determinism when discussing CDB's definition. We have to define it the way CDB defined and uses that word in the existing specification, as I referenced by link directly, and indirectly through my summary (derived from searching for every use of that word in the existing specification) above, or at least that was the concept I was trying to discuss. If you want to argue that the word is used incorrectly,and another term should be used in future specifications instead, that's fine. But you can't change the meaning of the existing topical discussion as it relates to the existing CDB usage, or the reasoning behind those usage principals, especially without having the benefit of an extensive historical understanding of what the engineers who originally designed CDB intended.

I think you would have to try pretty hard to get a file/storage format to return a different result at a different time, or on a different node, given the same data :-). And about the only read side somewhat non-trivial (such as default read value) algorithm defined by CDB (that I can think of) is how to triangulate the terrain grid (which is often violated for better terrain fidelity at the cost of minor correlation discrepancies). I'm having a bit of trouble applying your definition to a file/storage format like CDB. I guess the "algorithm" to find a type of data in a particular spatial region at a particular resolution is mostly deterministic if that's one of the abstract applications of your definition, and it seems to be from your follow on comment.

Let's please use the term "contiguous file data" instead of disk blocks to avoid going down the system and storage architecture rabbit hole once again since that really isn't relevant to this discussion.

We (or maybe just I?) have always been talking about deterministic data distribution (data type/layer, spatial coherency, resolution/LOD consistency, and tile size maximum/uniformity), and then to a lesser extent relative consistency of data access performance due to those factors across differing geographic locations, data layers, and resolution/LoDs.

@jerstlouis, it's the deterministic qualities above that lead to deterministic performance. So while performance is a lot of the "why", it's not really the "what".

Table entry blobs are the only part of a geo-package that are most likely to be file contiguous, and thus one of the few aspects of performance the specification designer has any control over. If every feature in a tile is a table entry, you are almost guaranteed to have many I/Os to retrieve the data, and that will vary (non-deterministically) with factors such as how the package was created, whether a recent vacuum has occurred, how the tables are indexed, if the file is recreated by a copy like operation, etc. If the whole tile is one table entry blob as it conceptually is in CDB 1.X today as 1 file (zip of limited size or otherwise), or set of files for Shape :-P, then the only non-deterministic IO is in the btree or index lookups to find the table entry tile data blob (somewhat equivalent to today's file open operation) which can then be read with one I/O, and you can even theoretically set the sqlite cache size to accommodate the blobs specification bounded maximal (and also also average if LoD packing is employed) size to encourage that. This is a critical aspect of determinism that leads to performance in a way parallel to Ryan's description.

cnreediii commented 4 years ago

It is probably really good that we are having this discussion! Why? Well, because "determinism" in the CDB context is not defined in the glossary (Volume 3) and does not show up in Volume 1: Core until the discussion of MinElevation and MaxElevation components (Clause 5.6.1.6)!

Prior to that clause, the primary use of "determine" and related terms is in normative text such as:

This section provides the algorithm to determine the name of the directory at level 1 of the Tiles hierarchy.

@ccbrianf I agree that "We have to define it the way CDB defined and uses that word in the existing specification". This discussion hopefully provides us a definition we can use in CDB X. Further, and apologies for my poor logic above, but I think we need the general definition of the concept "determinism" in CDB X (as in the "algorithms and formulae defined in the current CDB) AND the more specific definition of the concept as used in CDB 1.x as related to deterministic performance.

This is important in terms of the normative content in CDB X that will be submitted to the OGC.

Hopefully I am not too off base - but I do need to understand so that the reader of the draft new spec also understands (this from my Editors role)

Thanks!

jerstlouis commented 4 years ago

@ccbrianf in case that was not clearly communicated, in the GeoPackage Vector Tiles Extensions developed in the Vector Tiles Pilot which we proposed to use (for use cases that do want the vector data to be tiled), a tile is one row in a tiles Table, and a blob column is your whole tile (encoded e.g. as Mapbox Vector Tiles, or an alternative compact format could be defined for it).

ccbrianf commented 4 years ago

@cnreediii Please remember determinism is discussed in Volume 0 too as linked above :-). (I'm sure it's just me, but I actually find all the separate volumes more cumbersome to use than the old monolithic one, because I'm always guessing which one some content was moved to and I haven't found a way to search them all easily, especially online.)

In the case of Min/Max elevation, determinism is the justification for what is essentially the duplication of data in that dataset that was otherwise against CDB principals. It was required for deterministic data access volumes (amount of data paged) and application performance when doing visibility calculations, so the duplication of data rule was relaxed to accommodate this (although I'm still not sure the algorithm described really works in ellipsoid relative altitude space correctly).

I agree we need both the general definition and the performance related aspects of determinism in CDB X.

@jerstlouis, I understand, but it seemed the vector geo-package working group was thinking otherwise until I had a similar discussion with them. They didn't understand the relationship between slice/dice tiles and contiguous file access, and I assume other's might not appreciate that yet as well. That group is still thinking individual table entries for a non-tiled features AFAIK (but we/I haven't met yet since).

The trick with Mapbox Vector Tiles may be defining how junctions work, both tile internal and between tiles, and the cost of having to re-clip due to the slop area around the actual tile boundary (while it hints at how to rejoin features, it makes actually reassembling, and sometimes also rendering them, somewhat more difficult). It does however seem to be one of the few such (especially NGA endorsed) vector tile standards available though. I guess Open Street Map tiles might be an alternative, but I haven't looked into that protocol buffer format in detail yet. Vectors aren't really my thing.

jerstlouis commented 4 years ago

@ccbrianf I also find the many volumes very cumbersome, and I agree, even more so for the split version, but I think it is because information necessary for implementing particular things (e.g. the basic CDB visualization) is spread out across volumes. If separate volumes dealt specifically with more advanced or specific capabilities pertaining to some use cases, it might be a lot better.

In a sense, I hope that the core + extension models will help CDB X address that problem.

I agree Mapbox Vector Tiles makes it difficult to re-join tiles with the tile border buffer approach, though we did implement that in our tools. In OGC Testbed 13, we improved and proposed the GNOSIS Map Tiles encoding (Appendix B in the ER, also more up to date spec as part of Testbed 14 CityGML & AR ER), in which border vertices are flagged to recognize artificial segments and facilitate re-assembly. We are also planning to further improve on this format this year, in terms of making it easier to implement and even more compact.

I was not aware of Open Street Map tiles, as an encoding? I thought MVT was mostly used for raw data OSM tiles.

ccbrianf commented 4 years ago

https://wiki.openstreetmap.org/wiki/PBF_Format I found mention of them reading the MVT format description here: https://docs.mapbox.com/vector-tiles/specification/#how-are-openstreetmap-pbf-files-related-to-mapbox-vector-tiles

jerstlouis commented 4 years ago

@ccbrianf The OSM PBF format is not intended for tiles at all. It is a PBF encoding of the OSM data (which also can be expressed in a much less compact XML format). It is the format in which you can retrieve the whole planet (now at 52GB) OSM database in a compact way. It has no way of expressing artificial edges at all, and in general I have found that many tools that extract PBF portions handle poorly any features crossing outside of the extent. OSM PBF is mainly intended for re-importing features in a spatial database.

OSM PBF stores features as per the OSM data model, which is comprised of Nodes, Ways and Relations. It is quite challenging to re-construct polygons and lines out of this, which needs to consider many things including data attributes to do so (we implemented that in our tools).

cnreediii commented 4 years ago

WRT CDB X document structure. I think - and we can ask the group in next weeks meeting - that CDB X shall be "integrated" into a more easily navigable document. We just need to be careful when we develop the draft spec to use the techniques that the OGC has defined and well developed for back and forth navigation of a document - even if the document is multi-volume. Using GitHub (or Lab) we can use links, anchors, etc to make life much easier for the reader/developer. Also, I think having everything related to the CDB X core together in a single document rather than scattered through multiple documents will make things much easier.

@ccbrianf - Historical note on tiled data stores, traversal across tile boundaries, feature "reconstruction" etc. Way back in the mid 1980's I led a team that designed and implemented a highly performant GIS whose spatial data store was 100% tiled. Any and all spatial data types used the same tiling structure. While each user community (such as a City) could define their own tiled data store, we could have easily replicated the CDB tiling structure. All vector data were topologically structured (edge/node/faces). The model was based on work by the UK Mil Survey, DMA Mark 90, and a product called WAMS. Any edge that crossed a tile boundary was clipped at the boundary and an artificial node called an "edge node" was created. Non of the MapBox silliness :-) The edge node record contained information about what the next tile was and what the edge ID in the next tile was. No lookup necessary. The edge node information combined with other metadata allowed for very rapid feature reconstruction - say for a polygon that was contained in or crossed dozens (or hundreds) of tiles. The did not have to know anything about the underlying tiled structure. They only "saw" whole features. And when everything was coupled with a variety of spatial indexes search and query performance was super, super fast. Visualization could be slow as there was no such thing as GPUs, glTFs, protobufs and the internet was quite slow :-) Every GIS function worked against the tiled data store. The only exception was a linked graph that could be generated for transportation networks to optimize routing.

The reason for sharing this is to understand that there are algorithms and techniques for tiled data stores that have been matured and widely used in operational environments for decades. In addition to CDB knowledge, we can leverage that broader industry know how and knowledge in CDB X

@jerstlouis - I think we need to stop using the "slice and dice" terminology. Makes things sounds as if we are taking clean data and creating a mixed salad that does not reflect the original data. I think we should use "clipped" or "clipping". Just a thought. Slice and dice sounds really unprofessional!

jerstlouis commented 4 years ago

@cnreediii I fully agree about slice and dice, sorry for using it. It does sound appetizing in a culinary context though ;) I would even use the word partition (or simply tiled) to make it clear that this can be a lossless operation.

Thank you for sharing this @cnreediii, as you know we have implemented something very similar in our tools and in the data store and format which I linked earlier from OGC Testbed 13 - Vector Tiles. What you're describing is in a sense very much how the DoD Vector Product Format is set up as well, and again I think this approach entirely still relevant today, and superior to the Mapbox visualization-oriented approach, though more complex to implement.

cnreediii commented 4 years ago

@jerstlouis Thanks! And partitioning or spatial partitioning is good - and consistent with guidance from the OAB and Architecture DWG in terms of their comments on the about to be approved Tiling Abstract Spec!

ccbrianf commented 4 years ago

Glad to hear we can help make CDB X documents more navigable!

@jerstlouis Sorry for not doing my homework and making assumptions about OSM PBF format based on the discussion placement in the MVT documentation, but I did make the appropriate disclaimers :-).

@cnreediii Your system sounds very similar to CDB 1.X with the exception of the special edge nodes (for non-network vectors; network vectors have junction ids in CDB which are somewhat similar since which tile the feature is continued in deterministic based on edge location).

There is an inherent limit to how fast you can reconstruct a feature across hundreds of tiles ;-), but that's part of why CDB has multi-resolution LoDs and feature priorities to often try and avoid that need.

I'm not sure I understand what all the spatial indexes were in your approach (I probably don't need to), and I don't doubt it was "super fast" (especially for the time), but was it deterministically/similarly fast for operations on differing datasets, at different spatial locations, and at differing resolutions? That's a large part of our discussion. Heavy use of spatial indexing can cause that to go either way depending on the structure of the underlying indexed data.

You sound like you are arguing that tiled data can be used effectively for whole feature analysis tasks because reassembly can be made straight forward and efficient, which is good because I believe the same. It's interesting though that you did have the routing network exception because that is one point of contention others have with the tiled approach. The thing about the visualization use case is that it usually prefers direct access to the tiled data, usually without reassembly, except for routing.

I totally agree that CDB is probably not unique with respect to tiled GIS data stores and algorithms, although those other areas are not my area of expertise. I do imagine that the most successful ones are fairly IO optimal and deterministic though for the majority of their use cases, which is my main point that I'm trying to keep us from abandoning, at least in a runtime performance CDB X profile.

I'm afraid the "slice and dice" term propagation is my fault. It seems to be Mr. Bentley's favorite term for the tiled/derived dataset so I've used it by reference when trying to make sure we are all on the same page with confining that to a profile, and not giving it up entirely.

One question seems to be, is there an existing vector tile standard that handles what we need (such as efficient reassembly, network junctions both within and between tiles, and compact contiguous storage :-)), or do we need to create a new one, which might be tough to settle into on this short a timeline and with such limited resources.

cnreediii commented 4 years ago

@ccbrianf - I believe we are pretty much on the same page!

WRT routing, we did that work for FedEx. They required routing and navigation information in seconds in a multi-user environment. This in the mid 1990's when the web was young, bandwidth was an interesting struggle, and . . . you catch my drift :-) I suspect the same is true today, primarily due to multi-user being thousands of requests per minute.

WRT other analytics, interestingly enough the majority of traditional GIS operations, such as overlay, buffering, distance searches, vector to raster, raster to vector, and so on work just fine in a tiled environment. No need to actually reconstruct features for most types of analysis. There are obvious exceptions, such as highly performant on-line routing and navigation.

WRT Spatial Indexes - I concur. A very typical example of using the right tools for the job at hand. If one has a highly deterministic tiled structure such as CDB 1.x, the overuse of spatial indexes is probably a waste of time. Along those lines, I am awaiting some performance stats for the use of R-Trees in large GeoPackages.

jerstlouis commented 4 years ago

@ccbrianf @cnreediii In the routing engine we developed for the Open Routing Pilot, we did have to re-construct features from their tiled data store of line features to compute the roads network. However tiles potentially could allow you to do partial updates if only one tile and segments of roads are updated. A routing network is stored and used quite differently from a line features data store, and the network should probabaly be updated along with the data. Our engine was designed with very high performance in mind and our plan this year is to test it on the global planetary OSM dataset for continent-wide routing, and assess the performance of re-construction and potentially updates as well for applying deltas to that huge 53 GB (in compact .osm.pbf) dataset.

Regarding a vector tile format, as mentioned earlier we developed one before and during OGC Testbed 13, and it is the backend data store to our GNOSIS Map Server, which can efficiently re-combine and re-tile on the fly to another tiling scheme, and would very much like to further improve it and see it adopted as an OGC standard (either as a community or regular standard), and we are planning to invest significant resources into this in the near future within the next few months.

jerstlouis commented 4 years ago

@cnreediii regarding GeoPackage spatial indexes performance, in our GNOSIS Data Store, and in the GeoPackages we generated in VTP2, we index (and store) the overall extent of the features in an R-Tree. We did not use actual ST_min or ST_max custom functions for SQLite to execute. I wonder why a simple R-tree of would suffer from those non-linear performance issues.

Such a spatial index is useful to find where a certain feature is: you can perform a localized search per attribute (avoiding searching millions of feature over the whole world), and/or zoom to a feature per its attribute without knowing where it is. So spatial indexes and tiles can definitely be complementary.

I am curious about what the actual SQL statements to create those tables and index them are, and what mechanics are being triggered -- e.g. those ST_min / max function, and what used to provide an implementation of them in the tests... SpatiaLite? Custom code written in Java? A feature's extent should only be calculated once when adding it, so that should definitely be linear.

christianmorrow commented 4 years ago

General questions to discuss:

To what extent is performance / determinism the motivation for storing vector data in tiles? Is performance the primary factor?
Do all vector datasets have to be stored using the same scheme - tiled vs. non-tiled?
Do LODs in vector network datasets make sense?

@ccbrianf mentioned my earlier comments did not include specific proposals / alternatives....so...

Changes to consider:

Increase the size of raster data tile files from 1024 x 1024 to a larger size - provide higher resolution at lower LODs, with fewer I/O operations. Are larger files perhaps more compressible? Consider making raster tile file size configurable in global scope, or on a per-geotile basis
allow/encourage internal tiling in raster data tile files (especially if files are larger than CDB 1.x)
Increase the maximum number of points per tile from 16,384 to a larger number. Consider making the maximum number of points per tile configurable (as above)
Put more information about models' characteristics on their instancing points, providing a runtime publisher with more information to use when determining what to render vs. cull. Reduce/delay 3D model read I/O, until absolutely necessary.
(this might be a bit avant-garde...) add standardized attribution that run-time publishers can 'write-back' to a CDB to preserve 'relative load time' or 'relative render time' type information. This information could be written during real-time, or during a pre-process analysis phase. Perhaps useful when iterating over a large CDB; perhaps very useful after making hardware changes on a fielded system.

These suggestions don't break with the 'plug-and-play' philosophy - certain platforms may favor organizing data one way over another; and conversion/optimization software may be helpful when moving from one platform to another.

Does it make sense to stick with JP2000?

Do we have time to experiment with OGC GML in lossy or lossless alternatives: FLIF? BPG? WebP? JPEG XR? AVIF? What are the trade-offs for storage size on disk, I/O performance, CPU resource consumption between lossy and lossless - compressed or uncompressed imagery - in the context of CDB-X. Just a thought for discussion -- I think we all agree that JP2000 is a performance bottleneck...cost of disk storage isn't the problem it was 15 years ago.

ccbrianf commented 4 years ago

@christianmorrow

Maybe it's just my biased viewpoint, but I'm having trouble coming up with any reason for storing a dataset in tiles that doesn't directly relate back to scalability, determinism, and ultimately performance. I welcome other viewpoints/inputs to broaden my perspective. Obviously if that is the case, there may be reasons not to store data in tiles for specific application domains (primarily editing, in my mind).
My opinion is that all datasets should be available using the same conceptual level tiling, or possibly also non-tiling, data models. There may be cases such as routing networks that may require additional schemes, but I think what is common should be as standard as possible, for consistency, ease of use, and because it is likely what makes sense for one makes sense for most. That said, this is a conceptual model, and thus at that level the rules may be generic enough to allow for what might appear to be considerable physical and logical model differences while still remaining consistent. In that light, most all of my input thus far is not intended to be specific to vector datasets, but rather to be more general principals that apply to all datasets.
In CDB 1.X at least, the network portion of the data is stored together with the vector portion of the data. So in one aspect at least, if you agree it makes sense to have generalized (I'm trying to use a less offensive term than vector LoD, and less specific term than map scale) versions of vectors available, then I don't see why that wouldn't also apply to a dataset that just happens to have network information/attribution as well. In any case, we need to ask how many use cases there are for this kind of data, what is the cost to those use cases if that data doesn't exist in that form versus what is the inherent cost to produce and maintain such possibly redundant data, should it be optional or required, and for which application domain profiles?

Change comments:

Others such as @jerstlouis have already questioned the 1k tile size and suggested 256 instead (presumably for mobile device compatibility?), so increasing that may be controversial. That said, depending on the file format chosen, we might be conflating file size with file internal tile size in specific encoding cases which could differ substantially, and compression plays a large role in actual IO size. I agree that for general purpose computing, especially if we are talking about rotational media, that IO in the range of 4-16 Mb is probably optimal. But I don't have much mobile device experience, and that might be too large for bandwidth contstained streaming applications. That is why I previously stated that the tile size quanta seems to be a prime candidate to fork into application domain profiles. There is generally a point of diminishing returns to increased compression efficiency with file size, especially for imagery, so I wouldn't expect too much return on the increase proposed, but I would welcome data to the contrary. I would strongly discourage arbitrary configurable raster tile size changes across spatial regions, unless you are talking about handling latitude compression. That could force applications into runtime spatial resampling over large areas in cases such as virtual texturing using uniform mappings.
I support considering internal tiling for raster data. I believe geo-package would be one option, although that may not be along the lines you had in mind.
Maximum number of points per tile: I think this goes right back to the application domain specific tile size quanta discussion. The number of points is purely a proxy for the size in bytes to assure a maximal and relatively consistently packed IO and memory allocation unit. I don't agree with spatial differences in this limit, but rather application domain usage profiles sound reasonable.
What properties do you think are missing from model point features? Bounding box information, vertex count, and significant size is present. That's most of the data necessary to do what you are suggesting from my point of view for visualization short of texture size maybe, which is independent of geometry in CDB 1.X.
Since the write back information you suggest is entirely application specific, I would not recommend it be covered by the standard. I wouldn't preclude any application from doing so if that is acceptable to its customer community who made the CDB specification requirement.

JP2k: I believe the tiling sub-group is discussing alternate imagery file formats (and I heard about a specific discussion of FLIF), but lossless compression seems strongly preferred in that group and to support data responsibility use cases. Ironically, JP2k supports some of the large file and internal tiling schemes you suggest that are completely unused in current CDB, but it also has the obvious performance draw backs you mention. I would support a few well chosen format options per application domain profile though. For visualization, Basis Universal seems to make a lot of sense, but not so much for a data repository.

lilleyse commented 4 years ago

For JP2000 - the Cesium team has been looking into KTX2 as a next-gen imagery format as part of our SOCOM work - see Cesium Milestone 1 Report.pdf for a survey of KTX2 + Basis Universal. KTX2 also supports uncompressed integer and floating point data types but we haven't compared with JP2000 as carefully.

As a side note - is there a good place for imagery discussion on github? I don't want to derail this conversation too much since this is now getting a bit off topic.

MichalaH commented 4 years ago

@lilleyse feel free to start a new issue for discussion items that need a space.

jerstlouis commented 4 years ago

In our Tiles group, we discovered http://flif.info/ which seems like a very good alternative to JP2000.

cnreediii commented 4 years ago

@ccbrianf and @christianmorrow - WRT tile matrix size for imagery, the OGC, OSGeo, and commercial industry (Google, MS, DG and others) have a huge knowledge base and experience with looking at optimal performance for the number of pixels in x and y in a tile. I believe that this is where Jerome is coming from when he suggests have 256x256 pixel tiles. @jerstlouis - Hope I have that correct :-) Yes, I believe that mobile is a big driver. AGC has funded considerable OGC standards work on experimenting and prototyping tiled content in a DDIL environment. Given SOF as a potential big user, I would guess any knowledge of performance in a DDIL environment would be good. So I suspect there are lessons to be learned there.

So, I would recommend that perhaps a little research might be appropriate before we decide on any changes (and if so what changes) to the tile size requirements for imagery in a CDB data store.

cnreediii commented 4 years ago

@lilleyse - I took a quick look at KTX2. The spec states, "The KTX file format, version 2 is a format for storing textures for GPU applications". This may not be appropriate for the general compression of any imagery that might be stored in a CDB data store. From the FLIF web site:

The Free Lossless Image Format (FLIF) is a lossless image compression format. It supports grayscale, RGB and RGBA, with a color depth from 1 to 16 bits per channel. Both still images and animations are supported. FLIF can handle sparse-color images (e.g. 256 color palettes) effectively. It has an interlaced and a non-interlaced mode; unlike PNG, interlacing usually does not come with worse compression. In terms of compression, FLIF usually outperforms other lossless compression formats like PNG, lossless WebP, BPG, JPEG 2000, JPEG XR, JPEG-LS, APNG, MNG, or GIF.

I think that the CDB X data store should accommodate the use of any image compression format but that after experiments, such as an OGC Interop Initiative, we would be in a much better position to recommend a best practice for image compression.

ccbrianf commented 4 years ago

@cnreediii While there might be web and mobile streaming industry agreement on optimal raster tile size, I'm not quite as confident that the exact same tile size is used on the back end for those data formats and server IOs. Please let me know if you are confident of that as well. As long as those 256 tiles aren't a file by themselves, but rather tiles within a much larger file format context, I'm not opposed to the change though. FWIW, our IG also uses that tile size for paging purposes internally (but it's configurable), and it has for longer than most of the other commercial technologies you cite have been around ;-). However, our CDB server still used the CDB 1.X 1k size for caching and disk IO efficiencies.

I strongly oppose accommodating any image compression format (unless that's just conceptual for future expansion) in the CDB X specification because that's a moderately high burden, especially for a performance sensitive or resource constrained application to handle. If my customer states I have to be CDB compliant, they don't care that the data producer used a crazy one off expensive to decompress format in their CDB. They just want their fast jet simulator to fly it at full resolution without load latency or other artifacts, and it's currently the contractor's responsibility to make that happen somehow without changing the CDB data. I can see a similar situation for a cell phone app in a different memory constrained dimension. IMHO, we need a few well chosen formats with different strengths and profiles for their usage.

ryanfranz commented 4 years ago

Moved imagery discussion to https://github.com/sofwerx/cdb2-concept/issues/14