manaakiwhenua / dggs-lu-tex

Latex files for paper submitted to Big Earth Data: "Using a DGGS for a scalable, interoperable, and reproducible system of land-use classification"

Page 9 Line 22-28 #3

Closed alpha-beta-soup closed 3 months ago

alpha-beta-soup commented 3 months ago

Is this a performance algorithm issue? Some details should be provided. Again, compared with Line 29-34, some pseudo code to show the difference in algorithm design would be helpful.

alpha-beta-soup commented 3 months ago

Thank you for this question; it addresses an important part of our discussion.

P8 L22-28 is the beginning of our discussion section, where we take a moment to establish from the literature that column-oriented data stores are more efficient than row-oriented data stores for certain access patterns.

We then wrote (L25-28):

For a realistic land-use classification rule, consider a query that requires three or four bands of a multiband sensor, and only a few census variables (of the hundreds that may be available). The execution of such a query could immediately avoid reading the majority of the available information when using a column-oriented data store.

Is this a performance algorithm issue?

We take this to be asking whether this benefit is really a product of the use of appropriate algorithms for data retrieval.

Consider a query that requires Bands 1, 2 and 4 (of, say, 10 available bands) of a multiband raster image (now indexed to a DGGS), plus perhaps population count, median age, and area from a large census (a vector dataset). Such a query is efficient to serve from a column-oriented format because the values of each column are stored together on disk. It is not necessary to read any data from the other seven image bands, or from the many other census variables, in order to retrieve the selected columns across all rows (i.e. across the entire region or country for which data exist) and perform the classification or produce the map.
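To make the difference in access pattern concrete, here is a minimal in-memory sketch. It is not code from the paper: the table size, the column names, and the use of a "values read" counter as a stand-in for disk I/O are all assumptions made for illustration.

```python
# Hypothetical table: 1,000 DGGS cells, each with 10 image bands and
# 200 census variables. Names and sizes are illustrative only.
N_CELLS = 1_000
COLUMNS = (
    [f"band_{i}" for i in range(1, 11)]
    + ["population_count", "median_age", "area"]
    + [f"census_var_{i}" for i in range(197)]
)
WANTED = ["band_1", "band_2", "band_4", "population_count", "median_age", "area"]

# Row-oriented layout: one record per cell, all attributes stored together.
row_store = [[float(c + j) for j in range(len(COLUMNS))] for c in range(N_CELLS)]

# Column-oriented layout: one contiguous array per attribute.
column_store = {name: [row[j] for row in row_store] for j, name in enumerate(COLUMNS)}


def scan_row_store(rows, columns, wanted):
    """Without an index, each full record is read before the wanted fields are kept."""
    positions = [columns.index(name) for name in wanted]
    values_read = 0
    result = []
    for record in rows:
        values_read += len(record)  # the whole record is touched
        result.append([record[p] for p in positions])
    return result, values_read


def scan_column_store(store, wanted):
    """Only the arrays belonging to the wanted columns are read."""
    values_read = 0
    selected = []
    for name in wanted:
        column = store[name]
        values_read += len(column)  # other columns are never touched
        selected.append(column)
    # Reassemble one list of values per cell from the selected columns.
    result = [list(values) for values in zip(*selected)]
    return result, values_read


row_result, row_cost = scan_row_store(row_store, COLUMNS, WANTED)
col_result, col_cost = scan_column_store(column_store, WANTED)
```

Both scans return the same values, but under these assumptions the row-oriented scan touches all 210 values per cell while the column-oriented scan touches only the six requested.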

The reason to opt for a column-oriented data store is to benefit scanning and aggregate functions performed over large portions (or the entirety) of a few columns of a large table. This is precisely how land-use queries tend to work: there is a lot of input data, but for any given class only a relative handful of columns are scanned, and the order of the rows is not important (such queries are unlikely to require pair-wise distance calculations).

In a row-oriented data store, the use of indexes could provide a comparable benefit to column-orientation in this case, but this comes at the cost of computing and storing each index. Where queries are dynamic, it is not always possible to know in advance which columns would benefit from indexing. In the absence of indexes, scanning a "wide" row-oriented table requires a lot of I/O spent on irrelevant data.

Some details should be provided.

Again, compared with Line 29-34, some pseudo code to show the difference in algorithm design would be helpful.

L29-34 continue our discussion of the benefits of column-oriented data (which is how we have proposed organising geographic data indexed in a DGGS). They address the compression of data: if data are written in column-orientation and partitioned in space, and given the assumed presence of spatial autocorrelation in land-use data, then it stands to reason that adjacent values within a column are often identical or very similar. In simpler terms, for two adjacent DGGS cells the label "forest" will very often occur next to the label "forest"; the cells will have a similar canopy height; and they will often belong to the same property parcel. Data organised in this way compress very efficiently. Indeed, if the data are sorted by one of the columns, then that particular column will be highly compressible, for example with run-length encoding.
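As an illustration of why such columns compress well, the following is a minimal sketch of run-length encoding, a standard technique and not one introduced in the paper. The labels and their ordering are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Hypothetical land-use labels for 12 DGGS cells, stored in spatial order,
# so that neighbouring cells tend to share a label.
spatial_order = ["forest"] * 5 + ["pasture"] * 3 + ["forest"] * 2 + ["urban"] * 2

# The same 12 labels stored with no spatial coherence.
interleaved = [
    "forest", "pasture", "forest", "urban", "forest", "pasture",
    "forest", "urban", "forest", "pasture", "forest", "forest",
]

runs_spatial = run_length_encode(spatial_order)           # 4 runs
runs_sorted = run_length_encode(sorted(spatial_order))    # 3 runs, one per label
runs_interleaved = run_length_encode(interleaved)         # 11 runs
```

The spatially ordered column needs 4 runs and the sorted column 3, against 11 for the interleaved one: the fewer the runs, the smaller the encoded column.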

However, we have not proposed any new algorithms for compressing data, reading compressed data, or decompressing data, so there is no particular algorithm of our own to describe.

Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos and Samuel Madden (2013), "The Design and Implementation of Modern Column-Oriented Database Systems", Foundations and Trends® in Databases: Vol. 5: No. 3, pp 197-280. http://dx.doi.org/10.1561/1900000024

Our argument here reflects the views expressed in Abadi et al. (2013), who have published extensively on the topic of column-oriented databases, and whose work we have already cited. In particular, pp. 232-238 of Abadi et al. (2013) form an entire section on the compression of data in column-oriented data stores, including several citations to compression algorithms for use in a column-store.