ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0
81 stars 20 forks source link

`createSSURGO()` should create indices for tables/columns commonly used by soilDB queries #317

Closed brownag closed 9 months ago

brownag commented 10 months ago

Following from discussion here: https://github.com/ncss-tech/soilDB/discussions/312#discussioncomment-7603784

createSSURGO() creates primary keys for each table but that is it. Many of the soilDB SDA/SSURGO queries involve grouped operations on values at e.g. mapunit or component level.

There are some low-hanging-fruit that we can apply indices to columns that are used in several tables: such as mukey, cokey, chkey, and also probably some more exotic stuff like mrulename, nationalmusym or areasymbol (which aren't frequently used in the workhorse queries, but often are used in a first-step query in soilDB and likely would benefit from either use of a numeric key in the query itself, or generally from additional index). Likely there are several cases where queries should be refactored to better utilize either primary, or additional, keys when present.

SSURGO Portal uses a template database, which allows many of the foreign key relationships to be explicitly included in the database before it is built. In contrast, createSSURGO() builds from data.frame information (derived from SSURGO .txt files) without any concept of a "template". Borrowing CREATE TABLE statements that define the Portal schema, and setting up the tables and relationships before inserting data in createSSURGO() and we could get better parity with SSURGO Portal databases. Note however that some SSURGO Portal tables have fewer columns--need to sort that out.

brownag commented 9 months ago

This (adding indices) is done in https://github.com/ncss-tech/soilDB/commit/b97d4c8b4819d04edd326b7ea42b0bcce46c7074

I think creating template databases, and going down the path of creating the custom table creation statements, is a different, more involved issue that is possibly out of scope for soilDB. It represents a fundamental difference in design between SSURGO Portal and createSSURGO(). A benefit of current createSSURGO() is we are not locked in to any specific schema nor need for file--based templates.

dylanbeaudette commented 9 months ago

Good call:

I do think that a short tutorial on building a custom DB from createSSURGO() is worth the effort and I'm happy to help.