Settle on data table structure

wright13 commented 3 years ago

How do we want the data to be formatted for use in this package? E.g. columns present (should each data table have Unit_Code, other columns from Visit?), column name format (I think this is already pretty consistent). Another way to think of this is how should a csv export of the raw data look?

@wright13 will work on getting more familiar with the data. @jakegross808 will let Sarah know which views/queries to look at (a view is basically a saved SQL query).

jakegross808 commented 3 years ago

Uploaded an old .R script for connecting to the SQL database (connection path removed) in case the script or table structure is helpful: old_SQL_FTPC_database_script.r

wright13 commented 3 years ago

That script is helpful! So looking at the data export you sent me, my understanding is that...

- Lg_Trees and Sm_Trees are filtered versions of the same dataset
- Seedlings is a subset of the data in Sm_Woody_Tally
- Species_coverage_High and Species_coverage_Low are species/life form presence in high and low understory tiers

Correct me if I'm wrong on any of that. Are there any other exports in the old script that we want to include in the package? Table structure wouldn't necessarily be identical, just thinking about compiling a list of data tables that we want to have.

wright13 commented 3 years ago

Gathering my thoughts ahead of our call - no need to respond to this before we chat :)

Which event- and site-level columns should be included in all data tables?
- Unit_Code
- Sampling_Frame
- Community
- Plot_Number
- Plot_Type (assuming we may want to filter on this?)
- QA_Plot (do we need to filter on this? How are QA plots different?)
- Start_Date
- End_Date
- Verified
- Certified
What species columns should be included in data tables? Additional info can always be pulled in from Species lookup, but it's usually helpful to at least have a column for species code, and maybe scientific name, nativity, anything else that will be used often.
- Species code
- Sci name
- Family
- Nativity
- Life form
How do we want to break the data up into tables? The two approaches that come to mind are one table per parameter (as listed in the background doc) or one table per sampling method. Or somewhere in between. Table list:
- Site
- Plot
- Event
- Species
Should data table wrangling happen in the databases (with views/queries) or in R?
- In R
Do we want to set the R package up to read in all the data from the start? Or pick a few tables to begin with and add on later?
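One way to keep the event- and site-level columns consistent across tables would be a small check like the sketch below. It assumes the column list proposed above is the final one (several entries are still open questions), and `common_cols` and `missing_common_columns()` are hypothetical names, not something that exists in the package. The example data frame holds placeholder values only.

```r
# Proposed event- and site-level columns, copied from the list above.
# Still under discussion, so treat this vector as an assumption.
common_cols <- c(
  "Unit_Code", "Sampling_Frame", "Community", "Plot_Number", "Plot_Type",
  "QA_Plot", "Start_Date", "End_Date", "Verified", "Certified"
)

# Hypothetical helper: returns the required columns that a table lacks.
missing_common_columns <- function(df, required = common_cols) {
  setdiff(required, names(df))
}

# Placeholder table with only three of the proposed columns.
example <- data.frame(
  Unit_Code   = "XXXX",
  Plot_Number = 1,
  Start_Date  = as.Date("2020-01-01")
)

# Lists the proposed columns that still need to be joined in.
missing_common_columns(example)
```

A check like this could run on each data table after wrangling, so a missing column shows up early rather than in a downstream summary.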

wright13 commented 3 years ago

To do:

[ ] Sarah will set up package to read from db
[ ] Jake will come up with a list of data tables and write some code to select appropriate columns from data tables

jakegross808 commented 3 years ago

FTPC database documentation with relationship diagram

focal_terr_plants_db_documentation_20110131.pdf

jakegross808 commented 3 years ago

Fig.1 in that doc shows the relationship. These are the Tables needed:

- tbl_Lg_Woody_Individual
- tbl_Tree_Canopy_Height
- tbl_Presence
- tbl_Sm_Woody_Tally
- tbl_Understory_Cover
- tbl_Woody_Debris

Would it be easiest to join/relate all associated tables and then pare down columns?

wright13 commented 3 years ago

I think I would pare down the columns first, just to reduce the amount of data that we have to retrieve from the database. You'll probably have to keep the ID columns until after joining, though. Either way will work fine!
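A rough sketch of that order of operations, using small in-memory data frames as stand-ins for the database tables. The values are made up, the `events` table layout is an assumption (the real tbl_Events may differ), and `Entered_By` is a hypothetical column included only to show something being dropped.

```r
library(dplyr)

# Toy stand-in for tbl_Lg_Woody_Individual (values are invented).
lg_woody <- data.frame(
  Large_Woody_ID = c(1, 2, 3),
  Event_ID       = c(10, 10, 11),
  Species_ID     = c(100, 101, 100),
  DBH            = c(12.5, 30.1, 8.4),
  Sort_Order     = c(1, 2, 3),
  SSMA_TimeStamp = c("a", "b", "c")
)

# Toy stand-in for an events table (assumed layout).
events <- data.frame(
  Event_ID   = c(10, 11),
  Start_Date = as.Date(c("2019-05-01", "2019-06-15")),
  Verified   = c(TRUE, FALSE),
  Entered_By = c("x", "y")  # hypothetical column we do not need
)

# Step 1: pare down each table, keeping the ID columns needed for joining.
lg_woody_small <- lg_woody %>%
  select(Large_Woody_ID, Event_ID, Species_ID, DBH)
events_small <- events %>%
  select(Event_ID, Start_Date, Verified)

# Step 2: join on the shared key.
lg_woody_joined <- lg_woody_small %>%
  left_join(events_small, by = "Event_ID")

# Step 3: drop the ID column that was only needed for the join.
lg_woody_final <- lg_woody_joined %>%
  select(-Event_ID)
```

With a real database connection, the same select() and left_join() calls on tbl() objects are translated to SQL by dbplyr, so the paring down would happen before any rows are retrieved.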

jakegross808 commented 3 years ago

OK, I'm going through each table in the relationship diagram and using select() to choose columns, like this:

tbl_Lg_Woody_Individual <- tbl(DB, "tbl_Lg_Woody_Individual") %>%
  select(Large_Woody_ID, Event_ID, Species_ID, Life_Form, Quad, Status, Height, Height_Dead, Boles, DBH, DBH_Basal, Vigor, Fruit_Flower, Rooting, Foliar, Caudex_Length, Shrublike_Growth, Resprouts, Measurement_Type)

jakegross808 commented 3 years ago

Or is it better to just show the columns being dropped maybe??

tbl_Lg_Woody_Individual <- tbl(DB, "tbl_Lg_Woody_Individual") %>% select(-Sort_Order, -SSMA_TimeStamp)

jakegross808 commented 3 years ago

I guess that could screw things up, though, if a column we don't want suddenly got added to the table.
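To make that concern concrete, here is a minimal sketch comparing the two select() styles when a table gains a column. The data frames are toy stand-ins with invented values, `Internal_Notes` is a hypothetical new column, and `drop_cols()` / `keep_cols()` are illustrative helper names only.

```r
library(dplyr)

# Toy version of the table as it looks today (values are invented).
old_tbl <- data.frame(
  Large_Woody_ID = 1:2,
  DBH            = c(12.5, 30.1),
  Sort_Order     = 1:2,
  SSMA_TimeStamp = c("a", "b")
)

# Same table after someone adds a column we do not want (hypothetical).
new_tbl <- old_tbl %>%
  mutate(Internal_Notes = c("x", "y"))

# Style 1: name the columns to drop.
drop_cols <- function(df) df %>% select(-Sort_Order, -SSMA_TimeStamp)

# Style 2: name the columns to keep.
keep_cols <- function(df) df %>% select(Large_Woody_ID, DBH)

names(drop_cols(new_tbl))  # the new column slips through
names(keep_cols(new_tbl))  # output is the same as before the change
```

Naming the columns to keep is more typing, but the resulting table structure stays fixed when the database changes. It will also error if a kept column is ever renamed or removed, which is arguably a useful early warning.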

jakegross808 commented 3 years ago

Added "table_structure.R" to the repository. It contains the select() columns for each data table (and sub-data table, if needed).

Still need to do tbl_Events, tbl_Plots, tbl_Locations, tbl_Site, tlu_Species, and xref_Park_Species_Nativity.

jakegross808 / pacn-veg-package

Settle on data table structure #7