Add generic equations and mechanism to select them

teixeirak commented 5 years ago

We need to add the Chojnacky et al. 2014 equations to the sitespecies table, both to give users the option to go with the generic option and to handle issue #69.

Necessary steps:

1- @gonzalezeb, add the equations to the table. For site, put something like "Any" or "NA". For species, fill in the specificity for which the equation is designed (e.g., "Picea spp.").
2- @maurolepore, we'll then need a mechanism in the code to identify and assign these equations. They should be selected when (1) they are the only equations available for the species at a given DBH or (2) the user selects the generic equation option.

maurolepore commented 5 years ago

Good!

Mostly as a note to myself, some useful code was in allo_order(). I removed it recently via this commit. I can easily revert it and build on top of that.

teixeirak commented 5 years ago

Agreed. I think we'll want that functionality, but the user should be able to enter the command for how to prioritize equations just once.

maurolepore commented 5 years ago

@teixeirak,

... For site, put something like "Any" or "NA". For species, fill in the specificity for which the equation is designed (e.g., "Picea spp.").

It seems that generic equations are already associated with sites and species. It seems like a contradiction (how can an equation be generic and at the same time site- and species-specific?), but it is what I need to match each species and site with an equation, so I won't complain. I just note this as something that I need to understand better.

library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3

allodb::master_tidy() %>% 
  select(
    site, 
    equation_group, 
    species, 
    equation_id, 
    dependent_variable_biomass_component
  ) %>% 
  filter(equation_group == "Generic") %>% 
  unique()
#> Joining `equations` and `sitespecies` by 'equation_id'; then `sites_info` by 'site'.
#> # A tibble: 232 x 5
#>    site     equation_group species      equation_id dependent_variable_bio~
#>    <chr>    <chr>          <chr>        <chr>       <chr>                  
#>  1 lilly d~ Generic        Asimina tri~ ae65ed      Total aboveground biom~
#>  2 lilly d~ Generic        Carpinus ca~ ae65ed      Total aboveground biom~
#>  3 lilly d~ Generic        Celtis occi~ ae65ed      Total aboveground biom~
#>  4 lilly d~ Generic        Paulownia t~ ae65ed      Total aboveground biom~
#>  5 lilly d~ Generic        Rhus typhina ae65ed      Total aboveground biom~
#>  6 scbi     Generic        Ailanthus a~ ae65ed      Total aboveground biom~
#>  7 scbi     Generic        Asimina tri~ ae65ed      Total aboveground biom~
#>  8 scbi     Generic        Berberis th~ ae65ed      Total aboveground biom~
#>  9 scbi     Generic        Carpinus ca~ ae65ed      Total aboveground biom~
#> 10 scbi     Generic        Celtis occi~ ae65ed      Total aboveground biom~
#> # ... with 222 more rows

Created on 2019-03-21 by the reprex package (v0.2.1)

teixeirak commented 5 years ago

Here's the plan that we've worked out:

Erika has already reviewed the current species lists for temperate sites and assigned the best available equations, which in some cases are generic. There is value to keeping all of these links to generic equations in the sitespecies table because it indicates (1) that the species is present at the site and (2) the species has been reviewed, and it was determined that the generic equation was the best available option. Thus, what we currently have stays as is.

Generic equations can also be applied to any site within a specified region (e.g., temperate North America), including for stems at ForestGEO sites where the DBH is greater than the upper DBH limit of the expert-selected equation. For these, the site species table will contain records with: site = "any temperate NA" (or such) species = e.g., "Quercus sp." (any Quercus)

Note that for species that are specifically assigned a generic equation, the record in the sitespecies table is superfluous from a coding perspective. However, it's important data from the perspective of someone who may want to pull up the list of species and associated allometries for a given site, and to disambiguate between species that have never been reviewed (e.g., if a new species shows up at a site) and those that have been reviewed but found to have no specifically appropriate allometry.

Does this make sense? If it would be useful, I could add an example to the sitespecies table.

maurolepore commented 5 years ago

Just to make sure we are on the same page. I assume that the information that distinguishes "Expert" from "Generic" equations is already encoded in the column sitespecies$equation_group.

library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)

sitespecies %>% 
  filter(equation_group == "Expert") %>% 
  select(equation_group, everything())
#> # A tibble: 540 x 11
#>    equation_group site  family species species_code life_form equation_id
#>    <chr>          <chr> <chr>  <chr>   <chr>        <chr>     <chr>      
#>  1 Expert         Lill~ Sapin~ Acer r~ 316          Tree      7c72ed     
#>  2 Expert         Lill~ Sapin~ Acer r~ 316          Tree      2060ea     
#>  3 Expert         Lill~ Sapin~ Acer s~ 318          Tree      a4d879     
#>  4 Expert         Lill~ Rosac~ Amelan~ 356          Tree      c59e03     
#>  5 Expert         Lill~ Rosac~ Amelan~ 356          Tree      96c0af     
#>  6 Expert         Lill~ Rosac~ Amelan~ 356          Tree      529234     
#>  7 Expert         Lill~ Jugla~ Carya ~ 409          Tree      9c4cc9     
#>  8 Expert         Lill~ Jugla~ Carya ~ 402          Tree      9c4cc9     
#>  9 Expert         Lill~ Jugla~ Carya ~ 403          Tree      9c4cc9     
#> 10 Expert         Lill~ Jugla~ Carya ~ 407          Tree      9c4cc9     
#> # ... with 530 more rows, and 4 more variables: equation_taxa <chr>,
#> #   notes_on_species <chr>, wsg_id <chr>, wsg_specificity <chr>

sitespecies %>% 
  filter(equation_group == "Generic") %>% 
  select(equation_group, everything())
#> # A tibble: 232 x 11
#>    equation_group site  family species species_code life_form equation_id
#>    <chr>          <chr> <chr>  <chr>   <chr>        <chr>     <chr>      
#>  1 Generic        Lill~ Annon~ Asimin~ 367          Tree      ae65ed     
#>  2 Generic        Lill~ Betul~ Carpin~ 391          Tree      ae65ed     
#>  3 Generic        Lill~ Jugla~ Carya ~ <NA>         Tree      1c1ac8     
#>  4 Generic        Lill~ Canna~ Celtis~ 462          Tree      ae65ed     
#>  5 Generic        Lill~ Fabac~ Cercis~ 471          Tree      7f7777     
#>  6 Generic        Lill~ Rosac~ Cratae~ 500          Shrub     f08fff     
#>  7 Generic        Lill~ Laura~ Linder~ 609          Shrub     f08fff     
#>  8 Generic        Lill~ Paulo~ Paulow~ 712          Tree      ae65ed     
#>  9 Generic        Lill~ Rosac~ Prunus~ 762          Tree      f08fff     
#> 10 Generic        Lill~ Anaca~ Rhus t~ 899          Shrub     ae65ed     
#> # ... with 222 more rows, and 4 more variables: equation_taxa <chr>,
#> #   notes_on_species <chr>, wsg_id <chr>, wsg_specificity <chr>


Erika has already reviewed the current species ...

Generic equations can also be applied to any site within a specified region (e.g., temperate North America), including for stems at ForestGEO sites where the DBH is greater than the upper DBH limit of the expert-selected equation.

Just noticing that I don't yet see a column encoding region.

library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)

master() %>% 
  select(matches("region"))
#> Joining `equations` and `sitespecies` by 'equation_id'; then `sites_info` by 'site'.
#> # A tibble: 769 x 0


For these, the site species table will contain records with: site = "any temperate NA" (or such) species = e.g., "Quercus sp." (any Quercus)

Okay, I'll have to work around mismatches. Remember that this is how the code allocates an equation from allodb to each dbh value in the user data: the user's data is matched exactly to allodb data via the keys species, site, and equation_id. I can massage your tables, but just be aware that the more mismatches there are, the more complex the logic becomes (and the greater the chance of bugs). For example:

I can convert site = "any" in allodb on the fly to become site = <the site in the census data>; that way, say, site = "scbi" in allodb will match site = "scbi" in the user's data.
"Genus species" in user's data won't match "Genus sp." in allodb. I can transform the string from allodb on the fly to match "genus" in allodb with "genus" on the user's data.
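
The two on-the-fly conversions described above could be sketched roughly as follows, in base R. This is only an illustration under stated assumptions: the tables, the equation ids `eq_generic` and `eq_expert`, and the helper `genus_of()` are made up, and this is not the actual fgeo.biomass code.

```r
# Hypothetical allodb-like rows; the values are illustrative only.
allodb_rows <- data.frame(
  site = c("any temperate NA", "scbi"),
  species = c("Quercus sp.", "Acer rubrum"),
  equation_id = c("eq_generic", "eq_expert"),
  stringsAsFactors = FALSE
)

# Hypothetical census rows from a single site.
census <- data.frame(
  site = c("scbi", "scbi"),
  species = c("Quercus alba", "Acer rubrum"),
  stringsAsFactors = FALSE
)

# Helper (an assumption): the genus is the first word, lowercased.
genus_of <- function(x) tolower(sub(" .*$", "", x))

# 1. Replace the "any ..." placeholder with the site of the census data.
current_site <- unique(census$site)
is_any <- grepl("^any", allodb_rows$site)
allodb_rows$site[is_any] <- current_site

# 2. Key genus-level rows ("Genus sp.") by genus and all others by full name.
is_genus_level <- grepl(" spp?\\.?$", allodb_rows$species)
allodb_rows$key <- ifelse(
  is_genus_level,
  genus_of(allodb_rows$species),
  tolower(allodb_rows$species)
)

# A census row matches on its full name first, then on its genus.
census$equation_id <- vapply(seq_len(nrow(census)), function(i) {
  same_site <- allodb_rows$site == census$site[i]
  exact <- same_site & !is_genus_level &
    allodb_rows$key == tolower(census$species[i])
  genus <- same_site & is_genus_level &
    allodb_rows$key == genus_of(census$species[i])
  hit <- c(allodb_rows$equation_id[exact], allodb_rows$equation_id[genus])
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))
```

Every extra transformation of this kind is one more place where the two tables can silently disagree, which is the cost being pointed out here.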

If on top of this you expect to use different kinds of generic equations, say "any-temperate" and "any-tropical", then the code needs to somehow know which species are temperate and which are tropical. That could take quite some time to do.

Note that for species that are specifically assigned a generic equation, the record in the sitespecies table is superfluous from a coding perspective.

Can you say this again? I don't understand. There seem to be two kinds of generic equations. If this is the case, think about whether you can design the table in a way that is straightforward to code. The more ambiguity there is in a single variable (one column of a table in allodb encoding more than one thing), the harder it is to write readable and reliable code.

However, it's important data from the perspective of someone who may want to pull up the list of species and associated allometries for a given site, and to disambiguate between species that have never been reviewed (e.g., if a new species shows up at a site) and those that have been reviewed but found to have no specifically appropriate allometry.

How about you encode whether an equation id should not be used by the code? All I need is a 1-column table of the equations to skip.

equations_to_skip
"abc123"
"opq321"
...
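
As a rough illustration of how such a skip list could be consumed, here is a minimal base R sketch. The `candidates` table and the id `def456` are made up; only the idea of a one-column table of ids to skip comes from the comment above.

```r
# Hypothetical one-column table of equations the code should skip.
equations_to_skip <- data.frame(
  equation_id = c("abc123", "opq321"),
  stringsAsFactors = FALSE
)

# Hypothetical candidate matches; only `equation_id` matters here.
candidates <- data.frame(
  rowid = c(1, 1, 2),
  equation_id = c("abc123", "def456", "opq321"),
  stringsAsFactors = FALSE
)

# Keep only candidates whose equation is not in the skip list (an anti-join).
keep <- !candidates$equation_id %in% equations_to_skip$equation_id
usable <- candidates[keep, ]
```

The records themselves would stay in allodb for anyone browsing species and allometries per site; only the calculation would ignore them.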


Does this make sense? If it would be useful, I could add an example to the sitespecies table.

Thanks, an example will certainly help. Best is a minimal, fake example that captures the essence of what you want to convey.

maurolepore commented 5 years ago

I'll let this conversation develop a bit to better understand the next actions. For now, I'll drop the generic equations that shouldn't get mixed with expert equations (see https://github.com/forestgeo/fgeo.biomass/issues/28), which should immediately improve the accuracy of the biomass estimates we get.

teixeirak commented 5 years ago

Please don't drop anything we currently have; it's all wanted (i.e., Erika has determined those to be the best options). I'll clarify more when I get a chance.

maurolepore commented 5 years ago

By drop I mean exclude from the calculation. I won't touch anything in allodb.

Right now, the biomass values incorrectly sum biomass that comes from expert AND from generic equations. Only one of them should be used. And we still don't support generic equations, so I'll exclude them until we do, so that the results we get in the meantime are more correct.

# Now (incorrect)
rowid    site   species   dbh  equation  equation_group  biomass
1        "scbi" "Aaa aaa" 10   dbh * 10  "Generic"       100
1        "scbi" "Aaa aaa" 10   dbh * 10  "Expert"        100
---
biomass result = 200

# Soon (correct)
rowid    site   species   dbh  equation  equation_group  biomass
1        "scbi" "Aaa aaa" 10   dbh * 10  "Expert"        100
---
biomass result = 100
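
The same fake example can be reproduced as a small base R sketch. The data frame `matches` is invented to mirror the table above and is not taken from fgeo.biomass.

```r
# Fake matches mirroring the table above: one tree matched by two equations.
matches <- data.frame(
  rowid = c(1, 1),
  site = "scbi",
  species = "Aaa aaa",
  dbh = 10,
  equation_group = c("Generic", "Expert"),
  stringsAsFactors = FALSE
)

# Both fake equations are `dbh * 10`.
matches$biomass <- matches$dbh * 10

# Now (incorrect): summing by rowid counts the same tree twice.
sum_all <- tapply(matches$biomass, matches$rowid, sum)

# Soon (correct): exclude generic rows from the calculation, then sum.
expert_only <- matches[matches$equation_group != "Generic", ]
sum_expert <- tapply(expert_only$biomass, expert_only$rowid, sum)
```

Here `sum_all` holds 200 for rowid 1 and `sum_expert` holds 100, matching the two results above.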

teixeirak commented 5 years ago

No, please don't exclude. The "Generic" equations are still expert-selected (i.e., identified by Erika as the best available). Do you have any real examples where a "Generic" and "Expert" equation are given for the same species at the same site, and at the same DBH? (I suppose there are some where max dbh of a small tree equation overlaps with min dbh of a large tree equation. Any others?) Perhaps rename these "generic" and "specific"? (But wait to see what Erika thinks.)

teixeirak commented 5 years ago

Here's an example of what we want (using a real example):

site	family	species	equation ID	equation_group	max DBH
scbi	Aceraceae	Acer rubrum	7c72ed	Expert / Specific	55
scbi	Aceraceae	Acer negundo	d6be5c	Generic	66
any.temperate.NA	Aceraceae	any	d6be5c	Generic	66

Here's how we want it to work in several cases:

1- 30 cm Acer rubrum at SCBI - use equation 7c72ed (row 1)
2- 30 cm Acer negundo at SCBI - use equation d6be5c (row 2) with knowledge that expert review has determined this to be the best available
3- 60 cm Acer rubrum at SCBI - use equation d6be5c (row 2) because there's no record of an appropriate equation for Acer rubrum >55cm at SCBI (but because there's a record for Acer rubrum at SCBI, we know it has been expert-reviewed)
4- 30 cm Acer newspecies (a species new to the plot in 2023 census) or Acer sp. (unidentified Acer) at SCBI - use equation d6be5c (row 3) because there are no records for the species in the sitespecies table
5- 30 cm Acer rubrum at [hypothetical future ForestGEO site in NA] - use equation d6be5c (row 3) because there are no records for the site in the sitespecies table
(6- 70 cm Acer at any site -- TBD. It's dangerous to extrapolate beyond the limit of the equation, but in some cases we'll have to.)
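
One way to read these cases is as a two-step lookup: try the site- and species-specific record first, then fall back to the regional generic record for the family. The base R sketch below assumes that reading. The function `choose_equation()` and the column name `dbh_max` are hypothetical, and in this sketch cases 3 to 5 all resolve through the regional row (row 3), which carries the same equation id d6be5c.

```r
# The example table from the comment above, with DBH limits in cm.
sitespecies <- data.frame(
  site = c("scbi", "scbi", "any.temperate.NA"),
  family = "Aceraceae",
  species = c("Acer rubrum", "Acer negundo", "any"),
  equation_id = c("7c72ed", "d6be5c", "d6be5c"),
  equation_group = c("Expert", "Generic", "Generic"),
  dbh_max = c(55, 66, 66),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick one equation id for a single stem.
choose_equation <- function(site, family, species, dbh, table = sitespecies) {
  in_range <- dbh <= table$dbh_max

  # First choice: a record for this exact site and species.
  specific <- table$site == site & table$species == species & in_range
  if (any(specific)) return(table$equation_id[specific][1])

  # Fallback: the regional generic record for the family.
  generic <- table$site == "any.temperate.NA" & table$family == family &
    table$species == "any" & in_range
  if (any(generic)) return(table$equation_id[generic][1])

  # Beyond every DBH limit: left undecided, as case 6 is still TBD.
  NA_character_
}

case_1 <- choose_equation("scbi", "Aceraceae", "Acer rubrum", 30)
case_2 <- choose_equation("scbi", "Aceraceae", "Acer negundo", 30)
case_3 <- choose_equation("scbi", "Aceraceae", "Acer rubrum", 60)
case_4 <- choose_equation("scbi", "Aceraceae", "Acer newspecies", 30)
case_5 <- choose_equation("newsite", "Aceraceae", "Acer rubrum", 30)
case_6 <- choose_equation("scbi", "Aceraceae", "Acer rubrum", 70)
```

The sketch does not record whether a species was expert-reviewed at a site; it only picks an equation id.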

maurolepore commented 5 years ago

Do you have any real examples where a "Generic" and "Expert" equation are given for the same species at the same site, and at the same DBH?

Here is one example:

Row 236 of the user's data has a single tree of dbh = 143 that matches two equations in allodb, 7f7777 and 333c34. Right now, the code calculates the biomass for each row independently and then sums them together to produce the single biomass result for rowid 236. This is appropriate when the two equations are for different parts of the same tree. But that is not the case here; the resulting biomass (which sums biomass by rowid) will overestimate the real biomass.

The temporary approach I suggest is to forget about generic equations until we can handle them correctly. Here, the code would on the fly drop the row where is_generic is TRUE, and we are left with only the Expert equations.

What do you think?

Full reprex with more examples

library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)
library(fgeo.biomass)
set.seed(1)

census <- fgeo.biomass::scbi_tree1 %>% dplyr::sample_n(1000)
species <- fgeo.biomass::scbi_species
census_species <- census %>% add_species(species, site = "scbi")
#> Adding `site`.
#> Overwriting `sp`; it now stores Latin species names.
#> Adding `rowid`.
bad <- allo_find(census_species)
#> Assuming `dbh` in [mm] (required to find dbh-specific equations).
#> * Searching equations according to site and species.
#> Warning: Can't find equations matching these species:
#> carya sp, quercus prinus, ulmus sp, unidentified unk
#> * Refining equations according to dbh.
#> Warning: Can't find equations for 664 rows (inserting `NA`).

bad %>% 
  select(
    rowid, equation_id, 
    site, 
    sp, 
    dbh, 
    matches("dbh.*mm$"), 
    is_generic,
    anatomic_relevance
  ) %>% 
  add_count(rowid) %>% 
  filter(n > 1 & rowid %in% c("236", "811", "336")) %>% 
  select(-n)
#> # A tibble: 6 x 9
#>   rowid equation_id site  sp      dbh dbh_min_mm dbh_max_mm is_generic
#>   <int> <chr>       <chr> <chr> <dbl>      <dbl>      <dbl> <lgl>     
#> 1   236 7f7777      scbi  robi~ 143.        40        420   TRUE      
#> 2   236 333c34      scbi  robi~ 143.       142.       259.  FALSE     
#> 3   336 f08fff      scbi  prun~  37.6       30        640   TRUE      
#> 4   336 8aecca      scbi  prun~  37.6        3.7       68.3 FALSE     
#> 5   811 f08fff      scbi  sass~  37.3       30        640   TRUE      
#> 6   811 2c092b      scbi  sass~  37.3        4         84.9 FALSE     
#> # ... with 1 more variable: anatomic_relevance <chr>

Created on 2019-03-21 by the reprex package (v0.2.1)

teixeirak commented 5 years ago

This is a case, and there will be others like it, where there are two equations for different size classes that overlap in dbh range. This relates to issue #17, and ultimately I'd prefer to use the approach described there (i.e., switch equations at the point where they cross). Until that is done, please give precedence to the expert-selected equation.
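
Until the crossover approach exists, the interim precedence rule could look roughly like the base R sketch below. The first six rows copy `rowid`, `equation_id` and `is_generic` from the reprex output earlier in this thread; rowid 999 is a made-up tree that only a generic equation matches, added to show that such rows are kept.

```r
# Candidate matches from the reprex, plus one hypothetical generic-only tree.
candidates <- data.frame(
  rowid = c(236, 236, 336, 336, 811, 811, 999),
  equation_id = c(
    "7f7777", "333c34", "f08fff", "8aecca", "f08fff", "2c092b", "f08fff"
  ),
  is_generic = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Which trees have at least one non-generic candidate?
rowids_with_expert <- unique(candidates$rowid[!candidates$is_generic])
has_expert <- candidates$rowid %in% rowids_with_expert

# Drop a generic row only when the same tree also has a non-generic one.
preferred <- candidates[!(candidates$is_generic & has_expert), ]
```

Unlike excluding every generic row, this keeps the generic equation for trees that have no other match.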

teixeirak commented 5 years ago

Note that, in the long run, it won't be a stable solution to sum equations under the assumption that they describe different biomass components. Rather, we will need to create equations describing how the dependent_variable_biomass_components relate to one another and use that as the basis for summing.

maurolepore commented 5 years ago

... in the long run ... we will need to create equations describing how the dependent_variable_biomass_components relate to one another and use that as the basis for summing.

Sorry, I don't understand this, but you say it's not urgent so I'll let it sit for now.

teixeirak commented 5 years ago

I've made an issue (#82) to remind us of this later.

maurolepore commented 5 years ago

RE your https://github.com/forestgeo/allodb/issues/72#issuecomment-475373890

Awesome! Thanks for taking the time to develop an example. It's a great reminder of the basic logic you expect.

It should be clear by now, but I highlight that your comment (https://github.com/forestgeo/allodb/issues/72#issuecomment-475373890) describes decisions about different trees, whereas my example (https://github.com/forestgeo/allodb/issues/72#issuecomment-475378972) describes decisions about a single tree.

maurolepore commented 5 years ago

Following https://github.com/forestgeo/allodb/issues/73#issuecomment-476686185, here are the newly added generic equations, and some comments and questions.

a. (https://github.com/forestgeo/fgeo.biomass/issues/31) site = "any temperate NA" will be converted on the fly to <the current site>, e.g. if the data comes from SCBI, all values of site will be "scbi". This allows matching the equations by site.

b. Rows 3-10 require no action. They will be handled correctly once (https://github.com/forestgeo/fgeo.biomass/issues/31) is implemented. That is, the code already knows how to find equations for each row in a census dataset by matching allodb tables by site (see a.) and species.

c. Row 11 is also no problem. That equation will be used for every row of the census data containing "Abies sp." in the census dataset (i.e. when the user's ForestGEO census table has a code in sp that points to Abies sp. in the ForestGEO species table).

d. Rows 1-2 are problematic. How will the code decide which of the two to use? @teixeirak, is your idea to match by family? i.e. if a tree belongs to Fabaceae match 1, and if it belongs to Juglandaceae match 2?

library(allodb)
library(tidyverse)

sitespecies %>% 
  filter(str_detect(site, "any")) %>% 
  select(site, equation_group, family, species, equation_id)
#> # A tibble: 11 x 5
#>    site             equation_group family       species         equation_id
#>    <chr>            <chr>          <chr>        <chr>           <chr>      
#>  1 any temperate NA Generic        Fabaceae     <NA>            7f7777     
#>  2 any temperate NA Generic        Juglandaceae <NA>            1c1ac8     
#>  3 any temperate NA Generic        Pinaceae     Abies balsamea  4872ed     
#>  4 any temperate NA Generic        Pinaceae     Abies fraseri   4872ed     
#>  5 any temperate NA Generic        Pinaceae     Abies lasiocar~ 4872ed     
#>  6 any temperate NA Generic        Pinaceae     Abies amabilis  74dd65     
#>  7 any temperate NA Generic        Pinaceae     Abies concolor  74dd65     
#>  8 any temperate NA Generic        Pinaceae     Abies grandis   74dd65     
#>  9 any temperate NA Generic        Pinaceae     Abies magnifica 74dd65     
#> 10 any temperate NA Generic        Pinaceae     Abies procera   74dd65     
#> 11 any temperate NA Generic        Pinaceae     Abies sp.       74dd65

Created on 2019-03-26 by the reprex package (v0.2.1)

teixeirak commented 5 years ago

Sorry for the delayed response. Somehow this landed in my junk folder.

Regarding (d), yes, the idea is to match by family. But I suppose you need a list of the genera in each family (filled in as in row 11)?

maurolepore commented 5 years ago

Thanks! Yes, you are right. The code can't know what species belong to which Family. If this means too much duplicated data, you may pull Family out of sitespecies into a table with columns species and Family, which I can join on the fly with sitespecies via the key column species.

library(tidyverse)

family <- tribble(
  ~family, ~species, 
  "Aea",    "A a",
  "Aea",    "A b",
  "Aea",    "A c",
)

sitespecies <- tribble(
  ~species, ~more_columns,
  "A a",    "whatever", 
  "A c",    "whatever", 
)

left_join(sitespecies, family)
#> Joining, by = "species"
#> # A tibble: 2 x 3
#>   species more_columns family
#>   <chr>   <chr>        <chr> 
#> 1 A a     whatever     Aea   
#> 2 A c     whatever     Aea

Created on 2019-03-27 by the reprex package (v0.2.1)

teixeirak commented 5 years ago

I'll leave this up to @gonzalezeb. There will be multiple instances with these generic equations where we'll need to list the genera in a family. I bet there's some existing resource that we could source.

maurolepore commented 5 years ago

Oh, sure, you make me realize that I can find all species in a family from the species table, which the code requires anyway to match codes with species names. You can disregard my previous message.

fgeo.biomass::scbi_species
#> # A tibble: 73 x 10
#>    sp    Latin Genus Species Family SpeciesID Authority IDLevel syn   subsp
#>    <chr> <chr> <chr> <chr>   <chr>      <int> <chr>     <chr>   <lgl> <lgl>
#>  1 acne  Acer~ Acer  negundo Sapin~         1 ""        species NA    NA   
#>  2 acpl  Acer~ Acer  platan~ Sapin~         2 ""        species NA    NA   
#>  3 acru  Acer~ Acer  rubrum  Sapin~         3 ""        species NA    NA   
#>  4 acsp  Acer~ Acer  sp      Sapin~         4 ""        Multip~ NA    NA   
#>  5 aial  Aila~ Aila~ altiss~ Simar~         5 ""        species NA    NA   
#>  6 amar  Amel~ Amel~ arborea Rosac~         6 ""        species NA    NA   
#>  7 astr  Asim~ Asim~ triloba Annon~         7 ""        species NA    NA   
#>  8 beth  Berb~ Berb~ thunbe~ Berbe~         8 ""        species NA    NA   
#>  9 caca  Carp~ Carp~ caroli~ Betul~         9 ""        species NA    NA   
#> 10 caco  Cary~ Carya cordif~ Jugla~        10 ""        species NA    NA   
#> # ... with 63 more rows

Created on 2019-03-27 by the reprex package (v0.2.1)

maurolepore commented 5 years ago

@teixeirak, is NA the most appropriate value for rows 1-2 of the species column? I'm afraid it's too unspecific. The process of cleaning data often involves removing missing values and duplicates. Cases where you mean "any species in family X" might be confused with other more general cases.

library(tidyverse)
library(allodb)

sitespecies %>% 
  filter(str_detect(site, "any")) %>% 
  select(site, family, species, equation_id)
#> # A tibble: 11 x 4
#>    site             family       species          equation_id
#>    <chr>            <chr>        <chr>            <chr>      
#>  1 any temperate NA Fabaceae     <NA>             7f7777     
#>  2 any temperate NA Juglandaceae <NA>             1c1ac8     
#>  3 any temperate NA Pinaceae     Abies balsamea   4872ed     
#>  4 any temperate NA Pinaceae     Abies fraseri    4872ed     
#>  5 any temperate NA Pinaceae     Abies lasiocarpa 4872ed     
#>  6 any temperate NA Pinaceae     Abies amabilis   74dd65     
#>  7 any temperate NA Pinaceae     Abies concolor   74dd65     
#>  8 any temperate NA Pinaceae     Abies grandis    74dd65     
#>  9 any temperate NA Pinaceae     Abies magnifica  74dd65     
#> 10 any temperate NA Pinaceae     Abies procera    74dd65     
#> 11 any temperate NA Pinaceae     Abies sp.        74dd65

Created on 2019-03-27 by the reprex package (v0.2.1)

teixeirak commented 5 years ago

Assuming I understand your question right, we could replace "NA" with "any".
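
A small base R sketch of the concern and of the proposed fix, using three rows in the shape of the table above. The object names are made up for illustration.

```r
# Three rows in the shape of the table above.
rows <- data.frame(
  family = c("Fabaceae", "Juglandaceae", "Pinaceae"),
  species = c(NA, NA, "Abies sp."),
  equation_id = c("7f7777", "1c1ac8", "74dd65"),
  stringsAsFactors = FALSE
)

# A routine cleaning step silently drops the family-level rows.
cleaned <- rows[!is.na(rows$species), ]

# An explicit label survives the same step.
labelled <- rows
labelled$species[is.na(labelled$species)] <- "any"
cleaned_labelled <- labelled[!is.na(labelled$species), ]
```

With `NA`, only the Abies row is left after cleaning; with "any", all three rows remain and the family-level intent stays visible.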

gonzalezeb commented 5 years ago

Wait. I could get a little bit more specific in the 'species' column for some generic equations. I will need to look for more details in another table in the Chojnacky publication.

teixeirak commented 5 years ago

That's what I did for Abies (to differentiate by wood gravity). I don't think it matters for Fabaceae/Juglandaceae, but I was a bit confused by their categories on that one and would appreciate your review.

maurolepore commented 5 years ago

I just closed forestgeo/fgeo.biomass#31 (Convert site = any temperate NA to <current site>). This means that from https://github.com/forestgeo/allodb/issues/72#issuecomment-476770717 the only case that is still on the TODO list is that of rows 1-2 (and you are helping with that, so great!).

From https://github.com/forestgeo/allodb/issues/72#issuecomment-476770717:

teixeirak commented 5 years ago

Note that we may run into some other issues as we continue entering Chojnacky equations. In particular, I'm not sure how to deal with the "woodland" category (that will make sense to @gonzalezeb, not @maurolepore).

gonzalezeb commented 5 years ago

Chojnacky 2014 noted that the woodland equations seemed to predict low biomass values, based likely on errors in the old data they used. As long as we specify this in our publication we can include them. These equations use drc (diameter at root collar) so I will include the diameter within the allodb equation.

gonzalezeb commented 3 years ago

Closing because this issue was solved by the new weighting system.

